Senior Site Reliability Engineer
Argentina
Join our team as a Senior Site Reliability Engineer focused on delivering advanced support for critical Azure-based systems.
You will troubleshoot complex cloud environments, enhance observability, and implement reliability solutions using Kubernetes, monitoring tools, and Infrastructure-as-Code. If you are passionate about cloud reliability and enjoy collaborating across teams, apply now to contribute to our cutting-edge projects.
Responsibilities
- Troubleshoot and resolve complex incidents to maintain system uptime
- Ensure reliability and performance of Azure-based enterprise infrastructure
- Implement observability, monitoring, and logging solutions
- Automate infrastructure provisioning and deployment using Terraform and scripting
- Optimize system performance and uptime through proactive monitoring and alerting
- Collaborate with cross-functional teams to improve service reliability
- Conduct root cause analysis and write postmortems after incidents
- Manage deployment pipelines in Azure DevOps for secure and scalable workflows
- Develop and maintain automation scripts for routine tasks and incident recovery
- Enhance monitoring frameworks with tools like Prometheus and Grafana
- React quickly to incidents to avoid SLA degradation
- Integrate monitoring data from Azure and AWS environments
- Support continuous improvement of service reliability and observability practices
- Document technical processes and incident reports
- Participate in Agile team activities and prioritize competing tasks
Requirements
- Minimum 3 years of experience in site reliability engineering or related DevOps roles
- Hands-on experience with Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Strong expertise in Azure DevOps and Terraform for infrastructure automation
- Proficiency in scripting with Bash, PowerShell, and Python
- Experience with monitoring and observability tools such as Prometheus and Grafana
- Solid background in incident management and ITSM processes, including root cause analysis
- Ability to troubleshoot and debug complex technical issues in real time
- Experience working in fast-paced Agile environments
- Strong verbal and written communication skills for collaboration and reporting
- Proactive approach to setting alerts and preventing SLA degradation
- Experience with cloud infrastructure scaling and security best practices
- Knowledge of Kubernetes administration and orchestration
- Ability to collaborate effectively with cross-functional teams
- English language proficiency at B2 level or above
Nice to have
- Hands-on experience with AWS services including EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging pipelines and incident automation tools
- Knowledge of advanced Kubernetes use cases for scaling and network configurations
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Experience with observability tools like OpenSearch for AWS workloads
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling, and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek, and LinkedIn