Senior Site Reliability Engineer

Site Reliability Engineering

Argentina

Join our team as a Senior Site Reliability Engineer focused on delivering advanced support for critical Azure-based systems.

You will troubleshoot complex cloud environments, enhance observability, and implement reliability solutions using Kubernetes, monitoring tools, and Infrastructure-as-Code. If you are passionate about cloud reliability and enjoy collaborating across teams, apply now to contribute to our cutting-edge projects.

Responsibilities

Troubleshoot and resolve complex incidents to maintain system uptime
Ensure reliability and performance of Azure-based enterprise infrastructure
Implement observability, monitoring, and logging solutions
Automate infrastructure provisioning and deployment using Terraform and scripting
Optimize system performance and uptime through proactive monitoring and alerting
Collaborate with cross-functional teams to improve service reliability
Conduct root cause analysis and postmortems for incident management
Manage deployment pipelines in Azure DevOps for secure and scalable workflows
Develop and maintain automation scripts for routine tasks and incident recovery
Enhance monitoring frameworks with tools like Prometheus and Grafana
React quickly to incidents to avoid SLA degradation
Integrate monitoring data from Azure and AWS environments
Support continuous improvement of service reliability and observability practices
Document technical processes and incident reports
Participate in Agile team activities and prioritize competing tasks

Requirements

Minimum 3 years of experience in site reliability engineering or related DevOps roles
Hands-on experience with Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
Strong expertise in Azure DevOps and Terraform for infrastructure automation
Proficiency in scripting with Bash, PowerShell, and Python
Experience with monitoring and observability tools such as Prometheus and Grafana
Solid background in incident management and ITSM processes, including root cause analysis
Ability to troubleshoot and debug complex technical issues in real time
Experience working in fast-paced Agile environments
Strong verbal and written communication skills for collaboration and reporting
Proactive approach to setting alerts and preventing SLA degradation
Experience with cloud infrastructure scaling and security best practices
Knowledge of Kubernetes administration and orchestration
Ability to collaborate effectively with cross-functional teams
English language proficiency at B2 level or above

Nice to have

Hands-on experience with AWS services including EKS, RDS, CloudWatch, and X-Ray
Familiarity with distributed logging pipelines and incident automation tools
Knowledge of advanced Kubernetes use cases for scaling and network configurations
Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
Experience with observability tools like OpenSearch for AWS workloads

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn