Lead Site Reliability Engineer

Site Reliability Engineering

Location: Argentina (conditions and benefits are location-specific)

Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.

You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.

Responsibilities

Resolve complex incidents to ensure system availability
Maintain reliability and performance of Azure-based enterprise infrastructure
Deploy observability, monitoring, and logging tools
Automate infrastructure management with Terraform and scripting technologies
Improve system performance and uptime through centralized monitoring
Collaborate with multiple teams to enhance service reliability
Perform root cause analysis and oversee postmortems for incidents
Configure deployment pipelines in Azure DevOps for secure workflows
Write and maintain automation scripts for incident recovery and recurring tasks
Enhance monitoring frameworks with platforms like Prometheus and Grafana
Respond promptly to incidents to meet SLA expectations
Integrate monitoring data from Azure and AWS environments
Continuously improve service reliability and observability practices
Document processes and incident resolutions thoroughly
Take part in Agile team events and balance task priorities

Requirements

At least 5 years of experience in site reliability engineering or comparable DevOps roles
1+ years of demonstrated leadership experience
Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
Expertise in infrastructure automation using Azure DevOps and Terraform
Proficiency in scripting languages such as Bash, PowerShell, and Python
Skills in monitoring tools including Prometheus and Grafana
Background in incident management and ITSM processes with analytical capability for root cause investigations
Competency in resolving technical challenges promptly in high-pressure situations
Experience in Agile workflows and fast-paced operational environments
Ability to communicate effectively, both in writing and verbally, for teamwork and documentation
Ability to configure proactive alerts that help prevent SLA breaches
Understanding of cloud scaling techniques and security best practices
Knowledge of Kubernetes administration for orchestration tasks
Ability to collaborate smoothly with cross-functional teams
English proficiency of B2 or higher

Nice to have

Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
Familiarity with distributed logging systems and tools for incident automation
Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
Understanding of Kubernetes configurations for scaling and advanced networking setups
Proficiency in observability tools such as OpenSearch for AWS environments

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn