Lead Site Reliability Engineer
Argentina
Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.
You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.
Responsibilities
- Resolve complex incidents to ensure system availability
- Maintain reliability and performance of Azure-based enterprise infrastructure
- Deploy observability, monitoring, and logging tools
- Automate infrastructure management with Terraform and scripting technologies
- Improve system performance and uptime through centralized monitoring
- Collaborate with multiple teams to enhance service reliability
- Perform root cause analysis and oversee postmortems for incidents
- Configure secure deployment pipelines in Azure DevOps
- Write and maintain automation scripts for incident recovery and recurring tasks
- Enhance monitoring frameworks with platforms like Prometheus and Grafana
- Respond promptly to incidents to meet SLA expectations
- Integrate monitoring data from Azure and AWS environments
- Continuously improve service reliability and observability practices
- Document processes and incident resolutions thoroughly
- Take part in Agile team events and balance task priorities
Requirements
- At least 5 years of experience in site reliability engineering or comparable DevOps roles
- 1+ years of demonstrated leadership experience
- Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Expertise in infrastructure automation using Azure DevOps and Terraform
- Proficiency in scripting languages such as Bash, PowerShell, and Python
- Skills in monitoring tools including Prometheus and Grafana
- Background in incident management and ITSM processes, with the analytical skills to investigate root causes
- Ability to resolve technical issues quickly under pressure
- Experience in Agile workflows and fast-paced operational environments
- Ability to communicate clearly, in writing and verbally, for teamwork and documentation
- Ability to configure proactive alerts that help prevent SLA breaches
- Understanding of cloud scaling techniques and security best practices
- Knowledge of Kubernetes administration for orchestration tasks
- Ability to collaborate effectively with cross-functional teams
- English proficiency of B2 or higher
Nice to have
- Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging systems and tools for incident automation
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Understanding of Kubernetes configurations for scaling and advanced networking setups
- Proficiency in observability tools such as OpenSearch for AWS environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn