Lead Site Reliability Engineer

Remote in Argentina, Mexico
Site Reliability Engineering

Join our team as a Lead Site Reliability Engineer dedicated to providing advanced support for critical Azure-based systems.

You will address complex cloud challenges, enhance system observability, and strengthen reliability using Kubernetes, monitoring platforms, and Infrastructure-as-Code. If cloud reliability excites you and collaboration across teams inspires you, apply now to contribute to our innovative projects.

Responsibilities
  • Resolve complex incidents to ensure system availability
  • Maintain reliability and performance of Azure-based enterprise infrastructure
  • Deploy observability, monitoring, and logging tools
  • Automate infrastructure management with Terraform and scripting technologies
  • Improve system performance and uptime through centralized monitoring
  • Collaborate with multiple teams to enhance service reliability
  • Perform root cause analysis and oversee postmortems for incidents
  • Configure deployment pipelines in Azure DevOps for secure workflows
  • Write and maintain automation scripts for incident recovery and recurring tasks
  • Enhance monitoring frameworks with platforms like Prometheus and Grafana
  • Respond promptly to incidents to meet SLA expectations
  • Facilitate integration of monitoring data from Azure and AWS environments
  • Continuously improve service reliability and observability practices
  • Document processes and incident resolutions thoroughly
  • Take part in Agile team events and balance task priorities
Requirements
  • At least 5 years of experience in site reliability engineering or comparable DevOps roles
  • 1+ years of demonstrated leadership experience
  • Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
  • Expertise in infrastructure automation using Azure DevOps and Terraform
  • Proficiency in scripting languages such as Bash, PowerShell, and Python
  • Skills in monitoring tools including Prometheus and Grafana
  • Background in incident management and ITSM processes, with the analytical skills to investigate root causes
  • Competency in resolving technical challenges promptly in high-pressure situations
  • Experience in Agile workflows and fast-paced operational environments
  • Ability to communicate effectively, both in writing and verbally, for teamwork and documentation
  • Ability to configure proactive alerts that help prevent SLA breaches
  • Understanding of cloud scaling techniques and security best practices
  • Knowledge of Kubernetes administration for orchestration tasks
  • Ability to collaborate smoothly with cross-functional teams
  • English proficiency of B2 or higher
Nice to have
  • Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
  • Familiarity with distributed logging systems and tools for incident automation
  • Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
  • Understanding of Kubernetes configurations for scaling and advanced networking setups
  • Proficiency in observability tools such as OpenSearch for AWS environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn