Senior Site Reliability Engineer

Remote in Argentina, Mexico
Site Reliability Engineering

Join our team as a Senior Site Reliability Engineer focused on delivering advanced support for critical Azure-based systems.

You will troubleshoot complex cloud environments, enhance observability, and implement reliability solutions using Kubernetes, monitoring tools, and Infrastructure-as-Code. If you are passionate about cloud reliability and enjoy collaborating across teams, apply now to contribute to our cutting-edge projects.

Responsibilities
  • Troubleshoot and resolve complex incidents to maintain system uptime
  • Ensure reliability and performance of Azure-based enterprise infrastructure
  • Implement observability, monitoring, and logging solutions
  • Automate infrastructure provisioning and deployment using Terraform and scripting
  • Optimize system performance and uptime through proactive monitoring and alerting
  • Collaborate with cross-functional teams to improve service reliability
  • Conduct root cause analysis and postmortems for incident management
  • Manage deployment pipelines in Azure DevOps for secure and scalable workflows
  • Develop and maintain automation scripts for routine tasks and incident recovery
  • Enhance monitoring frameworks with tools like Prometheus and Grafana
  • React quickly to incidents to avoid SLA degradation
  • Integrate monitoring data from Azure and AWS environments
  • Support continuous improvement of service reliability and observability practices
  • Document technical processes and incident reports
  • Participate in Agile team activities and prioritize competing tasks
Requirements
  • Minimum of 3 years of experience in site reliability engineering or related DevOps roles
  • Hands-on experience with Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
  • Strong expertise in Azure DevOps and Terraform for infrastructure automation
  • Proficiency in scripting with Bash, PowerShell, and Python
  • Experience with monitoring and observability tools such as Prometheus and Grafana
  • Solid background in incident management and ITSM processes, including root cause analysis
  • Ability to troubleshoot and debug complex technical issues in real time
  • Experience working in fast-paced Agile environments
  • Strong verbal and written communication skills for collaboration and reporting
  • Proactive approach to setting alerts and preventing SLA degradation
  • Experience with cloud infrastructure scaling and security best practices
  • Knowledge of Kubernetes administration and orchestration
  • Ability to collaborate effectively with cross-functional teams
  • English language proficiency at B2 level or above
Nice to have
  • Hands-on experience with AWS services including EKS, RDS, CloudWatch, and X-Ray
  • Familiarity with distributed logging pipelines and incident automation tools
  • Knowledge of advanced Kubernetes use cases for scaling and network configurations
  • Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
  • Experience with observability tools like OpenSearch for AWS workloads
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling, and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek, and LinkedIn