Senior Site Reliability Engineer

Remote in Colombia
Site Reliability Engineering

We are looking for a skilled Senior Site Reliability Engineer to join our team supporting EPAM's Compute Managed Services project.

In this role, you will ensure operational stability and 24x7 monitoring across multi-cloud environments, drive automation and observability improvements, and collaborate with cross-functional teams to deliver reliable compute services. If you are passionate about maintaining high-quality cloud platforms and enjoy working in a dynamic environment, we encourage you to apply.

Responsibilities
  • Perform 24x7 monitoring of compute platforms using tools such as ELK and PagerDuty
  • Manage incidents and problems across servers, middleware, operating systems, and cloud platforms, including troubleshooting, root cause analysis, and resolution
  • Execute repaving activities, change management, and disaster recovery procedures
  • Ensure security and vulnerability compliance, including user management and certificate lifecycle oversight
  • Handle service requests and configuration updates, and prepare audit-related data extracts
  • Develop and maintain Standard Operating Procedures for infrastructure operations
  • Collaborate with teams to implement cell-based automation and continuous service improvements
  • Drive observability enhancements and automate operational processes
  • Maintain compliance with security standards and best practices
  • Support on-call duties and provide operational support overlapping US, UK, and AU business hours as required

Requirements
  • 3+ years of experience with cloud platforms, including GCP, AWS, and Azure
  • Proficiency in operating system administration for Windows and Linux environments
  • Strong knowledge of automation tools and scripting languages such as Ansible, Terraform, Python, and Bash
  • Experience with observability tools like ELK Stack and Grafana
  • Familiarity with incident management processes and root cause analysis
  • Knowledge of security hardening, vulnerability management, and compliance requirements
  • Excellent problem-solving and analytical skills
  • Effective communication and collaboration skills
  • Experience with disaster recovery and operational recovery processes
  • Upper-Intermediate English language proficiency (B2)