Senior Site Reliability Engineer
Remote in Colombia
Site Reliability Engineering
We are looking for a skilled Senior Site Reliability Engineer to join our team supporting EPAM's Compute Managed Services project.
In this role, you will ensure operational stability and 24x7 monitoring across multi-cloud environments, drive automation and observability improvements, and collaborate with cross-functional teams to deliver reliable compute services. If you are passionate about maintaining high-quality cloud platforms and enjoy working in a dynamic environment, we encourage you to apply.
Responsibilities
- Perform 24x7 monitoring of compute platforms using tools such as ELK and PagerDuty
- Manage incidents and problems across servers, middleware, operating systems, and cloud platforms, including troubleshooting, root cause analysis, and resolution
- Execute repaving activities, change management, and disaster recovery procedures
- Ensure security and vulnerability compliance, including user management and certificate lifecycle oversight
- Handle service requests and configuration updates, and prepare audit-related data extracts
- Develop and maintain Standard Operating Procedures for infrastructure operations
- Collaborate with teams to implement cell-based automation and continuous service improvements
- Drive observability enhancements and automate operational processes
- Maintain compliance with security standards and best practices
- Take part in on-call duties and provide operational support that overlaps US, UK, and AU business hours as required
Requirements
- 3+ years of experience with cloud platforms, including GCP, AWS, and Azure
- Proficient in operating system administration for Windows and Linux environments
- Strong knowledge of automation tools such as Ansible and Terraform, and of scripting in Python and Bash
- Experience with observability tools like ELK Stack and Grafana
- Familiarity with incident management processes and root cause analysis
- Knowledge of security hardening, vulnerability management, and compliance requirements
- Excellent problem-solving and analytical skills
- Effective communication and collaboration skills
- Experience with disaster recovery and operational recovery processes
- Upper-intermediate English proficiency (B2)