Middle Site Reliability Engineer
Remote in Colombia
Site Reliability Engineering

Sorry, this position is no longer available
We are looking for a highly skilled Middle Site Reliability Engineer to join our remote team and work on an exciting project with leading-edge technologies and tools.
As a Middle Site Reliability Engineer, you will deploy resources as infrastructure as code (IaC) with Terraform and create new observability and monitoring capabilities. You will also program application monitoring and alerting, prepare and perform stress tests, and automate manual processes in the CI/CD pipelines.
Responsibilities
- Write, modify, and troubleshoot Terraform (IaC) modules to deploy resources
- Work on Stories assigned in Azure DevOps as per the agile process
- Create new observability and monitoring capabilities and visualizations
- Program application monitoring and alerting
- Prepare and perform stress tests
- Automate manual processes in the CI/CD pipelines
- Create playbooks where none exist, and automate them
- Automate alert-triggered auto-healing as defined with each program
- Collaborate with cross-functional teams to deliver high-quality software solutions in line with project goals and timelines
- Continuously evaluate industry trends and best practices to refine and implement the most effective Site Reliability strategies
Requirements
- Minimum of 2 years of experience as a Site Reliability Engineer, working on large-scale projects and complex infrastructures
- Experience with Microsoft Azure services for cloud infrastructure design, deployment, and management
- Advanced experience with Kubernetes and a good understanding of Helm
- Expertise in Azure DevOps as the main CI/CD tool, with a focus on automation and efficiency
- Experience with IaC using Terraform and ARM, ensuring efficient and scalable infrastructure management
- Understanding of Linux and scripting (Bash/PowerShell) for automation purposes
- Experience with Prometheus and Grafana to ensure optimal system performance and reliability
- Understanding of SLI/SLO concepts for effective monitoring and alerting
- Upper-intermediate English language skills, allowing for effective written and spoken communication and collaboration with the team and stakeholders
Nice to have
- Working knowledge of Golang and Angular
- Experience with Google Cloud/OpenShift
- Python scripting for automation purposes
- Knowledge of Jaeger, Kiali, and Loki for effective monitoring and observability
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn