Middle Site Reliability Engineer

Site Reliability Engineering

Sorry, this position is no longer available


Colombia

We are looking for a highly skilled Middle Site Reliability Engineer to join our remote team and work on an exciting project with leading-edge technologies and tools.

As a Middle Site Reliability Engineer, you will be responsible for deploying resources using Terraform and Infrastructure as Code (IaC) practices, and for creating new observability and monitoring capabilities. You will also program application monitoring and alerting, prepare and perform stress tests, and automate manual processes in the CI/CD pipelines.

Responsibilities

Write, modify, and troubleshoot IaC modules that deploy resources using Terraform
Work on Stories assigned in Azure DevOps as per the agile process
Create new observability and monitoring capabilities and visualizations
Program application monitoring and alerting
Prepare and perform stress tests
Automate manual processes in the CI/CD pipelines
Create Playbooks if they don't exist, and automate them
Automate alert-driven auto-healing as defined with each Program
Collaborate with cross-functional teams to deliver high-quality software solutions in line with project goals and timelines
Continuously evaluate industry trends and best practices to refine and implement the most effective Site Reliability strategies

Requirements

Minimum of 2 years of experience as a Site Reliability Engineer, working on large-scale projects and complex infrastructures
Experience with Microsoft Azure services for cloud infrastructure design, deployment, and management
Advanced experience with Kubernetes and a good understanding of Helm
Expertise in Azure DevOps as the main CI/CD tool, with a focus on automation and efficiency
Experience with IaC tools such as Terraform and ARM templates, ensuring efficient and scalable infrastructure management
Understanding of Linux and scripting (bash/PowerShell) for automation purposes
Experience with Prometheus and Grafana to ensure optimal system performance and reliability
Understanding of SLI/SLO concepts for effective monitoring and alerting
Upper-intermediate English language skills, allowing for effective written and spoken communication and collaboration with the team and stakeholders

Nice to have

Working knowledge of Golang and Angular
Experience with Google Cloud/OpenShift
Python scripting for automation purposes
Knowledge of Jaeger, Kiali, and Loki for effective monitoring and observability

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
