Middle Site Reliability Engineer

Site Reliability Engineering

Sorry, this position is no longer available

Colombia

We're on the lookout for an exceptionally skilled Middle Site Reliability Engineer to become an integral part of our remote team, contributing to a captivating project that leverages cutting-edge technologies and tools.

As a Middle Site Reliability Engineer, you will deploy resources with Terraform following Infrastructure as Code (IaC) practices, build new observability and monitoring features, code application monitoring and alerting systems, run stress tests, and automate manual procedures within the CI/CD pipelines.

Responsibilities

Write, adapt, and troubleshoot modules for deploying resources with Terraform and IaC
Tackle assigned Stories in Azure DevOps following agile methodologies
Build new observability features and clear visualizations
Code application monitoring and alerting mechanisms
Execute stress tests and ensure system resilience
Streamline manual processes within CI/CD pipelines
Create playbooks where none exist and automate them
Implement automated alerts for Auto Healing, as per each Program's specifications
Engage collaboratively with cross-functional teams to deliver top-tier software solutions aligned with project objectives and deadlines
Continuously assess industry trends and best practices to enhance and implement potent Site Reliability strategies

Requirements

A minimum of 2 years of hands-on experience as a Site Reliability Engineer, specializing in extensive projects and intricate infrastructures
Proficiency in Microsoft Azure and its services, demonstrating expertise in cloud infrastructure design, deployment, and management
Advanced proficiency with Kubernetes and a solid grasp of Helm
Expertise in Azure DevOps as the primary CI/CD tool, emphasizing automation and efficiency
Competence with Terraform, ARM, and IaC, ensuring streamlined and scalable infrastructure management
Familiarity with Linux and scripting languages (bash/PowerShell) for automation purposes
Familiarity with Prometheus and Grafana for monitoring system performance and reliability
Understanding of SLI/SLO concepts for efficient monitoring and alerting
Upper-intermediate proficiency in English, facilitating effective written and verbal communication and collaboration with the team and stakeholders

Nice to have

Working knowledge of Golang and Angular
Experience with Google Cloud/OpenShift
Proficiency in the Python scripting language for automation purposes
Knowledge of Jaeger, Kiali, and Loki for effective monitoring and observability

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
