Senior Site Reliability Engineer

Site Reliability Engineering

Sorry, this position is no longer available

Colombia

We are looking for a Senior Site Reliability Engineer to join our remote team and work on projects that use state-of-the-art technologies.

As a Senior SRE, you will create, modify, and troubleshoot modules for resource deployment using Terraform and Infrastructure as Code practices. You will work on Stories assigned in Azure DevOps within agile workflows, develop new observability and monitoring features, and script application monitoring and alerting. You will also run stress tests, automate manual processes in CI/CD pipelines, and create and automate Playbooks and Alerts to enable Auto Healing.

Responsibilities

Create, modify, and troubleshoot modules for resource deployment using Terraform and Infrastructure as Code principles
Work on assigned Stories in Azure DevOps following agile methodologies
Develop new observability and monitoring features and visualizations
Script application monitoring and set up alerting mechanisms
Plan and run stress tests
Automate manual procedures within CI/CD pipelines
Create and automate Playbooks where none exist
Automate Alerts to enable Auto Healing as specified by each program
Collaborate with cross-functional teams to deliver high-quality software solutions aligned with project objectives and timelines
Build and maintain infrastructure using Infrastructure as Code principles and tools
Provide guidance and mentorship to junior team members, cultivating a culture of growth and continuous learning within the team

Requirements

At least 3 years of experience in Site Reliability Engineering, managing complex cloud and microservices environments
Proficiency with Azure DevOps as the primary CI/CD tool
Proficiency in Kubernetes, coupled with a sound grasp of Helm, Istio, and Google Cloud Platform
Competence in Terraform, ARM, and Infrastructure as Code principles for streamlined and scalable infrastructure administration
Solid understanding of Linux and proficiency in scripting languages such as Bash and PowerShell
Familiarity with Observability toolsets like Prometheus and Grafana, complemented by a grasp of SLI/SLO concepts
Substantial experience in at least one programming language, either Go or Python
Exceptional problem-solving and analytical abilities, facilitating effective decision-making in intricate settings
English proficiency at Upper-Intermediate level or higher for clear communication and collaboration with the team and stakeholders

Nice to have

Working familiarity with Golang and Angular for efficient application development
Hands-on experience with Google Cloud/OpenShift for cloud infrastructure design, deployment, and management
Knowledge of Jaeger, Kiali, and Loki for efficient observability and monitoring

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
