Senior Site Reliability Engineer

Site Reliability Engineering, Dynatrace, GitHub, Splunk, Kubernetes, DevOps, Cloud, CI/CD

We are seeking a highly skilled Senior Site Reliability Engineer to join our remote team and contribute to the continuous improvement of our DevOps and SRE practices. As a Senior SRE, you will drive discussions on the SRE team's technology roadmap. You will be instrumental in designing, developing, and managing monitoring, alerting, operability, and observability for applications using Dynatrace, Splunk, and Grafana, and in ensuring that application teams meet performance and availability SLAs.

Responsibilities

Help implement DevOps and SRE practices and drive discussions on the SRE team's technology roadmap
Identify, craft, and maintain SLIs and SLOs for teams, as well as metrics such as MTTR, lead time for changes, deployment frequency, and change failure rate
Design, develop, and manage monitoring, alerting, operability, and observability for applications using Dynatrace, Splunk, and Grafana
Ensure that application teams meet performance and availability SLAs
Partner with product owners to manage error budgets, prioritize the toil backlog, and validate both against team, application, and incident metrics
Be part of an on-call rotation for production events or outages
Continuously improve the continuous integration and continuous deployment (CI/CD) pipeline
Troubleshoot issues, manage incidents, and perform root cause analysis
Encourage and build automated processes wherever possible
Implement cybersecurity measures by continuously performing vulnerability assessments and risk management
Manage periodic progress reporting to management and the customer
Work in partnership with application teams to ease their adoption of the platform
Coordinate and communicate within the team and with customers
Analyze the systems currently in use and develop plans for enhancements

Requirements

At least 3 years of experience in Site Reliability Engineering, with demonstrated expertise in DevOps and SRE practices
Experience with Dynatrace, GitHub, Splunk, Kubernetes, and Cloud technologies, enabling you to design and manage monitoring, alerting, operability, and observability for applications
Strong knowledge of CI/CD pipelines and DevOps methodologies, showcasing your ability to drive continuous improvement for the team
Strong analytical and problem-solving skills, allowing you to troubleshoot issues and perform root cause analysis
Excellent communication and interpersonal skills, enabling you to collaborate effectively with cross-functional teams and customers
Experience with cybersecurity measures, including vulnerability assessment and risk management
Spoken and written English at an Upper-Intermediate level or higher

Nice to have

Experience with other DevOps and SRE tools, such as Jenkins and Grafana

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn