Senior Site Reliability Engineer

Site Reliability Engineering, Dynatrace, GitHub, Splunk, Kubernetes, DevOps, Cloud, CI/CD

We are seeking a highly skilled Senior Site Reliability Engineer to join our remote team and contribute to the continuous improvement of our DevOps and SRE practices. As a Senior SRE, you will drive discussions on the technology roadmap for the SRE team. You will be instrumental in designing, developing, and managing monitoring, alerting, operability, and observability for applications using Dynatrace, Splunk, and Grafana, and in ensuring that application teams meet performance and availability SLAs.

Responsibilities
  • Help implement DevOps & SRE practices and drive discussions on the technology roadmap for the SRE team
  • Identify, craft, and maintain SLIs and SLOs for teams, as well as metrics such as MTTR, Lead Time for Changes, Deployment Frequency, and Change Failure Rate
  • Design, develop and manage monitoring, alerting, operability, and observability for applications using Dynatrace, Splunk & Grafana
  • Ensure application teams meet performance and availability SLAs
  • Partner with product owners to manage error budgets, prioritize the toil backlog, and validate priorities against team, application, and incident metrics
  • Be part of an on-call rotation for production events or outages
  • Drive continuous improvement of the continuous integration & continuous deployment (CI/CD) pipeline
  • Apply troubleshooting techniques, incident management, and root cause analysis
  • Encourage and build automated processes wherever possible
  • Implement cybersecurity measures by continuously performing vulnerability assessment and risk management
  • Manage periodic progress reporting to management and the customer
  • Work in partnership with application teams to ease their adoption of the platform
  • Coordinate and communicate within the team and with customers
  • Analyze the system currently in use and develop plans for enhancements and improvements
Requirements
  • Minimum of 3 years of experience in Site Reliability Engineering, demonstrating your expertise in DevOps and SRE practices
  • Experience with Dynatrace, GitHub, Splunk, Kubernetes, and Cloud technologies, enabling you to design and manage monitoring, alerting, operability, and observability for applications
  • Strong knowledge of CI/CD pipelines and DevOps methodologies, showcasing your ability to drive continuous improvement for the team
  • Strong analytical and problem-solving skills, allowing you to troubleshoot issues and perform root cause analysis
  • Excellent communication and interpersonal skills, enabling you to collaborate effectively with cross-functional teams and customers
  • Experience in cybersecurity measures, including vulnerability assessment and risk management
  • Spoken and written English at an Upper-Intermediate level or higher
Nice to have
  • Experience with other DevOps and SRE tools, such as Jenkins and Grafana
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn