
Senior Site Reliability Engineer

Site Reliability Engineering, Amazon Web Services, Bare metal, Grafana, Linux, Prometheus, UNIX shell scripting, Docker, Kubernetes, Python, Terraform, DevOps
Sorry, this position is no longer available

We are looking for a Senior Site Reliability Engineer to join our remote team.

In this position, you will be responsible for the day-to-day operations of our massively scalable, highly available backend platform. You will also run the hybrid infrastructure, automate routine operational tasks, and keep our production services running smoothly.

Responsibilities
  • Ensure the smooth functioning of our production services, including day-to-day operations of the backend platform
  • Run and maintain the hybrid infrastructure, automating routine operational tasks
  • Monitor production services using Prometheus/InfluxDB, ELK, Grafana, and OpsGenie/PagerDuty
  • Troubleshoot software/hardware issues and investigate in depth why servers are underperforming
  • Manage production incidents and work with stakeholders to resolve issues and minimize their impact
  • Create and maintain comprehensive documentation of all operational procedures and processes
  • Collaborate with development teams to design scalable and reliable systems
  • Implement best practices for security and compliance
Requirements
  • Minimum of 3 years of experience in Linux system administration, preferably on Ubuntu
  • Minimum of 3 years of experience in production monitoring using Prometheus/InfluxDB, ELK, Grafana, and OpsGenie/PagerDuty
  • Experience building golden images and troubleshooting software/hardware issues
  • Ability to investigate in depth why servers are underperforming
  • Proficiency in Python, shell scripting, or Ansible
  • Experience working on geo-distributed and highly available production services
  • Strong knowledge of IPv4 and IPv6
  • Familiarity with data center operations
  • Hands-on experience in monitoring and debugging
  • Strong networking troubleshooting skills
  • Fluent verbal and written communication skills in English (B2+ level)
Nice to have
  • Experience with Incident Management and SLA/SLO/SLI
  • Performance tuning experience
  • Proficiency in Terraform
  • Experience with Kubernetes/Docker
  • Familiarity with DevOps best practices
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
