Senior Site Reliability Engineer

Site Reliability Engineering, Amazon Web Services, Bare metal, Grafana, Linux, Prometheus, UNIX shell scripting, Docker, Kubernetes, Python, Terraform, DevOps

Sorry, this position is no longer available

We are looking for a Senior Site Reliability Engineer to join our remote team.

In this position, you will be responsible for the day-to-day operations of our massively scalable and highly available backend platform. You will also run our hybrid infrastructure, automate routine operational tasks, and ensure the smooth functioning of our production services.

Responsibilities

Ensure the smooth functioning of our production services, including day-to-day operations of the backend platform
Run and maintain the hybrid infrastructure, automating routine operational tasks
Monitor production services using Prometheus/InfluxDB, ELK, Grafana, and OpsGenie/PagerDuty
Troubleshoot software/hardware issues and investigate in depth why servers are underperforming
Manage production incidents and work with stakeholders to resolve issues and minimize their impact
Create and maintain comprehensive documentation of all operational procedures and processes
Collaborate with development teams to design scalable and reliable systems
Implement best practices for security and compliance

Requirements

Minimum of 3 years of experience in Linux system administration, preferably on Ubuntu
Minimum of 3 years of experience in production monitoring using Prometheus/InfluxDB, ELK, Grafana, and OpsGenie/PagerDuty
Experience in building golden images and troubleshooting software/hardware issues
Ability to investigate in depth why servers are underperforming
Proficiency in Python, shell scripting, or Ansible
Experience working on geo-distributed and highly available production services
Strong knowledge of IPv4 and IPv6
Familiarity with data center operations
Hands-on experience in monitoring and debugging
Strong network troubleshooting skills
Fluent verbal and written communication skills in English (B2+ level)

Nice to have

Experience with Incident Management and SLA/SLO/SLI
Performance tuning experience
Proficiency in Terraform
Experience with Kubernetes/Docker
Familiarity with DevOps best practices

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
