We are seeking a Senior Site Reliability Engineer to join our remote team.
As an SRE, you will work closely with our distributed team, which runs our production operations. You will be responsible for the day-to-day operations of our massively scalable and highly available backend platform: running our hybrid infrastructure, automating routine operational tasks, and leading individual projects while communicating progress frequently.
Responsibilities
- Collaborate with teams to design, build, and maintain highly available and scalable infrastructure
- Ensure the reliability and uptime of our services and applications
- Automate routine operational tasks to improve efficiency and productivity
- Lead individual projects and communicate progress frequently
- Troubleshoot software and hardware issues, build golden images, and investigate in depth why servers are underperforming
- Analyze why services and sites are unavailable or blocked, drawing on a working understanding of IPv4 and IPv6
- Troubleshoot network issues such as suboptimal connection speeds
- Effectively communicate with teams to troubleshoot, debug, and resolve issues in both production and non-production environments
Requirements
- A minimum of 3 years of experience in Site Reliability Engineering
- Strong experience with Amazon Web Services (AWS) and bare-metal infrastructure
- Proficiency in Grafana, Prometheus, and UNIX shell scripting
- Demonstrable experience with geo-distributed and highly available production services
- Strong network troubleshooting skills and proven experience in Linux system administration
- Familiarity with data center operations
- Fluent verbal and written communication skills in English (B2 level)
Nice to have
- Familiarity with incident management and SLAs/SLOs/SLIs
- Experience in performance tuning
- Knowledge of Kubernetes and Docker
- Expertise in Python and Terraform
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling, and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek, and LinkedIn