Site Reliability Engineer (SRE)
Site Reliability Engineering, Amazon DynamoDB, Amazon ElastiCache, Amazon Web Services, Git, Gradle, Observability and troubleshooting in distributed systems, Apache Cassandra, Apache Kafka, Grafana, Java, Kubernetes, New Relic, Scala, Terraform
Sorry, this position is no longer available
We are looking for a Site Reliability Engineer (SRE) to join our team remotely.
The successful candidate will work in a team of SREs and collaborate closely with other teams to provide 24/7 on-call support for our entire customer platform. The ability to absorb large amounts of information quickly, troubleshoot complex systems efficiently, and communicate operational issues clearly is essential for this role.
Responsibilities
- Provide 24/7 on-call support for Java backend services and API Gateway observability
- Prepare and deploy patches to Java code and related service cloud infrastructure
- Establish top-level metrics and dashboards that show overall platform health at a glance
- Improve runbooks for all End of Service (EOS) backend services
- Monitor the SLOs of all involved backend services and submit code changes to improve SLO compliance when errors occur
Requirements
- At least 2 years of experience in Site Reliability Engineering
- Proficiency in Amazon Web Services, including Amazon DynamoDB and Amazon ElastiCache
- Experience with Git and Gradle, and with observability and troubleshooting in distributed systems
- Experience in providing 24/7 on-call support for backend services
- Strong skills in establishing and monitoring SLOs for backend services
- Fluent English communication skills at a B2+ level
Nice to have
- Experience with Docker and container orchestration tools like Kubernetes
- Knowledge of Apache Cassandra, Apache Kafka, and Grafana
- Familiarity with Java, New Relic, Scala, and Terraform
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn