Senior Site Reliability Engineer (SRE)

Remote in Poland, Ukraine
DevOps

We are looking for an experienced Senior Site Reliability Engineer (SRE) to join our Platform Engineering team and enhance the reliability, scalability, and observability of our systems.

You will work closely with cross-functional teams to implement best practices, improve Developer Experience, and ensure compliance with critical Service Level Management (SLM) targets and performance metrics. Your role will involve optimizing infrastructure, creating automation solutions, and collaborating on event-driven architectures using tools like Terraform, Kubernetes, AWS, Kafka, and New Relic.

Responsibilities
  • Design and implement scalable and highly available systems using techniques such as load balancing, canary releases, blue-green deployments, and auto-scaling
  • Develop and maintain monitoring, logging, and observability dashboards using tools like New Relic, Prometheus, Grafana, and Datadog
  • Assist teams in determining appropriate settings and thresholds for alerts and automation, accounting for variance in application performance requirements
  • Ensure compliance with SLAs and SLOs by monitoring SLIs and DORA metrics and tracking targets such as uptime, response times, and incident resolution times
  • Advocate for system resiliency by implementing and promoting chaos engineering practices
  • Collaborate with cross-functional teams to enhance platform engineering practices and guide the adoption of improved tooling and metrics analysis
  • Analyze system performance and reliability metrics to drive data-informed improvements in platform infrastructure
  • Improve performance and scalability of event-driven architectures using tools like Kafka
  • Manage cloud infrastructure solutions on AWS, Azure, or GCP in line with business needs
Requirements
  • 5+ years of experience with Infrastructure-as-Code tooling such as Terraform
  • Extensive knowledge of DevOps metrics like DORA (e.g., deployment frequency, change failure rate) and Service Level Management (SLAs, SLOs, SLIs)
  • Expertise in monitoring and observability tools such as New Relic, Prometheus, Grafana, or Datadog
  • Strong experience in designing scalable architectures with load balancing, canary releases, and auto-scaling methodologies
  • Proficiency in working with cloud platforms such as AWS, Azure, or GCP
  • Background in CI/CD pipelines using tools like GitHub Actions, Jenkins, or GitLab CI
  • Experience with Kafka for real-time event-driven data processing and performance improvement
  • Understanding of SLM tooling and metrics platforms, such as Apache DevLake, Grafana, and New Relic
Nice to have
  • Familiarity with Observability-as-Code practices and tooling
  • Background in implementing chaos engineering practices to validate system resiliency
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn