Senior Site Reliability Engineer

Hybrid in Ukraine: Lviv
DevOps

We are looking to hire a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team.

The ideal candidate has expertise in DevOps and a deep understanding of Service Level Management (SLM) metrics, along with experience in event-driven infrastructure projects using tools such as Terraform, New Relic, Kubernetes, AWS, and Kafka. In this role, you will be a key member of the Platform Engineering team, collaborating with other engineering groups to ensure our platform infrastructure tools meet their needs and improve Developer Experience. You will also help teams identify appropriate configurations and thresholds for alerts and automations within their applications.

Responsibilities
  • Design scalable and highly available systems, implementing solutions that use load balancing, auto-scaling patterns, canary releases, and blue-green deployments
  • Develop monitoring and logging dashboards with tools such as New Relic, Prometheus, Grafana, and Datadog, ensuring observability through metrics, tracing, log aggregation, and alerting
  • Assist teams in defining settings and thresholds for application-specific alerts and automations, accounting for each application's performance requirements, such as response times and resource constraints
  • Monitor system reliability and optimize performance with tools such as New Relic, apply DORA metrics to improve development and operational performance, and maintain compliance with SLM metrics such as SLAs, SLOs, and SLIs
  • Advocate for and implement chaos engineering practices to strengthen system resiliency
  • Collaborate with cross-functional teams to improve platform engineering practices and ensure effective metrics analysis
Requirements
  • Knowledge of Infrastructure-as-Code tooling, such as Terraform, for infrastructure management
  • Understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments
  • Proficiency in DevOps metrics (e.g., DORA) to measure and improve development and operational performance
  • Familiarity with Service Level Management (SLM) metrics (e.g., SLAs, SLOs, and SLIs) to define and monitor service levels and ensure compliance with expected standards
  • Expertise in monitoring, logging, and observability tools such as New Relic, Prometheus, Grafana, and Datadog
  • Background in using Kafka to enhance the performance of event-driven, real-time data processing and streaming architectures
  • Competency in tools that measure SLM, DevOps, and DORA metrics, including Apache DevLake, Grafana, and New Relic
  • Skills in managing cloud infrastructure with providers such as AWS, Azure, or GCP
  • Proficiency in CI/CD pipeline tools such as GitHub Actions, Jenkins, or GitLab CI
  • Analytical skills to interpret metrics and provide actionable improvements
  • Strong communication skills to foster collaboration within teams and with stakeholders
Nice to have
  • Understanding of Observability-as-Code tools and best practices
  • Background in using chaos engineering methodologies to enhance system resiliency