Senior Site Reliability Engineer

Hybrid in Ukraine: Lviv

We are looking to hire a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team.

The ideal candidate should have expertise in DevOps and a deep understanding of Service Level Management (SLM) metrics, along with experience in event-driven infrastructure projects utilizing tools like Terraform, New Relic, Kubernetes, AWS, and Kafka. In this role, you will serve as a vital member of the Platform Engineering team, collaborating with other engineering groups to ensure our platform infrastructure tools meet their needs and positively impact Developer Experience. Additionally, you will assist teams in identifying the appropriate configurations and thresholds for alerts or automations within their applications.

Responsibilities

Design scalable and highly available systems, implementing solutions that use load balancing, auto-scaling patterns, canary releases, and blue-green deployments
Develop monitoring and logging dashboards with tools such as New Relic, Prometheus, Grafana, and Datadog, ensuring observability through metrics, tracing, log aggregation, and alerting
Assist teams in defining settings and thresholds for application-specific alerts and automations, taking into account each application's performance requirements, such as response times and resource constraints
Monitor system reliability and optimize performance using tools such as New Relic, apply DORA metrics to improve development and operational performance, and maintain compliance with SLM metrics such as SLAs, SLOs, and SLIs
Advocate for and implement chaos engineering practices to strengthen system resilience
Collaborate with cross-functional teams to improve platform engineering practices and ensure effective metrics analysis

Requirements

Knowledge of Infrastructure-as-Code tooling, such as Terraform, for infrastructure management
Understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments
Proficiency in DevOps metrics (e.g., DORA) to measure and improve development and operational performance
Familiarity with Service Level Management (SLM) metrics (e.g., SLAs, SLOs, and SLIs) to define, monitor, and ensure compliance with expected standards
Expertise in monitoring, logging, and observability tools such as New Relic, Prometheus, Grafana, and Datadog
Background in using Kafka to enhance the performance of event-driven, real-time data processing and streaming architectures
Competency in tools that measure SLM, DevOps, and DORA metrics, including Apache DevLake, Grafana, and New Relic
Skills in managing cloud infrastructure with providers such as AWS, Azure, or GCP
Proficiency in CI/CD pipeline tools such as GitHub Actions, Jenkins, or GitLab CI
Analytical skills to interpret metrics and provide actionable improvements
Strong communication skills to foster collaboration within teams and with stakeholders

Nice to have

Understanding of Observability-as-Code tools and best practices
Background in using chaos engineering methodologies to improve system resilience
