Senior Site Reliability Engineer

Remote in Poland

Site Reliability Engineering

Sorry, this position is no longer available

We are currently seeking a Senior Site Reliability Engineer to join our team remotely.

Our focus is operational excellence in managing our systems. We don't resort to manual methods to solve operational problems such as availability, scalability, and latency. Instead, we write software to solve them and choose carefully which technologies to build on.

Responsibilities

Design, development, and implementation of systems software that enhances the stability, scalability, availability, and robustness of the client’s products and services
Creation of patterns for automation, instrumentation, etc., for reuse across teams and products
Ownership of several services and products
Automation of fixes for operational issues rather than manual remediation
Development and implementation of strategies for effective and proactive monitoring and observability of the systems
Provision of senior technical leadership during Major Incident calls
Management of cross-functional technical resources after Major Incidents to ensure root causes are understood and documented
Troubleshooting and fixing system issues in a complex distributed landscape
Participation in an on-call rotation, including weekend or after-hours coverage
Oversight and continuous improvement of incident-response processes at the client's end
Advocacy of engineering best practices across the company
Contribution to the client's growth through interviewing and onboarding

Requirements

3+ years of relevant experience in a DevOps or SRE role
Experience with public cloud infrastructure (e.g., AWS, Azure) and related technologies (e.g., Docker, Kubernetes, CloudFormation)
Comprehensive understanding of storage and database systems, caching, queuing, and networking
Experience in leading technical recoveries
Working knowledge of Service Management practices (ITIL)
Experience in designing, analyzing, and troubleshooting distributed systems
Ability to debug and optimize code and to automate routine operational tasks
Strong foundation in Linux or Windows administration and troubleshooting
Familiarity with monitoring/observability technologies like Prometheus, Grafana, Kibana, Elasticsearch
Understanding of service level agreements (SLAs) and service level objectives (SLOs)
Excellent command of the English language, both written and spoken
Solid understanding of programming principles and proficiency in at least one programming language relevant for infrastructure work

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
