Lead Site Reliability Engineer

Remote in Poland

Site Reliability Engineering

Sorry, this position is no longer available

Our team is actively searching for a remote Lead Site Reliability Engineer.

We prioritize operational excellence in how we manage systems. Rather than handling operational challenges such as availability, scalability, and latency manually, we address them through software development and careful technology selection.

Responsibilities

Improvement of the stability, scalability, availability, and robustness of client products and services through software design, development, and implementation
Development of automation, instrumentation, and other patterns for reuse across various teams and products
Ownership of multiple services and products
Resolution of operational issues through automation rather than manual solutions
Design and implementation of strategies for effective, proactive system monitoring and observability
Provision of senior technical leadership during Major Incident calls
Coordination of cross-functional technical resources after Major Incidents to ensure root causes are understood and documented
Troubleshooting and resolution of system issues in a complex distributed landscape
Participation in an on-call rotation, which may include weekend or after-hours coverage
Supervision and continuous improvement of incident-response processes
Promotion of engineering best practices across the company
Contribution to client growth through interviewing and onboarding

Requirements

At least 5 years of relevant experience in a DevOps or SRE role
A minimum of 1 year of relevant leadership experience
Proficiency with public cloud infrastructure (e.g., AWS, Azure) and related technologies (e.g., Docker, Kubernetes, CloudFormation)
In-depth understanding of storage and database systems, caching, queuing, and networking
Demonstrated experience in leading technical recoveries
Working knowledge of Service Management practices (ITIL)
Experience in designing, analyzing, and troubleshooting distributed systems
Ability to debug and optimize code and to automate routine operational tasks
Solid foundation in Linux or Windows administration and troubleshooting
Familiarity with monitoring/observability technologies like Prometheus, Grafana, Kibana, Elasticsearch
Understanding of service level agreements (SLAs) and service level objectives (SLOs)
Excellent command of the English language, both written and spoken
Strong understanding of programming principles and proficiency in at least one programming language relevant for infrastructure work

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
