Lead Site Reliability Engineer
Remote in Poland
Site Reliability Engineering

Sorry, this position is no longer available
Our team is actively searching for a remote Lead Site Reliability Engineer.
We prioritize operational excellence in how we manage systems. Rather than handling operational challenges such as availability, scalability, and latency manually, we address them through software development and careful technology selection.
Responsibilities
- Improvement of the stability, scalability, availability, and robustness of client products and services through software design, development, and implementation
- Development of automation, instrumentation, and other patterns for reuse across various teams and products
- Ownership of multiple services and products
- Automation of responses to operational issues in place of manual fixes
- Design and implementation of strategies for effective, proactive system monitoring and observability
- Provision of senior technical leadership during Major Incident calls
- Coordination of cross-functional technical resources after Major Incidents to ensure the root cause is understood and documented
- Troubleshooting and resolution of system issues in a complex distributed landscape
- Participation in an on-call rotation, which may include weekend or after-hours coverage
- Supervision and continuous improvement of incident-response processes
- Promotion of engineering best practices across the company
- Contribution to client growth through interviewing and onboarding
Requirements
- At least 5 years of relevant experience in a DevOps or SRE role
- A minimum of 1 year of relevant leadership experience
- Proficiency with public cloud infrastructure (e.g., AWS, Azure) and related technologies (e.g., Docker, Kubernetes, CloudFormation)
- In-depth understanding of storage and database systems, caching, queuing, and networking
- Demonstrated experience in leading technical recoveries
- Working knowledge of Service Management practices (ITIL)
- Experience in designing, analyzing, and troubleshooting distributed systems
- Ability to debug and optimize code and to automate routine operational tasks
- Solid foundation in Linux or Windows administration and troubleshooting
- Familiarity with monitoring and observability technologies such as Prometheus, Grafana, Kibana, and Elasticsearch
- Understanding of service level agreements (SLAs) and service level objectives (SLOs)
- Excellent command of the English language, both written and spoken
- Strong understanding of programming principles and proficiency in at least one programming language relevant to infrastructure work
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn