Senior Site Reliability Engineer
Remote in Poland
Site Reliability Engineering

Sorry, this position is no longer available
We are currently seeking a Senior Site Reliability Engineer to join our team remotely.
Our focus is operational excellence in managing our systems. We do not solve operational problems such as availability, scalability, and latency by hand. Instead, we write software to address them and choose carefully which technologies to build on.
Responsibilities
- Design, development, and implementation of systems software that enhances the stability, scalability, availability, and robustness of the client’s products and services
- Creation of patterns for automation, instrumentation, etc., for reuse across teams and products
- Ownership of several services and products
- Automation of fixes for operational issues rather than manual remediation
- Development and implementation of strategies for effective and proactive monitoring and observability of the systems
- Provision of senior technical leadership during Major Incident calls
- Management of cross-functional technical resources after Major Incidents to ensure the root cause is understood and documented
- Troubleshooting and fixing system issues in a complex distributed landscape
- Participation in an on-call rotation, including weekend or after-hours coverage
- Oversight and continuous improvement of the client’s incident-response processes
- Advocacy of engineering best practices across the company
- Contribution to the client’s growth through interviewing and onboarding
Requirements
- 3+ years of relevant experience in a DevOps or SRE role
- Experience with public cloud infrastructure (e.g., AWS, Azure) and related technologies (e.g., Docker, Kubernetes, CloudFormation)
- Comprehensive understanding of storage and database systems, caching, queuing, and networking
- Experience in leading technical recoveries
- Working knowledge of Service Management practices (ITIL)
- Experience in designing, analyzing, and troubleshooting distributed systems
- Ability to debug and optimize code and to automate routine operational tasks
- Strong foundation in Linux or Windows administration and troubleshooting
- Familiarity with monitoring/observability technologies like Prometheus, Grafana, Kibana, Elasticsearch
- Understanding of service level agreements (SLAs) and service level objectives (SLOs)
- Excellent command of the English language, both written and spoken
- Solid understanding of programming principles and proficiency in at least one programming language relevant to infrastructure work
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn