Reliability Engineer
Site Reliability Engineering, Azure Compute, Azure Pipelines, Azure Storage, Google Cloud Monitoring, PowerShell, Python, Azure Containers, Azure DevOps, Google Cloud Platform, Kubernetes
We are seeking a Reliability Engineer to join our remote team. In this role, you will ensure the stability, integrity, and efficiency of the information systems that support core organizational functions. You will also be instrumental in identifying and resolving issues that affect the reliability of our systems and services. A successful candidate will thrive in a fast-paced environment and be committed to proactive service optimization and issue prevention.
Responsibilities
- Monitor system performance and reliability, identifying and resolving issues before they impact users
- Develop and implement maintenance procedures to reduce system downtime and increase overall efficiency
- Collaborate with development teams to enhance system design and architecture with a focus on reliability and scalability
- Conduct root cause analysis on incidents to prevent recurrence
- Optimize system configurations and settings for improved performance and reliability
- Implement and manage monitoring tools and software to provide critical operational metrics and insights
Requirements
- Minimum of 2 years of experience as a Reliability Engineer
- Proven scripting skills in Python and PowerShell to automate tasks and processes
- Strong knowledge of cloud platforms, specifically Azure and GCP
- Experience with Azure DevOps pipelines for continuous integration and deployment
- Proficiency in debugging and troubleshooting complex software and hardware issues
- Familiarity with monitoring tools such as GCP Cloud Logging, Grafana, and Azure Logs
- Solid understanding of Site Reliability Engineering (SRE) principles and practices
- Fluent English communication skills at a B2 level or higher
Nice to have
- Experience with Kubernetes and container technologies
- Ability to lead cross-functional projects and initiatives to improve system reliability
- Experience in implementing disaster recovery plans and failover mechanisms to ensure high availability and business continuity
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn