Reliability Engineer

Site Reliability Engineering, Azure Compute, Azure Pipelines, Azure Storage, Google Cloud Monitoring, PowerShell, Python, Azure Containers, Azure DevOps, Google Cloud Platform, Kubernetes

We are seeking a Reliability Engineer to join our remote team. In this role, you will ensure the stability, integrity, and efficiency of the information systems that support core organizational functions. You will also be instrumental in identifying and resolving issues that affect the reliability of our systems and services. A successful candidate will thrive in a fast-paced environment and be committed to proactive service optimization and issue prevention.

Responsibilities

Monitor system performance and reliability, identifying and resolving issues before they impact users
Develop and implement maintenance procedures to reduce system downtime and increase overall efficiency
Collaborate with development teams to enhance system design and architecture with a focus on reliability and scalability
Conduct root cause analysis on incidents to prevent recurrence
Optimize system configurations and settings for improved performance and reliability
Implement and manage monitoring tools and software to provide critical operational metrics and insights

Requirements

Minimum of 2 years of experience as a Reliability Engineer
Proven scripting skills in Python and PowerShell to automate tasks and processes
Strong knowledge of cloud platforms, specifically Azure and GCP
Experience with Azure DevOps pipelines for continuous integration and deployment
Proficiency in debugging and troubleshooting complex software and hardware issues
Familiarity with monitoring tools such as GCP Cloud Logging, Grafana, and Azure Logs
Solid understanding of Site Reliability Engineering (SRE) principles and practices
Fluent English communication skills at a B2 level or higher

Nice to have

Experience with Kubernetes and container technologies
Ability to lead cross-functional projects and initiatives to improve system reliability
Experience in implementing disaster recovery plans and failover mechanisms to ensure high availability and business continuity

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn