Senior Reliability Engineer

Site Reliability Engineering, Azure Compute, Azure Pipelines, Azure Storage, Google Cloud Monitoring, PowerShell, Python, Azure Containers, Azure DevOps, Google Cloud Platform, Kubernetes

We are seeking a Senior Reliability Engineer to join our remote team. This role is crucial for ensuring our systems' ongoing stability and efficiency, focusing on minimizing downtime and maximizing performance. The ideal candidate will have a proven track record of improving system reliability and a strong technical acumen in managing complex infrastructures. Your expertise will help shape our operational strategies, ensuring our services are robust and resilient against disruptions.

Responsibilities
  • Lead initiatives to enhance system reliability, availability, and resilience
  • Design and implement robust monitoring solutions to proactively identify potential issues
  • Mentor junior engineers in reliability best practices and advanced troubleshooting techniques
  • Collaborate with cross-functional teams to ensure seamless deployments and operations
  • Develop automation scripts to streamline operational processes and reduce human error
  • Conduct detailed root cause analysis for critical incidents and drive continuous improvement
  • Establish and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure system performance
  • Advocate for and implement reliability-focused changes in the software development lifecycle
Requirements
  • Minimum of 3 years of experience in a Reliability Engineer role
  • Advanced scripting skills in Python and PowerShell
  • Strong knowledge of cloud platforms, specifically Azure and GCP
  • Proficient with Azure DevOps pipelines for efficient CI/CD workflows
  • Expertise in debugging and troubleshooting complex systems
  • Experience with monitoring and logging tools such as Google Cloud Logging, Grafana, and Azure Monitor Logs
  • In-depth understanding of Site Reliability Engineering (SRE) principles
  • English communication skills at B2 level or higher
Nice to have
  • Experience with Kubernetes and container orchestration platforms
  • Proven ability to lead projects focused on system scalability and disaster recovery planning
  • Familiarity with advanced data analytics and machine learning tools to predict system failures
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn