Site Reliability Engineer – Azure DevOps

Remote in Mexico
Site Reliability Engineering

Join our team as a Site Reliability Engineer, where you will ensure system reliability, manage incident responses, and enable seamless collaboration between operations and development teams.

This role demands a background in Oil & Gas combined with expertise in automation and cloud technologies. Apply now to support critical infrastructure and drive operational excellence.

Responsibilities
  • Oversee and enhance the product monitoring system
  • Handle incidents, including troubleshooting, resolution, documentation, and analysis
  • Distribute knowledge and insights across teams
  • Facilitate collaboration between operations and development
  • Create automation for log analysis, production system testing, and alerting
  • Track system health, performance, and SLIs/SLOs/SLAs
  • Maintain documentation for incident management procedures
  • Conduct incident analyses and implement corrective actions
  • Respond to on-call support requests during and after business hours
  • Collaborate with teams to enhance system efficiency and reliability
  • Leverage tools such as PagerDuty, ELK/Kibana, SEQ logging, Prometheus, and Grafana for system monitoring
  • Develop scripts and implement automation solutions using Python, C#, and Bash
  • Manage orchestration and infrastructure through SaltStack and Docker
  • Support project workflows using Azure DevOps and maintain a comprehensive Wiki
  • Maintain code repositories and manage version control using Git

Requirements
  • 1+ years of experience in creating solutions, particularly in Site Reliability Engineering
  • Expertise in cloud services and automation scripting with Python and Bash
  • Background in Oil & Gas operations and incident handling
  • Skill in managing incident responses and providing on-call support
  • Familiarity with monitoring tools such as Prometheus and Grafana
  • Proficiency in logging tools like ELK/Kibana and SEQ logging
  • Knowledge of orchestration and infrastructure solutions including SaltStack and Docker
  • Understanding of fundamental networking concepts like inbound/outbound rules and firewalls
  • Proficiency in tools for project management and issue tracking like Azure DevOps
  • Capability to manage source code with Git
  • Strong skills in creating documentation and disseminating knowledge
  • Competency in conducting detailed post-incident reviews
  • Excellent troubleshooting abilities and problem-solving skills
  • Effective communication skills, with an English level of at least B2

Nice to have
  • Experience using PagerDuty for incident handling
  • Competency in C# programming
  • Understanding of SQL and MongoDB databases
  • Background in Zededa infrastructure
  • Experience in supporting Oil & Gas field operations