
Lead Site Reliability Engineer

Remote in Brazil
Site Reliability Engineering

We are seeking a highly skilled Lead Site Reliability Engineer to join our team and help us deliver optimal performance, efficiency, and maximum business value to our customers. As an SRE, you will take a software engineering approach to solving operational challenges, applying a customer- and production-focused lens to ensure reliability and scalability. You will collaborate closely with DevOps, Platform, and other engineering teams, recognizing and leveraging the unique expertise and perspectives each brings to the table.

Your work will be instrumental in enhancing the software development lifecycle (SDLC) through automation, innovation, and a commitment to continuous improvement. You will also play a key role in fostering a culture of collaboration, learning, and experimentation, while driving fast-flow delivery of high-quality, secure software.

Responsibilities
  • Design, build, and maintain scalable, reliable, and efficient systems to support fast and secure software delivery
  • Automate and streamline software delivery pipelines to enable fast flow while maintaining traceability and transparency
  • Develop and maintain CI/CD pipelines with integrated shift-left quality and security testing
  • Monitor and improve production systems to ensure optimal performance, availability, and reliability for end users
  • Collaborate with development, testing, security, and operations teams to promote shared goals and deliver maximum business value
  • Incorporate customer-focused metrics (e.g., SLAs, SLOs, SLIs) into reliability strategies to meet and exceed expectations

Requirements
  • At least 5 years of experience in site reliability engineering or comparable DevOps roles
  • 1+ years of demonstrated leadership experience
  • Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and version control systems (e.g., Git)
  • Expertise in AWS and in containerization and container orchestration tools (e.g., Docker, Kubernetes)
  • Deep understanding of infrastructure as code (IaC) tools (e.g., Terraform, Ansible, CloudFormation)
  • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, New Relic)
  • Familiarity with automated testing frameworks and shift-left security practices
  • Excellent written and spoken English (B2+ level)

Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn