Lead Site Reliability Engineer

Site Reliability Engineering

Argentina

We are looking for an experienced Lead Site Reliability Engineer to join our team and drive the delivery of cutting-edge infrastructure solutions.

You will play a key role in building and maintaining Kubernetes and GPU clusters, automating processes, implementing Azure security and IAM configurations, enhancing observability, and streamlining SRE workflows to ensure the scalability, reliability, performance, and cost-effectiveness of research workloads.

Responsibilities

Set up, configure, and manage GPU-enabled Kubernetes clusters and standalone Linux GPU systems
Apply workload separation policies using namespaces, RBAC, and GPU scheduling tools like Volcano
Build and maintain automation tools and scripts in Python, Bash, and PowerShell to improve deployment and operational processes
Manage Kubernetes applications with Helm, ensuring efficient version control, rollbacks, and environment-specific configurations
Implement Azure security measures, including RBAC, SAS tokens, managed identities, and private endpoints, in line with Microsoft standards
Actively contribute to SRE practices such as defining SLOs and SLIs, handling incident response, participating in on-call rotations, and optimizing resource costs
Deploy and manage observability systems, including Prometheus for metrics, Grafana for dashboards, and Loki for log management
Set up alerting systems with tools like PagerDuty or similar platforms
Diagnose and resolve infrastructure-related issues, collaborating with cross-functional teams to meet strict deadlines
Enhance CI/CD pipelines using Azure DevOps for seamless integration and continuous deployment

Requirements

Minimum 5 years of experience in a DevOps or SRE position managing large-scale infrastructure
At least one year of experience leading and mentoring development teams
Deep expertise in Kubernetes administration, including namespaces, pod scheduling, PVCs, NFS, and GPU workload management
Demonstrated experience managing GPU compute clusters in Kubernetes and standalone HPC environments
Advanced skills in Python scripting for automation, along with proficiency in Bash and PowerShell
Strong knowledge of the Azure platform, particularly in security and IAM configurations like RBAC, SAS tokens, managed identities, and private endpoints
Solid Linux administration skills, including troubleshooting and system performance optimization
Hands-on experience with CI/CD pipelines, with a preference for Azure DevOps expertise
Familiarity with SRE concepts such as SLOs, SLIs, incident response, on-call rotations, and cost efficiency
Experience working with observability tools like Prometheus, Grafana, and Loki
Excellent English communication skills, both written and spoken, at a B2+ level or higher

Nice to have

Advanced knowledge of Helm for packaging, deploying, and managing Kubernetes applications
Familiarity with the Volcano scheduler for Kubernetes for handling complex GPU workloads
Experience configuring alerting systems using PagerDuty or similar tools
Multi-cloud Kubernetes expertise, including Amazon EKS and Google GKE
Proficiency with Infrastructure as Code tools like Terraform or comparable solutions
Understanding of Azure networking, including VPNs, ExpressRoute, and network security setups
Exposure to additional observability stacks such as ELK or OpenTelemetry
Knowledge of AI-assisted development platforms like GitHub Copilot, ChatGPT, and Claude

We offer

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn