Lead Site Reliability Engineer
Remote in Argentina
Site Reliability Engineering
We are looking for an experienced Lead Site Reliability Engineer to join our team and drive the delivery of cutting-edge infrastructure solutions.
You will play a key role in building and maintaining Kubernetes and GPU clusters, automating processes, implementing Azure security and IAM configurations, enhancing observability, and streamlining SRE workflows. The goal is to keep research workloads scalable, reliable, performant, and cost-effective.
Responsibilities
- Set up, configure, and manage GPU-enabled Kubernetes clusters and standalone Linux GPU systems
- Apply workload separation policies using namespaces, RBAC, and GPU scheduling tools like Volcano
- Build and maintain automation tools and scripts in Python, Bash, and PowerShell to improve deployment and operational processes
- Manage Kubernetes applications with Helm, ensuring efficient version control, rollbacks, and environment-specific configurations
- Implement Azure security measures, including RBAC, SAS tokens, managed identities, and private endpoints, in line with Microsoft standards
- Actively contribute to SRE practices such as defining SLOs and SLIs, handling incident response, participating in on-call rotations, and optimizing resource costs
- Deploy and manage observability systems, including Prometheus for metrics, Grafana for dashboards, and Loki for log management
- Set up alerting systems with tools like PagerDuty or similar platforms
- Diagnose and resolve infrastructure-related issues, collaborating with cross-functional teams to meet strict deadlines
- Enhance CI/CD pipelines using Azure DevOps for seamless integration and continuous deployment
Requirements
- Minimum 5 years of experience in a DevOps or SRE position managing large-scale infrastructure
- At least one year of experience leading and mentoring development teams
- Deep expertise in Kubernetes administration, including namespaces, Pod scheduling, PVCs, NFS, and GPU workload management
- Demonstrated experience managing GPU compute clusters in Kubernetes and standalone HPC environments
- Advanced skills in Python scripting for automation, along with proficiency in Bash and PowerShell
- Strong knowledge of the Azure platform, particularly in security and IAM configurations like RBAC, SAS tokens, managed identities, and private endpoints
- Solid Linux administration skills, including troubleshooting and system performance optimization
- Hands-on experience with CI/CD pipelines, with a preference for Azure DevOps expertise
- Familiarity with SRE concepts such as SLOs, SLIs, incident response, on-call rotations, and cost efficiency
- Experience working with observability tools like Prometheus, Grafana, and Loki
- Excellent English communication skills, both written and spoken, at a B2+ level or higher
Nice to have
- Advanced knowledge of Helm for packaging, deploying, and managing Kubernetes applications
- Familiarity with the Volcano scheduler for Kubernetes for handling complex GPU workloads
- Experience configuring alerting systems using PagerDuty or similar tools
- Multi-cloud Kubernetes expertise, including Amazon EKS and Google GKE
- Proficiency with Infrastructure as Code tools like Terraform or comparable solutions
- Understanding of Azure networking, including VPNs, ExpressRoute, and network security setups
- Exposure to additional observability stacks such as ELK or OpenTelemetry
- Knowledge of AI-assisted development platforms like GitHub Copilot, ChatGPT, and Claude
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn