Senior Site Reliability Engineer
Remote in Argentina
Site Reliability Engineering
We are seeking a highly skilled Senior Site Reliability Engineer to join our team.
In this role, you will be responsible for delivering Kubernetes and GPU cluster infrastructure, automation, Azure security/IAM implementation, observability, and SRE processes to ensure the performance, reliability, scalability, and cost efficiency of high-demand research workloads.
Responsibilities
- Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux GPU hosts
- Implement workload isolation policies using namespaces, RBAC, and GPU scheduling technologies like Volcano
- Develop and maintain automation scripts in Python, Bash, and/or PowerShell for deployment, configuration, and operational optimization
- Package and manage Kubernetes applications using Helm, ensuring proper version control, rollbacks, and environment configuration
- Ensure security compliance in Azure by implementing RBAC, SAS tokens, managed identities, and private endpoints
- Participate actively in SRE practices, including defining SLOs/SLIs, managing incidents, taking part in on-call rotations, and driving resource cost optimization
- Implement and manage observability solutions with tools like Prometheus for metrics, Grafana for visualization, and Loki for logging
- Configure alerting workflows using tools such as PagerDuty or similar solutions
- Troubleshoot infrastructure issues, resolve incidents efficiently, and collaborate with cross-functional teams to meet tight deadlines
- Contribute to CI/CD pipeline configuration and optimization using Azure DevOps to support continuous integration and deployment
Requirements
- Minimum 3 years of experience in a DevOps or SRE role working with complex infrastructure at scale
- Expert proficiency in Kubernetes administration, including namespaces, Pod scheduling, PVCs, NFS, and GPU workloads
- Proven experience managing GPU compute clusters within Kubernetes and standalone Linux-based HPC nodes
- Advanced Python scripting skills for infrastructure automation, plus proficiency with Bash and/or PowerShell
- Strong Azure platform experience, including security and IAM (RBAC, SAS tokens, managed identities, private endpoints)
- Advanced Linux administration, troubleshooting, and system optimization skills
- Strong knowledge of CI/CD pipelines, ideally with Azure DevOps
- Familiarity with SRE methodologies, including SLOs/SLIs, incident management, on-call processes, and cost optimization
- Hands-on experience with observability tools such as Prometheus, Grafana, and Loki
- Fluent English communication skills (written and spoken) at a B2+ level or higher
Nice to have
- Advanced Helm expertise for application packaging, deployment, and environment management
- Experience with the Volcano scheduler for Kubernetes for advanced GPU workload handling
- Familiarity with PagerDuty or equivalent tools for alerting and on-call workflow management
- Multi-cloud Kubernetes experience, including Amazon EKS and Google GKE
- Experience with Infrastructure as Code tools like Terraform or similar technologies
- Knowledge of Azure networking, including VPN, ExpressRoute, and network security configuration
- Experience working with additional observability/monitoring stacks such as ELK or OpenTelemetry
- Familiarity with AI-assisted development tools like GitHub Copilot, ChatGPT, and Claude
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn