Senior Site Reliability Engineer

Site Reliability Engineering

Argentina

We are seeking a highly skilled Senior Site Reliability Engineer to join our team.

In this role, you will deliver Kubernetes and GPU cluster infrastructure, automation, Azure security and IAM, observability, and SRE processes that keep high-demand research workloads performant, reliable, scalable, and cost-efficient.

Responsibilities

Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux GPU hosts
Implement workload isolation policies using namespaces, RBAC, and GPU scheduling technologies like Volcano
Develop and maintain automation scripts in Python, Bash, and/or PowerShell for deployment, configuration, and operational optimization
Package and manage Kubernetes applications using Helm, ensuring proper version control, rollbacks, and environment configuration
Ensure security compliance in Azure by implementing RBAC, SAS tokens, managed identities, and private endpoints
Participate actively in SRE practices, including defining SLOs/SLIs, managing incidents, taking part in on-call rotations, and driving resource cost optimization
Implement and manage observability solutions with tools like Prometheus for metrics, Grafana for visualization, and Loki for logging
Configure alerting workflows using tools such as PagerDuty or similar solutions
Troubleshoot infrastructure issues, resolve incidents efficiently, and collaborate with cross-functional teams to meet tight deadlines
Contribute to CI/CD pipeline configuration and optimization using Azure DevOps to support continuous integration and deployment

Requirements

Minimum 3 years of experience in a DevOps or SRE role working with complex infrastructure at scale
Expert proficiency in Kubernetes administration, including namespaces, Pod scheduling, PersistentVolumeClaims (PVCs), NFS-backed storage, and GPU workloads
Proven experience managing GPU compute clusters within Kubernetes and standalone Linux-based HPC nodes
Advanced Python scripting skills for infrastructure automation, plus proficiency with Bash and/or PowerShell
Strong Azure platform experience, including security and IAM (RBAC, SAS tokens, managed identities, private endpoints)
Advanced Linux administration, troubleshooting, and system optimization skills
Strong knowledge of CI/CD pipelines, ideally with Azure DevOps
Familiarity with SRE methodologies, including SLOs/SLIs, incident management, on-call processes, and cost optimization
Hands-on experience with observability tools such as Prometheus, Grafana, and Loki
Fluent English communication skills (written and spoken) at a B2+ level or higher

Nice to have

Advanced Helm expertise for application packaging, deployment, and environment management
Experience with the Volcano scheduler for Kubernetes for advanced GPU workload handling
Familiarity with PagerDuty or equivalent tools for alerting and on-call workflow management
Multi-cloud Kubernetes experience, including Amazon EKS and Google GKE
Experience with Infrastructure as Code tools like Terraform or similar technologies
Knowledge of Azure networking, including VPN, ExpressRoute, and network security configuration
Experience working with additional observability/monitoring stacks such as ELK or OpenTelemetry
Familiarity with AI-assisted development tools like GitHub Copilot, ChatGPT, and Claude

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn