Senior Site Reliability Engineer

Remote in Argentina
Site Reliability Engineering

We are seeking a highly skilled Senior Site Reliability Engineer to join our team.

In this role, you will be responsible for delivering Kubernetes and GPU cluster infrastructure, automation, Azure security/IAM implementation, observability, and SRE processes to ensure the performance, reliability, scalability, and cost efficiency of high-demand research workloads.

Responsibilities
  • Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux GPU hosts
  • Implement workload isolation policies using namespaces, RBAC, and GPU scheduling technologies like Volcano
  • Develop and maintain automation scripts in Python, Bash, and/or PowerShell for deployment, configuration, and operational optimization
  • Package and manage Kubernetes applications using Helm, ensuring proper version control, rollbacks, and environment configuration
  • Ensure security compliance in Azure by implementing RBAC, SAS tokens, managed identities, and private endpoints
  • Participate actively in SRE practices, including defining SLOs/SLIs, managing incidents, taking part in on-call rotations, and driving resource cost optimization
  • Implement and manage observability solutions with tools like Prometheus for metrics, Grafana for visualization, and Loki for logging
  • Configure alerting workflows using tools such as PagerDuty or similar solutions
  • Troubleshoot infrastructure issues, resolve incidents efficiently, and collaborate with cross-functional teams to meet tight deadlines
  • Contribute to CI/CD pipeline configuration and optimization using Azure DevOps to support continuous integration and deployment
Requirements
  • Minimum 3 years of experience in a DevOps or SRE role working with complex infrastructure at scale
  • Expert proficiency in Kubernetes administration, including namespaces, Pod scheduling, PVCs, NFS, and GPU workloads
  • Proven experience managing GPU compute clusters within Kubernetes and standalone Linux-based HPC nodes
  • Advanced Python scripting skills for infrastructure automation, plus proficiency with Bash and/or PowerShell
  • Strong Azure platform experience, including security and IAM (RBAC, SAS tokens, managed identities, private endpoints)
  • Advanced Linux administration, troubleshooting, and system optimization skills
  • Strong knowledge of CI/CD pipelines, ideally with Azure DevOps
  • Familiarity with SRE methodologies, including SLOs/SLIs, incident management, on-call processes, and cost optimization
  • Hands-on experience with observability tools such as Prometheus, Grafana, and Loki
  • Fluent English communication skills (written and spoken) at a B2+ level or higher
Nice to have
  • Advanced Helm expertise for application packaging, deployment, and environment management
  • Experience with the Volcano scheduler for Kubernetes for advanced GPU workload handling
  • Familiarity with PagerDuty or equivalent tools for alerting and on-call workflow management
  • Multi-cloud Kubernetes experience, including Amazon EKS and Google GKE
  • Experience with Infrastructure as Code tools like Terraform or similar technologies
  • Knowledge of Azure networking, including VPN, ExpressRoute, and network security configuration
  • Experience working with additional observability/monitoring stacks such as ELK or OpenTelemetry
  • Familiarity with AI-assisted development tools like GitHub Copilot, ChatGPT, and Claude
We offer
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn