Lead Systems Engineer
DevOps, GitHub Actions, Grafana, Istio, Kubernetes, Python, Terragrunt, Amazon Web Services, Prometheus, Terraform
Sorry, this position is no longer available
We invite applications for a remote Lead DevOps Engineer to join our team.
The successful candidate will play a key role in building and maintaining the CVML platform. Your DevOps expertise, particularly with GitHub Actions, Grafana, Istio, Kubernetes, Python, Terraform, and Amazon Web Services, is critical to ensuring the platform's reliability and performance.
If you have a keen interest in automation, a command of modern DevOps tools and methods, and enjoy solving complex problems in a team setting, this role may be ideal for you.
Responsibilities
- Development of Terraform and Terragrunt configurations for infrastructure as code
- Management and creation of GitHub Actions workflows for CI/CD pipelines
- Resolution of data access permission issues in AWS S3 and AWS IAM
- Troubleshooting of Kubeflow ML pipeline issues related to CPU, memory, GPU, and permissions
- Scripting using Python for platform automation tasks
- Team collaboration to boost the CVML platform's reliability and efficiency
- Involvement in architecture and design discussions for system enhancements
- Keeping abreast of the latest DevOps tools and practices for continuous improvement
Requirements
- At least 5 years of hands-on experience in DevOps roles
- At least 1 year of leadership experience
- Deep understanding of Kubernetes and its ecosystem, particularly AWS EKS and KubeSpray
- Proficiency in using Terraform and Terragrunt for infrastructure as code
- Experience with Prometheus and Grafana for monitoring and observability
- Solid knowledge of Istio for service mesh and its basic components, such as sidecars, mTLS, and ingress gateway
- Proficiency in Python for scripting and automation tasks
- Hands-on experience with GitHub and GitHub Actions for CI/CD pipelines
- Sound understanding of AWS services, including networking, load balancing, and IAM
- Excellent troubleshooting skills for data access permission issues in AWS S3 and AWS IAM
- Ability to develop and troubleshoot Kubeflow ML pipeline issues
Nice to have
- Familiarity with distributed tracing tools such as Zipkin, including tracing in Istio
- Knowledge of Golang, Kubeflow, and Pulumi
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn