Lead DevOps Engineer (HPC)
Brazil
We are seeking a Lead DevOps Engineer to drive the improvement of HPC application workflows and optimize scientific builds in a collaborative setting.
Join our team to build and refine application workflows using Jenkins, EasyBuild, and Ansible within high-performance computing (HPC) environments. You will collaborate with scientific users around the world to improve workflows, analyze application performance, and implement better solutions. Apply now to put your skills to work and make a meaningful contribution.
Responsibilities
- Support application build workflows with Jenkins, EasyBuild, and Ansible for HPC systems
- Optimize scientific application builds and automate testing workflows
- Collaborate with scientific users to analyze and resolve workflow issues
- Conduct application profiling in HPC environments and suggest performance enhancements
- Manage and troubleshoot Linux systems supporting HPC clusters
- Facilitate communication with globally distributed users and teams for efficient operations
- Create and share documentation of workflows and best practices
- Oversee workload management through Altair Grid Engine
- Set up CUDA, OpenMPI, TensorFlow, and PyTorch environments
- Assess and integrate new tools and technologies for better HPC workflows
- Provide user-centered support and address technical requirements and constraints
- Ensure adherence to security and operational guidelines in HPC systems
- Identify opportunities for continuous improvements in HPC infrastructure
- Communicate effectively with users at varying technical levels
- Guide and mentor junior team members and users in using HPC systems and workflows
Requirements
- Over 5 years of DevOps experience, with strong proficiency in Linux systems
- At least 4 years of experience working with HPC clusters and workload managers such as Altair Grid Engine
- Background in creating workflows for application builds and automated testing
- Knowledge of CUDA, OpenMPI, TensorFlow, and PyTorch configuration
- Familiarity with AWS cloud services and HPC integration techniques
- Background in working with InfiniBand networking technology
- Strong interpersonal skills to collaborate with users of diverse technical proficiency
- Ability to approach problems with initiative and optimize workflows
- Team-oriented attitude with experience collaborating across global teams
- Competency in addressing user requirements and constraints effectively
- Strong ability to organize and document processes
- Background in supporting operations in scientific or research environments
- Strong verbal and written communication skills in English at the B2+ level
Nice to have
- Knowledge of drug development processes and related workflows in biotech or pharmaceutical R&D environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn