Senior DevOps Engineer (HPC)
Brazil
We are seeking a Senior DevOps Engineer to enhance HPC application workflows and optimize scientific builds in a collaborative environment.
Join our team to support and enhance application-build workflows using Jenkins, EasyBuild, and Ansible for high-performance computing (HPC) environments. You will work closely with scientific users globally to optimize workflows, profile applications, and suggest improvements. Apply now to contribute your expertise and make a significant impact.
Responsibilities
- Support development of application build workflows with Jenkins, EasyBuild, and Ansible for HPC environments
- Optimize scientific application builds and automate testing procedures
- Collaborate with scientific users to identify and resolve workflow issues
- Profile applications in HPC environments and recommend performance optimizations
- Maintain and troubleshoot Linux systems supporting HPC clusters
- Coordinate with globally distributed users and teams to ensure smooth operations
- Document workflows and share best practices with users and team members
- Implement and monitor workload management using Altair Grid Engine
- Assist in setting up and configuring CUDA, OpenMPI, TensorFlow, and PyTorch environments
- Evaluate and integrate new tools and technologies to improve HPC workflows
- Provide proactive support and respond to user requirements and constraints
- Ensure compliance with security and operational policies in HPC environments
- Participate in continuous improvement initiatives to enhance HPC infrastructure
- Communicate effectively with users of varying technical expertise
- Train and mentor junior team members and users on HPC systems and workflows
Requirements
- Expert understanding of Linux systems and 3+ years of DevOps experience
- 3+ years of experience with HPC clusters and workload managers such as Altair Grid Engine
- Proven experience developing workflows for application builds and automated testing
- Strong knowledge of CUDA, OpenMPI, TensorFlow, and PyTorch setup and configuration
- Familiarity with AWS cloud services and HPC integration
- Experience working with InfiniBand networking technology
- Ability to work tactfully with users of varied technical competence
- Proactive attitude toward problem-solving and workflow optimization
- Collaborative mindset with experience working in globally distributed teams
- Ability to understand and incorporate user requirements and constraints
- Strong organizational and documentation skills
- Experience supporting scientific or research environments
- Strong written and verbal communication skills in English at the B2+ level
Nice to have
- Understanding of drug development and workflows in biotech/pharma R&D environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn