Lead DevOps Engineer (HPC)

Remote in Brazil & 2 others
DevOps & 2 others

We are seeking a Lead DevOps Engineer to drive the improvement of HPC application workflows and optimize scientific builds in a collaborative setting.

Be part of our team to build and refine application workflows using Jenkins, EasyBuild, and Ansible within high-performance computing (HPC) environments. Collaborate with global scientific users to improve workflows, analyze application performance, and implement better solutions. Apply now to leverage your skills and make a meaningful contribution.

Responsibilities
  • Support application build workflows with Jenkins, EasyBuild, and Ansible for HPC systems
  • Optimize scientific application builds and automate testing workflows
  • Collaborate with scientific users to analyze and resolve workflow issues
  • Conduct application profiling in HPC environments and suggest performance enhancements
  • Manage and troubleshoot Linux systems supporting HPC clusters
  • Facilitate communication with globally distributed users and teams for efficient operations
  • Create and share documentation of workflows and best practices
  • Oversee workload management through Altair Grid Engine
  • Set up CUDA, OpenMPI, TensorFlow, and PyTorch environments
  • Assess and integrate new tools and technologies for better HPC workflows
  • Provide user-centered support and address technical requirements and constraints
  • Ensure adherence to security and operational guidelines in HPC systems
  • Identify opportunities for continuous improvements in HPC infrastructure
  • Communicate effectively with users at varying technical levels
  • Guide and mentor junior team members and users in using HPC systems and workflows
Requirements
  • Over 5 years of DevOps experience, with strong proficiency in Linux systems
  • At least 4 years of experience working with HPC clusters and workload managers such as Altair Grid Engine
  • Background in creating workflows for application builds and automated testing
  • Knowledge of CUDA, OpenMPI, TensorFlow, and PyTorch configuration
  • Familiarity with AWS cloud services and HPC integration techniques
  • Experience with InfiniBand networking technology
  • Strong interpersonal skills to collaborate with users of diverse technical proficiency
  • Ability to approach problems with initiative and optimize workflows
  • Team-oriented attitude with experience collaborating across global teams
  • Ability to address user requirements and constraints effectively
  • Strong ability to organize and document processes
  • Background in supporting operations in scientific or research environments
  • Strong verbal and written communication skills in English (B2+ level)
Nice to have
  • Knowledge of drug development processes and related workflows in biotech or pharmaceutical R&D environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn