Senior DevOps Engineer (HPC)

Remote in Brazil & 2 others
DevOps & 2 others

We are seeking a Senior DevOps Engineer to enhance HPC application workflows and optimize scientific builds in a collaborative environment.

Join our team to support and enhance application-build workflows using Jenkins, EasyBuild, and Ansible for high-performance computing (HPC) environments. You will work closely with scientific users globally to optimize workflows, profile applications, and suggest improvements. Apply now to contribute your expertise and make a significant impact.

Responsibilities
  • Support development of application build workflows with Jenkins, EasyBuild, and Ansible for HPC environments
  • Optimize scientific application builds and automate testing procedures
  • Collaborate with scientific users to identify and resolve workflow issues
  • Profile applications in HPC environments and recommend performance optimizations
  • Maintain and troubleshoot Linux systems supporting HPC clusters
  • Coordinate with globally distributed users and teams to ensure smooth operations
  • Document workflows and share best practices with users and team members
  • Implement and monitor workload management using Altair Grid Engine
  • Assist in setting up and configuring CUDA, OpenMPI, TensorFlow, and PyTorch environments
  • Evaluate and integrate new tools and technologies to improve HPC workflows
  • Provide proactive support and respond to user requirements and constraints
  • Ensure compliance with security and operational policies in HPC environments
  • Participate in continuous improvement initiatives to enhance HPC infrastructure
  • Communicate effectively with users of varying technical expertise
  • Train and mentor junior team members and users on HPC systems and workflows
Requirements
  • Expert understanding of Linux systems with 3+ years of experience in DevOps
  • 3+ years of experience with HPC clusters and workload managers such as Altair Grid Engine
  • Proven experience developing workflows for application builds and automated testing
  • Strong knowledge of CUDA, OpenMPI, TensorFlow, and PyTorch setup and configuration
  • Familiarity with AWS cloud services and HPC integration
  • Experience working with InfiniBand networking technology
  • Ability to work tactfully with users of varied technical competence
  • Proactive attitude toward problem-solving and workflow optimization
  • Collaborative mindset with experience working in globally distributed teams
  • Ability to understand and incorporate user requirements and constraints
  • Strong organizational and documentation skills
  • Experience supporting scientific or research environments
  • Strong written and verbal communication skills in English (B2 level or higher)
Nice to have
  • Understanding of drug development and workflows in biotech/pharma R&D environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn