
Lead ML Infrastructure Engineer

Remote in Argentina and 2 other locations
Machine Learning Engineering and 7 others

We are seeking a Lead ML Infrastructure Engineer to strengthen our MLOps team, focusing on the design and management of our enterprise machine learning platform while advancing scalable ML infrastructure and deployment practices.

Responsibilities
  • Provide expert advice on ML technologies, tools, and MLOps best practices with an emphasis on model observability, tracking, and deployment
  • Design and maintain robust batch processing and ML inference pipelines for efficient model execution
  • Automate ML model deployment processes through CI/CD pipelines to enhance production workflows
  • Monitor deployed models and infrastructure for health, performance, reliability, and scalability
  • Ensure seamless integration of ML inference services with other applications or systems
  • Enable deployments of ML models that scale efficiently and maintain high performance in production environments
  • Collaborate with client stakeholders and team members to ensure requirements are understood and tasks are completed effectively
  • Develop infrastructure solutions that support both data processing pipelines and batch inference
  • Write comprehensive unit tests to ensure reliability for ML deployment, inference, and post-processing methods
  • Maintain proactive and transparent communication with team members and stakeholders to ensure alignment
Requirements
  • 5+ years of experience with AWS services and MLOps-focused infrastructure for scalable ML model deployment
  • Expertise in infrastructure-as-code tools, enabling efficient and consistent infrastructure provisioning
  • Strong background in setting up and monitoring infrastructure for data and ML inference pipelines
  • Demonstrated ability to take ownership of tasks and work collaboratively with client stakeholders and teams
  • Skill in writing effective unit tests for ML deployment, inference, and related methods
  • Clear communication skills, including the ability to ask for clarification when needed
Nice to have
  • Knowledge of Google Cloud Platform (GCP) and its ML-specific services
  • Proficiency in using Snowflake as a data platform for ML workflows
  • Understanding of Feature Store platforms to enhance feature management processes
  • Background in Spark and AWS Elastic MapReduce (EMR) for processing distributed datasets
  • Familiarity with data curation best practices to support ML model training and high-quality dataset creation
  • Willingness to participate in on-call rotations to maintain system reliability in production environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn