
Senior ML Infrastructure Engineer

Remote in Argentina & 2 others
Machine Learning Engineering & 7 others

We are seeking a Senior ML Infrastructure Engineer to bolster our MLOps team, overseeing the development and maintenance of our enterprise machine learning platform while driving innovation in scalable ML infrastructure and deployment practices.

Responsibilities
  • Provide expert guidance on ML technologies, tools, and MLOps best practices focused on model observability, tracking, and deployment
  • Build and maintain robust batch processing and ML inference pipelines to enable efficient model execution
  • Automate ML model deployment processes with CI/CD pipelines to streamline production workflows
  • Monitor the health, performance, reliability, and scalability of deployed models and infrastructure
  • Integrate ML inference services seamlessly with other applications or systems
  • Enable scalable, high-performance deployments of ML models that perform well under production load
  • Collaborate directly with client stakeholders and team members to ensure requirements are met and tasks are completed effectively
  • Implement infrastructure solutions that support data processing pipelines and batch inferencing
  • Create comprehensive unit tests for ML deployment, inference, and post-processing methods
  • Maintain clear and proactive communication with team members and stakeholders to ensure alignment
Requirements
  • 3+ years of experience with AWS services and MLOps-related infrastructure, focusing on scalable ML model deployment
  • Expertise in infrastructure-as-code tools, enabling efficient and consistent infrastructure setup
  • Strong background in setting up and monitoring infrastructure for data pipelines and ML inference pipelines
  • Demonstrated ownership of tasks, with experience working directly with client stakeholders and cross-functional teams
  • Proficiency in writing unit tests for ML deployment, inference, and related methods to ensure code reliability
  • Clear and effective communication skills with the ability to seek clarifications when needed
Nice to have
  • Experience with Google Cloud Platform (GCP) and its ML-related services
  • Competency in working with Snowflake as a data platform for ML workflows
  • Familiarity with Feature Store platforms to improve feature management
  • Background in using Spark and Amazon EMR (Elastic MapReduce) for distributed data processing
  • Understanding of data curation best practices for producing high-quality datasets for ML model training
  • Flexibility to participate in on-call rotations, ensuring system reliability in production environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn