Lead ML Infrastructure Engineer
Machine Learning Engineering
Argentina
We are seeking a Lead ML Infrastructure Engineer to strengthen our MLOps team, focusing on the design and management of our enterprise machine learning platform while advancing scalable ML infrastructure and deployment practices.
Responsibilities
- Provide expert advice on ML technologies, tools, and MLOps best practices with an emphasis on model observability, tracking, and deployment
- Design and maintain robust batch processing and ML inference pipelines for efficient model execution
- Automate ML model deployment processes through CI/CD pipelines to enhance production workflows
- Monitor deployed models and infrastructure for health, performance, reliability, and scalability
- Ensure seamless integration of ML inference services with other applications or systems
- Enable deployments of ML models that scale efficiently and maintain high performance in production environments
- Collaborate with client stakeholders and team members to ensure requirements are understood and tasks are completed effectively
- Develop infrastructure solutions that support both data processing pipelines and batch inference
- Write comprehensive unit tests to ensure reliability for ML deployment, inference, and post-processing methods
- Maintain proactive and transparent communication with team members and stakeholders to ensure alignment
Requirements
- 5+ years of experience with AWS services and MLOps-focused infrastructure for scalable ML model deployment
- Expertise in infrastructure-as-code tools for efficient and consistent infrastructure provisioning
- Strong background in setting up and monitoring infrastructure for data and ML inference pipelines
- Demonstrated ability to take ownership of tasks and work collaboratively with client stakeholders and teams
- Skill in writing effective unit tests for ML deployment, inference, and related methods
- Clear communication skills, including the ability to ask for clarification when necessary
Nice to have
- Knowledge of Google Cloud Platform (GCP) and its ML-specific services
- Proficiency in using Snowflake as a data platform for ML workflows
- Understanding of Feature Store platforms to enhance feature management processes
- Background in Spark and Amazon EMR (Elastic MapReduce) for processing distributed datasets
- Familiarity with data curation best practices to support ML model training and high-quality dataset creation
- Ability to participate in on-call rotations to maintain system reliability in production environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn