Lead AI Platform Engineer - Databricks & AWS
Hybrid in Mexico
Platform Engineering
We are looking for a Lead AI Platform Engineer to architect, deploy, and manage scalable Databricks platforms on AWS that support advanced ML and analytics pipelines.
In this role, you will work closely with data scientists and ML engineers to enhance the Lakehouse developer environment and drive innovation in AI infrastructure. Join us to lead the development of state-of-the-art AI platform solutions.
Responsibilities
- Architect and deploy scalable Databricks platform solutions for analytics, machine learning, and GenAI workflows across multiple environments
- Manage and enhance Databricks workspaces, including cluster policies, autoscaling, GPU compute, and job clusters
- Oversee Unity Catalog governance by managing metastores, catalogs, schemas, data sharing, masking, lineage, and access control
- Develop and maintain Infrastructure as Code with Terraform to enable automated, consistent platform provisioning
- Establish CI/CD for notebooks, libraries, Delta Live Tables (DLT) pipelines, and ML assets using GitHub Actions and Databricks APIs
- Standardize experiment tracking and model registry workflows with MLflow, and manage model serving endpoints, including monitoring and rollback
- Optimize Delta Lake batch and streaming pipelines using Auto Loader, Structured Streaming, and DLT while ensuring data quality and SLA compliance
- Collaborate with cross-functional teams to integrate platform features and deliver an exceptional developer experience
- Monitor system performance, troubleshoot issues, and implement enhancements to guarantee platform reliability and scalability
- Document platform operations and maintain automation runbooks for governance and support
- Coordinate with security teams to enforce data governance, encryption, and compliance standards
- Champion best practices in coding, testing, and deployment across the platform engineering team
- Drive ongoing improvements in automation and operational efficiency for the platform
- Engage stakeholders to capture requirements and provide expert technical guidance
- Lead and mentor junior engineers, sharing expertise in platform technologies
Requirements
- At least 5 years of platform engineering experience, with proven expertise administering Databricks on AWS, including Unity Catalog governance and enterprise integrations
- Comprehensive knowledge of AWS services such as VPC, IAM, KMS, S3, CloudWatch, and network architecture
- Advanced skills with Terraform including the Databricks provider and experience with Infrastructure as Code for cloud environments
- Strong proficiency in Python and SQL, including packaging libraries and managing notebooks and repositories
- Experience using MLflow for experiment tracking, model registry, and model serving endpoints
- Familiarity with Delta Lake, Auto Loader, Structured Streaming, and DLT technologies
- Solid experience implementing DevOps automation, CI/CD pipelines, and using GitHub Actions or similar tools
- Expertise in Git and GitHub, including code review processes and branching strategies
- Working knowledge of REST APIs, Databricks CLI, and automation scripting
- Excellent communication and stakeholder management abilities
- Capacity to work autonomously and within distributed teams
- Detail-focused with strong problem-solving and organizational skills
- English language proficiency at B2 (Upper-Intermediate) level or above
Nice to have
- Hands-on experience with AWS EKS and Kubernetes
- Understanding of MLOps methodologies and pipeline automation
- Knowledge of attribute-based access control and enhanced data governance frameworks
- Experience with secrets management and SSO/SCIM provisioning
- Relevant certifications in AWS or Databricks platform engineering