Lead AI DevOps Engineer

Remote in Portugal
Data DevOps

We are seeking an experienced Lead AI DevOps Engineer to oversee the infrastructure supporting Generative AI applications.

In this role, you will design, implement, and maintain scalable cloud environments tailored for advanced AI tools and platforms. You will collaborate closely with cross-functional teams to ensure seamless integration, optimized operations, and secure management of AI frameworks.

Responsibilities
  • Architect and manage scalable, secure cloud infrastructure on GCP, including Google Kubernetes Engine (GKE) clusters and Vertex AI workflows, to support Generative AI solutions
  • Integrate and maintain Python-based AI tools and frameworks like LiteLLM, Dify.AI, CrewAI, and Guardrails AI within the Agentic AI platform
  • Develop and manage automated CI/CD pipelines, infrastructure provisioning, and resource scaling to ensure efficient operations for AI platforms
  • Deploy monitoring, logging, and alerting systems to maintain service reliability, performance, and availability, and resolve operational issues as they arise
  • Implement security measures and governance protocols to safeguard data and ensure compliance across the AI infrastructure

Requirements
  • At least 5 years of experience in DevOps, cloud infrastructure, or AI platform operations
  • A minimum of one year of experience in leading and managing development teams
  • Extensive experience with GCP, including hands-on management of Google Kubernetes Engine (GKE) and Vertex AI workflows
  • Expertise in cloud-native deployments and infrastructure optimization within GKE and Vertex AI environments
  • Knowledge of Generative AI concepts, tools, and applications, with demonstrated experience in integrating AI frameworks into production platforms
  • Advanced proficiency in Python, particularly for integrating AI frameworks and supporting operational requirements
  • Familiarity with containerization technologies such as Docker and Kubernetes for application orchestration and deployment
  • Experience with monitoring and logging tools like Prometheus and Grafana to ensure system observability
  • Strong understanding of AI governance, security standards, and best practices for compliance and data protection
  • Fluent English communication skills, both written and spoken, at a B2 level or higher

Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn