
Lead SRE Engineer

Office in India: Chennai
Site Reliability Engineering

We are seeking a Lead SRE Engineer to drive the reliability, scalability and performance of our cloud-based platforms. This role is pivotal in leading digital transformation initiatives, enabling product teams and partners to adopt and integrate cloud services with a customer-centric mindset.

Responsibilities
  • Lead the design and implementation of highly available and resilient cloud infrastructure
  • Define and enforce best practices for Infrastructure-as-Code and automation
  • Oversee CI/CD pipeline development and optimization across multiple teams
  • Mentor and guide engineers in SRE and DevOps methodologies
  • Collaborate with product and platform teams to ensure seamless integration of cloud services
  • Establish and maintain robust monitoring, alerting and observability frameworks
  • Develop and implement disaster recovery and business continuity strategies
  • Drive incident management processes and post-mortem analysis for continuous improvement
  • Ensure security, compliance and identity/access management standards are met
  • Communicate platform value and reliability initiatives to stakeholders

Requirements
  • 8 to 14 years of experience in Site Reliability Engineering, DevOps or Cloud Infrastructure roles, with at least 2 years in a leadership or mentoring capacity
  • Deep knowledge of AWS services including EC2, S3, RDS, IAM, VPC, Lambda and CloudFormation
  • Expertise in Infrastructure-as-Code using Terraform, AWS CDK or CloudFormation
  • Proficiency in CI/CD tools such as Jenkins, GitHub Actions or GitLab CI
  • Skills in containerization and orchestration with Docker, Kubernetes, ECS or EKS
  • Competency in monitoring and observability tools like Datadog, New Relic, Prometheus, Grafana, ELK or CloudWatch
  • Background in scripting or programming with Python, Bash or Go
  • Understanding of networking, security and identity/access management in cloud environments
  • Experience designing high-availability and disaster recovery strategies for critical workloads
  • Excellent communication, problem-solving and leadership skills with the ability to influence across teams

Nice to have
  • Experience with AIOps, serverless architectures and event-driven systems
  • Familiarity with FinOps practices and cost optimization frameworks
  • Experience with SaaS monitoring and incident management tools such as Datadog, New Relic, Sumo Logic or PagerDuty
  • Exposure to Atlassian tools including Jira, Confluence or Bitbucket
  • Experience with SQL or NoSQL databases
  • Demonstrated record of leading cross-functional reliability initiatives or platform-wide automation projects