Lead SRE Engineer

Office in India: Chennai

Site Reliability Engineering

We are seeking a Lead SRE Engineer to drive the reliability, scalability and performance of our cloud-based platforms. This role is pivotal in leading digital transformation initiatives, enabling product teams and partners to adopt and integrate cloud services with a customer-centric mindset.

Responsibilities

Lead the design and implementation of highly available and resilient cloud infrastructure
Define and enforce best practices for Infrastructure-as-Code and automation
Oversee CI/CD pipeline development and optimization across multiple teams
Mentor and guide engineers in SRE and DevOps methodologies
Collaborate with product and platform teams to ensure seamless integration of cloud services
Establish and maintain robust monitoring, alerting and observability frameworks
Develop and implement disaster recovery and business continuity strategies
Drive incident management processes and post-mortem analysis for continuous improvement
Ensure security, compliance and identity/access management standards are met
Communicate platform value and reliability initiatives to stakeholders

Requirements

8 to 14 years of experience in Site Reliability Engineering, DevOps or Cloud Infrastructure roles, including at least 2 years in a leadership or mentoring capacity
Deep knowledge of AWS services including EC2, S3, RDS, IAM, VPC, Lambda and CloudFormation
Expertise in Infrastructure-as-Code using Terraform, AWS CDK or CloudFormation
Proficiency in CI/CD tools such as Jenkins, GitHub Actions or GitLab CI
Skills in containerization and orchestration with Docker, Kubernetes, ECS or EKS
Competency in monitoring and observability tools like Datadog, New Relic, Prometheus, Grafana, ELK or CloudWatch
Background in scripting or programming with Python, Bash or Go
Understanding of networking, security and identity/access management in cloud environments
Experience designing high-availability and disaster recovery strategies for critical workloads
Excellent communication, problem-solving and leadership skills with the ability to influence across teams

Nice to have

Experience with AIOps, serverless architectures and event-driven systems
Familiarity with FinOps practices and cost optimization frameworks
Experience with SaaS monitoring and incident-response tools such as Datadog, New Relic, Sumo Logic or PagerDuty
Exposure to Atlassian tools including Jira, Confluence or Bitbucket
Experience with SQL or NoSQL databases
Track record of leading cross-functional reliability initiatives or platform-wide automation projects