Lead SRE Engineer
Office in India: Chennai
Site Reliability Engineering
We are seeking a Lead SRE Engineer to drive the reliability, scalability and performance of our cloud-based platforms. This role is pivotal in leading digital transformation initiatives, enabling product teams and partners to adopt and integrate cloud services with a customer-centric mindset.
Responsibilities
- Lead the design and implementation of highly available and resilient cloud infrastructure
- Define and enforce best practices for Infrastructure-as-Code and automation
- Oversee CI/CD pipeline development and optimization across multiple teams
- Mentor and guide engineers in SRE and DevOps methodologies
- Collaborate with product and platform teams to ensure seamless integration of cloud services
- Establish and maintain robust monitoring, alerting and observability frameworks
- Develop and implement disaster recovery and business continuity strategies
- Drive incident management processes and post-mortem analysis for continuous improvement
- Ensure security, compliance and identity/access management standards are met
- Communicate platform value and reliability initiatives to stakeholders
Requirements
- 8-14 years of experience in Site Reliability Engineering, DevOps or Cloud Infrastructure roles, including at least 2 years in a leadership or mentoring capacity
- Deep knowledge of AWS services including EC2, S3, RDS, IAM, VPC and Lambda, provisioned with CloudFormation or Terraform
- Expertise in Infrastructure-as-Code using Terraform, AWS CDK or CloudFormation
- Proficiency in CI/CD tools such as Jenkins, GitHub Actions or GitLab CI
- Skills in containerization and orchestration with Docker, Kubernetes, ECS or EKS
- Competency in monitoring and observability tools like Datadog, New Relic, Prometheus, Grafana, ELK or CloudWatch
- Background in scripting or programming with Python, Bash or Go
- Understanding of networking, security and identity/access management in cloud environments
- Experience designing high-availability and disaster recovery strategies for critical workloads
- Excellent communication, problem-solving and leadership skills with the ability to influence across teams
Nice to have
- Experience with AIOps, Serverless Architectures and event-driven systems
- Familiarity with FinOps practices and cost optimization frameworks
- Experience with SaaS monitoring tools such as Datadog, New Relic, Sumo Logic or PagerDuty
- Exposure to Atlassian tools including Jira, Confluence or Bitbucket
- Experience with SQL or NoSQL databases
- Track record of leading cross-functional reliability initiatives or platform-wide automation projects
