Senior Site Reliability Engineer

Amazon Web Services, Amazon EC2, Amazon Elastic Container Service, CI/CD, Terraform, Amazon DynamoDB, Ansible, Bash, Docker, Go Language, JavaScript, New Relic, Node.js, PHP, Python

We are in search of a Senior Site Reliability Engineer with a focus on cost savings and advanced system maintenance to join our team.

As a Senior Site Reliability Engineer, you will play a pivotal role in building, supporting, and optimizing high-capacity systems that are both efficient and cost-effective. You will be responsible for sophisticated tasks within our AWS infrastructure, working closely with product development teams to enhance automation, improve system performance, and ensure the reliability of our systems while optimizing costs and resources effectively.

Responsibilities
  • Develop and implement advanced cloud cost optimization strategies through in-depth analysis and resizing recommendations
  • Collaborate with engineering and product teams to design sophisticated cost-aware architectural solutions
  • Design, maintain, and optimize comprehensive dashboards for monitoring cloud expenditures in real time
  • Identify and implement AWS cost-saving strategies, including Reserved Instances and Savings Plans, with a focus on maximizing financial efficiency
  • Foster a culture of financial prudence and accountability regarding cloud resource usage across engineering teams
  • Design, analyze, and manage troubleshooting strategies for highly distributed large-scale production systems and cloud-based services
  • Lead continuity planning efforts, including failure injections and the validation of effective monitoring configurations
  • Propose and integrate infrastructure scalability enhancements to handle at least double the currently expected load
  • Coordinate middleware, network, storage, database, and server operations at large, complex scale
  • Perform advanced performance testing and tuning to ensure optimized system responsiveness
  • Develop, refine, and oversee telemetry processes to monitor key operational metrics for better decision-making
Requirements
  • 3+ years of experience as a software engineer developing, debugging, and deploying enterprise applications in high-demand environments
  • Strong track record of managing cloud infrastructure costs using tools like AWS Cost Explorer
  • Advanced proficiency in infrastructure automation technologies such as Terraform
  • Expertise in managing container orchestration using ECS or Kubernetes on a large scale
  • Expert troubleshooting skills across hosting technologies including web servers, operating systems, and network components
  • Advanced skills in continuous deployment frameworks and lifecycle management (e.g., CI/CD)
  • Deep understanding of database operations and deployment with cloud databases like RDS MySQL, Postgres, and Aurora
  • Expert knowledge of caching strategies for high concurrency workloads
  • Mastery of Lean/Agile deployment processes such as Blue/Green, zero-downtime (ZDT), and Canary
  • Expertise with telemetry SaaS systems including New Relic products like APM and Synthetics
  • Exceptional problem-solving and root cause analysis capabilities with a track record of high-impact solutions
  • Excellent communication skills and the ability to manage culturally aligned escalation response plans across teams
  • English level B2+ for effective communication across global teams
Nice to have
  • Bachelor's or Master's Degree in Computer Science or an equivalent field
  • Advanced ability to communicate across a wide range of technical and non-technical stakeholders
  • Proficiency in multiple programming languages including JavaScript, Python, and PHP, among others
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn