Middle Site Reliability Engineer

Amazon Web Services, Amazon EC2, Amazon Elastic Container Service, CI/CD, Terraform, Amazon DynamoDB, Ansible, Bash, Docker, Go Language, JavaScript, New Relic, Node.js, PHP, Python

We are seeking a Middle Site Reliability Engineer focused on cost savings and system maintenance to join our team.

In this role, you'll be crucial in building and supporting robust, high-capacity systems that are efficient and cost-effective. You'll work within our AWS infrastructure, collaborating with product development teams to enhance automation, improve performance, and ensure the reliability of our systems while optimizing costs.

Responsibilities
  • Implement and refine cloud cost optimization strategies through analysis and resizing recommendations
  • Collaborate with engineering and product teams to create cost-aware architectural solutions
  • Develop, maintain, and optimize dashboards for monitoring cloud expenditures
  • Identify and leverage AWS cost-saving opportunities such as Reserved Instances and Savings Plans
  • Educate and promote a culture of financial responsibility regarding cloud resource usage
  • Design, analyze, and troubleshoot highly distributed large-scale production systems and cloud-based services
  • Support continuity planning, including failure injection and validation of monitoring configurations
  • Enhance infrastructure scalability plans to handle double the expected load
  • Coordinate the management of middleware, network, storage, database, and server components
  • Conduct performance testing and tuning for optimized system responsiveness
  • Develop and maintain telemetry processes to monitor key operational metrics
Requirements
  • 2+ years of experience as a software engineer developing, debugging, and deploying enterprise applications
  • Proven background in reporting on cloud infrastructure costs using tools such as AWS Cost Explorer
  • Proficiency in infrastructure automation technologies such as Terraform
  • Capability to manage container orchestration using ECS or Kubernetes
  • Versatile troubleshooting skills across hosting technologies including web servers, operating systems, and network components
  • Skills in continuous deployment frameworks and lifecycle management (e.g., CI/CD)
  • Competency in database operations and deployment with cloud databases like RDS MySQL, Postgres, and Aurora
  • Knowledge of caching strategies for high concurrency workloads
  • Understanding of Lean/Agile deployment processes such as Blue/Green, zero-downtime (ZDT), and Canary
  • Familiarity with telemetry SaaS systems including New Relic products like APM and Synthetics
  • Strong problem-solving and root cause analysis capabilities
  • Excellent communication skills and ability to manage culturally aligned escalation response plans
  • English level B2+ for effective communication
Nice to have
  • Bachelor's Degree in Computer Science
  • Ability to communicate across a broad range of technical and non-technical stakeholders
  • Fluency in multiple programming languages including JavaScript, Python, and PHP, among others
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn