Middle Site Reliability Engineer
Amazon Web Services, Amazon EC2, Amazon Elastic Container Service, CI/CD, Terraform, Amazon DynamoDB, Ansible, Bash, Docker, Go Language, JavaScript, New Relic, Node.js, PHP, Python
We are seeking a Middle Site Reliability Engineer focused on cost savings and system maintenance to join our team.
In this role, you'll be crucial in building and supporting robust, high-capacity systems that are efficient and cost-effective. You'll work within our AWS infrastructure, collaborating with product development teams to enhance automation, improve performance, and ensure the reliability of our systems while optimizing costs.
Responsibilities
- Implement and refine cloud cost optimization strategies through analysis and resizing recommendations
- Collaborate with engineering and product teams to create cost-aware architectural solutions
- Develop, maintain, and optimize dashboards for monitoring cloud expenditures
- Identify and leverage AWS cost-saving opportunities such as Reserved Instances and Savings Plans
- Educate and promote a culture of financial responsibility regarding cloud resource usage
- Design, analyze, and troubleshoot highly distributed large-scale production systems and cloud-based services
- Support continuity planning, including failure injection and validation of monitoring configurations
- Enhance infrastructure scalability plans to handle double the expected load
- Manage middleware, network, storage, database, and server coordination
- Conduct performance testing and tuning for optimized system responsiveness
- Develop and maintain telemetry processes to monitor key operational metrics
Requirements
- 2+ years of experience as a software engineer developing, debugging, and deploying enterprise applications
- Proven background in reporting on cloud infrastructure costs using tools such as AWS Cost Explorer
- Proficiency in infrastructure automation technologies such as Terraform
- Capability to manage container orchestration using ECS or Kubernetes
- Versatile troubleshooting skills across hosting technologies including web servers, operating systems, and network components
- Skills in continuous deployment frameworks and lifecycle management (e.g., CI/CD)
- Competency in database operations and deployment with cloud databases like RDS MySQL, Postgres, and Aurora
- Knowledge of caching strategies for high concurrency workloads
- Understanding of Lean/Agile deployment processes such as Blue/Green, zero-downtime (ZDT), and Canary
- Familiarity with telemetry SaaS systems including New Relic products like APM and Synthetics
- Strong problem-solving and root cause analysis capabilities
- Excellent communication skills and ability to manage culturally aligned escalation response plans
- English level B2+ for effective communication
Nice to have
- Bachelor's Degree in Computer Science
- Ability to communicate across a broad range of technical and non-technical stakeholders
- Fluency in multiple programming languages including JavaScript, Python, and PHP, among others
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn