Senior Site Reliability Engineer

Remote in Vietnam: Ho Chi Minh City
Site Reliability Engineering

EPAM Vietnam is hiring a Senior Site Reliability Engineer to support and stabilize a complex, business-critical environment. This is a hands-on, high-ownership role responsible for production incidents, releases, monitoring, alerting and operational excellence.

You will work across Linux, Windows, SQL Server, CI/CD, Kubernetes and Azure while supporting both modern cloud workloads and legacy business-critical systems.

Responsibilities
  • Own production incidents end-to-end, from triage to fix and follow-up
  • Troubleshoot Linux and Windows systems, services and databases
  • Operate and improve monitoring and alerting tools
  • Support batch workflows and schedulers
  • Work across production and disaster recovery environments
  • Improve runbooks, alert quality and operational processes
Requirements
  • Strong experience in production operations, SRE or infrastructure support
  • Proven expertise in troubleshooting Linux and Windows production systems, plus operational knowledge of Microsoft SQL Server diagnostics
  • Experience with CI/CD pipelines and deployments (e.g., Octopus Deploy, TeamCity and Git/Bitbucket)
  • Proficiency in monitoring and alerting tools (e.g., Prometheus and Grafana)
  • Familiarity with batch scheduling tools (e.g., Control-M and TeamCity) and messaging systems (e.g., RabbitMQ)
  • Working knowledge of Kubernetes and Azure cloud environments
  • Clear communication for incident management and stakeholder interaction
  • Strong sense of ownership, sound judgment in escalation and a proactive approach to production reliability