Lead DevOps Engineer

DevOps, Dynatrace, Grafana, Splunk

We are seeking an experienced Lead DevOps Engineer to join our team, focusing on incident and request management, with proficiency in tools such as Dynatrace, Grafana, and Splunk.

This role demands expertise in setting up monitoring systems and administering tools, along with the ability to resolve medium-complexity break/fix tickets. If you are a strategic thinker adept at ensuring high availability and fault tolerance in systems, we invite you to apply.

Responsibilities
  • Develop and maintain documentation that outlines best practices for logging and monitoring
  • Perform regular audits to verify adherence to policies and industry standards
  • Participate in cross-functional discussions to foster logging and monitoring best practices throughout the company
  • Oversee monitoring, alerting, operability, and observability using Dynatrace, Splunk, and Grafana
  • Triage and update tickets, and assess their urgency
  • Consult documentation to identify and escalate tickets that exceed Level 2 troubleshooting capabilities
  • Utilize documentation to address standard incidents and requests
  • Determine average time to complete tickets and establish SLOs for each product request type
  • Regularly document and review metrics and escalated tickets to improve the support process
  • Handle incidents and requests for monitoring setup and tool administration using JIRA
  • Provide off-hours monitoring and escalation, and carry pager duty during emergencies
Requirements
  • Over 5 years of experience in DevOps or SRE roles
  • Bachelor’s degree in computer science or a related field and/or equivalent work experience
  • Expertise in leading a diverse team and encouraging collaboration
  • Strong understanding of observability including monitoring, logging, and tracing
  • Hands-on experience with Dynatrace, Splunk, and Grafana
  • Background in Azure logging and monitoring tools such as Log Analytics, Azure Monitor, and App Insights
  • Capacity to work both independently and as part of a team
  • Strong analytical and problem-solving skills, with expertise in troubleshooting under pressure
  • Strategic thinker with exceptional organizational and interpersonal skills
  • Ability to adapt quickly to new technologies
  • Outstanding communication skills and fluency in English
Nice to have
  • Proven track record in managing high-availability, fault-tolerant, scalable systems in production environments
Benefits
  • International projects with top brands
  • Work with global teams of highly skilled, diverse peers
  • Healthcare benefits
  • Employee financial programs
  • Paid time off and sick leave
  • Upskilling, reskilling and certification courses
  • Unlimited access to the LinkedIn Learning library and 22,000+ courses
  • Global career opportunities
  • Volunteer and community involvement opportunities
  • EPAM Employee Groups
  • Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn