Senior Site Reliability Engineer

Site Reliability Engineering, CA Release Automation, Certificate authority, IBM AIX, Firewalls, Load Balancing Tools, REST API, RedHat Satellite, Requirements and Change management, Test planning & reporting, VM


We are seeking a highly experienced Senior Site Reliability Engineer to join our remote team on a complex and challenging project that involves developing and maintaining highly resilient applications. As a Senior Site Reliability Engineer, you will be responsible for ensuring the reliability, availability, and performance of our applications, and for collaborating with Development and Operations teams to align on requirements and on SDLC capabilities and limitations. You will also evaluate and implement orchestration, automation, and tooling solutions so that processes are consistent and repetitive tasks are performed with greater accuracy and fewer defects.

Responsibilities

Enable the creation and updating of logging standards to streamline dashboard creation and ensure the usability of the logging repository
Drive monitoring requirements to ensure business-service level visibility for all support teams
Provide guidance to software engineers on design patterns that are resistant to failure
Communicate effectively with Development and Operations teams to align on the SDLC requirements, capabilities, and limitations pertinent to delivering highly resilient applications
Evaluate and implement orchestration, automation, and tooling solutions so that processes are consistent and repetitive tasks are performed with greater accuracy and fewer defects
Build, implement and advise on recovery tooling to adhere to enterprise standards and/or frameworks
Introduce new and impactful technologies to the production support tool chain that help minimize friction for production releases and support
Ensure application data flows are accurate and up to date, with the objective of increasing the knowledge base of all support teams and driving reliability
Facilitate the resolution of non-application issues (3rd party upstream issues, infrastructure issues, storage, database, network, file transfer, etc.)
Participate in architectural decisions to ensure software transaction flows are appropriately supported and designed

Requirements

A minimum of 3 years of experience in Site Reliability Engineering, with demonstrated expertise in developing and maintaining highly resilient applications
In-depth knowledge of CA Release Automation, certificate authorities, IBM AIX, and firewalls
Experience with load balancing tools, REST APIs, and Red Hat Satellite
Knowledge of requirements and change management, test planning and reporting, VMs, and other relevant tools and technologies
Experience in driving monitoring requirements to ensure business-service level visibility for all support teams
Strong experience in evaluating and implementing orchestration, automation, and tooling solutions
Strong experience in availability, proactive monitoring and alerting, capacity planning, and performance (reducing latency and increasing efficiency), including testing for technical platforms
Excellent communication skills and the ability to work effectively with Development and Operations teams to align on requirements and on SDLC capabilities and limitations
Spoken and written English at an Upper-Intermediate level or higher

Benefits

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn