Lead Site Reliability Engineer
Argentina
We are looking for an experienced Lead Site Reliability Engineer to join our team and play a key role in ensuring the stability, scalability, and performance of our systems. This position involves improving infrastructure, enhancing automation processes, and maintaining optimal functionality across distributed systems and cloud environments. You will collaborate with diverse teams, drive technical initiatives, and provide mentorship to foster a culture of innovation and operational excellence.
Responsibilities
- Enhance the performance and reliability of Linux-based systems used for production services and distributed environments
- Implement advanced monitoring solutions with tools such as Splunk, Grafana, and Prometheus to strengthen system observability
- Resolve complex Kubernetes-related issues and establish guidelines and best practices for the team
- Create and maintain automation workflows using Bash and Python to optimize operational efficiency
- Develop and manage container orchestration platforms like Kubernetes or EKS while sharing knowledge with the team
- Design robust cloud architecture with AWS to ensure reliability and scalability of infrastructure
- Champion automation efforts to streamline processes and reduce manual workloads
- Provide leadership by promoting collaboration, accountability, and effective communication within the team
- Support continuous learning and development within the team to encourage growth and innovation
- Offer mentorship and technical expertise to team members to enhance operational practices and communication
- Plan and execute disaster recovery strategies and capacity management to maintain system resilience
- Automate deployment processes using tools like Terraform or CloudFormation to improve team productivity
- Incorporate open-source technologies such as Cassandra, Kafka, Solr, Postgres, and Redis to strengthen SRE practices
Requirements
- Bachelor’s degree in Computer Science, a related technical field, or equivalent hands-on experience
- Five or more years of experience as a Site Reliability Engineer
- At least one year of experience guiding and managing technical teams
- Proficiency in Bash for scripting and automation tasks to enhance workflows
- Experience with Grafana for monitoring and system performance visualization
- Advanced knowledge of Linux systems and their optimization for production environments
- Familiarity with Microsoft Internet Information Services (IIS) for managing web server frameworks
- Proficiency in Prometheus for distributed system monitoring and alerting
- Experience with Python for developing automation solutions and improving operational processes
- Fluency in English, both written and spoken, at a B2 level or higher
Nice to have
- Experience with Amazon Web Services (AWS) for designing scalable cloud solutions
- Knowledge of cloud platforms and their integration into infrastructure design
- Expertise in Kubernetes for orchestrating and managing containerized applications
- Familiarity with Splunk for telemetry and log management
- Hands-on knowledge of Terraform and Terraform Cloud for automating infrastructure deployment
- Strong skills in troubleshooting and resolving complex system issues
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn