Lead Site Reliability Engineer - DevOps
Hybrid in Portugal: Lisbon
We are looking for a Lead Site Reliability Engineer to enhance a global execution platform, delivering robust solutions to trading desks and clients.
You will collaborate with expert teams, advancing your expertise in system administration, monitoring, and low-latency technologies. Join us to contribute to cutting-edge financial technology innovations.
Note that working on-site at the client's Lisbon office for 2-3 days per week is required.
Responsibilities
- Design and enforce monitoring, alerting, and incident management strategies
- Automate repetitive tasks and workflows to increase operational efficiency
- Work alongside software engineering teams to build and launch scalable, dependable systems
- Execute production deployments carefully to preserve platform stability
- Handle incident management with thorough analysis and reporting to maintain service quality
- Engage in on-call duties to support essential systems and services
- Communicate clearly with colleagues to swiftly resolve technical problems
- Maintain up-to-date documentation for operational workflows and system settings
- Drive continuous improvements in system reliability and efficiency through proactive initiatives
Requirements
- Deep understanding of Unix/Linux operating systems and networking, backed by over 5 years of experience
- Proficiency in Unix/Linux shell scripting and in at least one programming language such as Python, Perl, C, C++, or Java
- Experience with monitoring and observability solutions such as ITRS Geneos, Dynatrace, Prometheus, or Grafana
- Strong troubleshooting skills for complex system issues
- Experience in environments with high availability and heavy traffic
- Bachelor’s or Master’s degree in IT engineering or a related discipline
- Ability to collaborate effectively within a team and adapt to evolving environments
- Self-driven, with excellent problem-solving skills and a disciplined approach to issue tracking
- Excellent written and verbal communication abilities with English proficiency at B2+ level
Nice to have
- Familiarity with log analysis tools like Splunk, ELK, Graylog, or Loki
- Knowledge of network monitoring solutions such as Corvil
- Experience with databases such as Oracle, PostgreSQL, MySQL/MariaDB, or KDB/q
- Understanding of messaging platforms like IBM MQ, Tibco, Solace, LBM, or Kafka
- Experience with Infrastructure as Code tools such as Ansible or Terraform