We are seeking an experienced Site Reliability Engineer (SRE) to help build, harden, and scale our CDN and Web Application Firewall (WAF) product. You will be responsible for implementing new features, optimizing performance, and improving security controls across our edge stack. This role requires deep hands-on expertise in Nginx/OpenResty, Lua, C/FFI module development, eBPF, Linux networking and infrastructure-as-code tooling.
responsibilities
Design, implement and ship features and improvements for our CDN and WAF edge stack
Develop and maintain Lua code running in Nginx/OpenResty and build high-performance C modules or FFI bindings where needed
Implement packet- and kernel-level observability or filtering using eBPF (including XDP/eBPF tracing for telemetry and enforcement)
Tune and troubleshoot high-volume Nginx deployments for latency, throughput and memory usage
Define, author and maintain WAF rule logic, request/response inspection and mitigation workflows
Build automation for deployment and configuration using Infrastructure-as-Code (Ansible, Puppet, Terraform, or similar)
Work with networking protocols and operational requirements of a CDN: BGP, anycast, TCP/IP stack, load balancing, connection handling
Create and run performance/load tests, fuzzing and security tests; profile and optimize hotspots
Produce clear design documentation and runbooks, and hand over completed work to operations. Participate in code reviews and mentor engineers
Collaborate with product, security and SRE teams to align feature work with product goals and SLAs
requirements
5+ years of production experience in Linux systems engineering, networked services or edge infrastructure
Strong hands-on experience with Nginx and Lua (ngx_lua/OpenResty), including writing Lua modules for request processing
Familiarity with, or experience building, native C modules or FFI bindings used by Nginx/Lua; comfortable with libc, POSIX APIs and building/packaging C extensions
Practical experience with eBPF (tools, BCC/libbpf, XDP) for telemetry, filtering or tracing
Deep knowledge of networking and TCP/IP internals, load balancing, and CDN operational patterns. Familiarity with BGP and anycast is a plus
Experience with web application firewalls (appliance, service or software) and their capabilities
Solid Linux kernel and userland troubleshooting skills: perf, tcpdump/Wireshark, strace, systemtap
Experience with Infrastructure-as-Code and configuration management (Ansible, Puppet, Chef, Terraform or similar)
Experience deploying and maintaining WAF rulesets and policies; understanding of OWASP top risks and typical web attack patterns
Experience with testing and benchmarking tools (wrk, ab, locust, etc.) and CI/CD pipelines
Excellent communication skills; able to work independently and collaborate effectively with distributed teams
English level B1+ for effective communication
nice to have
Prior experience building or operating CDNs or edge platforms
Familiarity with web security tooling, such as ModSecurity, or other WAF platforms
Experience with container workflows and edge deployment (e.g., Docker, HashiCorp Nomad)
Exposure to cloud providers’ networking (AWS/GCP/Azure) and hybrid edge deployments
Familiarity with observability stacks: Prometheus, Grafana, ELK/EFK
Experience in cross-compiling or packaging modules for multiple Linux distributions
Experience programming in Lua and C, with familiarity using LuaJIT and the Lua FFI for native, high-performance integrations
We are looking for a skilled Cloud Engineering Manager to drive the strategic direction of cloud infrastructure and DevOps practices within our organization. This position requires a blend of leadership, technical expertise, and operational excellence to oversee high-performing teams and deliver scalable, reliable, and resilient systems for mission-critical applications.
responsibilities
Oversee the design and implementation of infrastructure and product monitoring systems
Define and utilize SLI/SLO standards to strengthen reliability and system performance
Conduct root cause analysis to identify and address system inefficiencies
Facilitate postmortem analyses and drills to optimize incident response strategies
Evaluate product performance, scalability, and reliability to ensure operational excellence
Automate operational tasks to optimize workflows and improve productivity
Deploy CI/CD pipelines and champion modern DevOps practices
Manage cloud infrastructure and configuration through Infrastructure-as-Code tools
Collaborate with cross-functional teams to align cloud strategies with business goals
Support the growth of engineers through mentoring and professional development initiatives
Plan and execute staffing strategies to build a scalable and efficient engineering team
requirements
Minimum of 7 years of experience in cloud engineering, DevOps, or SRE
Expertise in scripting languages such as Python, Go, Bash, or PowerShell
Proficiency in observability tools including Prometheus, Grafana, DataDog, and ELK
Background in cloud infrastructure management tools like Terraform and cloud-specific CLI tools (gcloud, az, aws)
Skills in configuration management tools such as Ansible
Knowledge of CI/CD platforms including Jenkins (Groovy SDK, Jenkinsfile), GitLab-CI, or Azure DevOps
Expertise in containerization technologies such as Docker and Kubernetes
Strong ability to analyze incidents and develop strategies for improvement
Capability to design scalable, cloud-native solutions in alignment with business objectives
nice to have
Familiarity with hybrid cloud environments and multi-cloud architectures
Background in integrating machine learning workloads into cloud ecosystems
Demonstrated experience working with observability stacks tailored for microservices architecture
Understanding of serverless cloud platforms and associated workflows
We are seeking a highly skilled and motivated Senior Site Reliability Engineer to be a key member of our team, driving operational excellence and improving the reliability, scalability, and performance of our infrastructure and product services.
responsibilities
Provide L3 on-call support, ensuring rapid response to incidents
Define and implement effective SLI/SLO metrics for product monitoring
Perform detailed root cause analysis to resolve critical issues
Conduct postmortems and organize drills to improve readiness
Analyze product performance, scalability, and reliability to optimize service delivery
Automate operational tasks to reduce manual intervention
Implement CI/CD pipelines using tools like Jenkins, GitLab-CI, or Azure DevOps
Manage cloud infrastructure and configurations to support Infrastructure-as-Code initiatives
Utilize configuration management tools such as Ansible to maintain consistency across environments
Collaborate closely with cross-product teams and business stakeholders to align reliability goals with project objectives
requirements
5+ years of experience working in Site Reliability Engineering or similar roles
Intermediate knowledge of scripting languages such as Python, Go, Bash, or PowerShell
Solid knowledge of cloud platforms, including AWS, Azure, or GCP
Familiarity with observability tools such as Prometheus, Grafana, DataDog, ELK, or Zabbix
Expertise in cloud infrastructure management tools, including Terraform and one of the cloud CLIs (gcloud, az, aws)
Proficiency in containerization technologies like Docker and Kubernetes (K8s)
Capability to define and monitor SLI/SLO metrics for system reliability
Thorough understanding of postmortem and drill procedures to enhance incident handling processes
B2-level English proficiency in both speaking and writing
nice to have
Demonstrated experience implementing CI/CD pipelines using Groovy SDK or Jenkinsfile
Background in working with large-scale production systems requiring high availability
Familiarity with advanced monitoring practices using tools such as Dynatrace
Skills in scaling Kubernetes clusters and optimizing containerized applications
Flexibility to use diverse scripting languages to automate complex workflows
We are seeking a skilled and experienced Lead Site Reliability Engineer to join our dynamic team, ensuring the performance, scalability, and reliability of our production systems and infrastructure. If you're a proactive problem-solver with a strong background in monitoring, automation, and cloud technologies, we want to hear from you.
responsibilities
Provide L3 on-call support as needed
Design and develop monitoring systems for infrastructure and products
Define and implement SLI/SLOs for system reliability tracking
Conduct thorough root cause analyses for incidents
Lead postmortem procedures and drills for continuous improvement
Analyze product performance, scalability, and reliability
Automate operational tasks to enhance efficiency
Implement and manage CI/CD pipelines following "as-Code" practices
Oversee cloud infrastructure and configuration management using Infrastructure-as-Code principles
Collaborate closely with cross-product teams and business stakeholders to align reliability objectives
requirements
5+ years of relevant experience, including 1 year in a leadership role
Advanced knowledge of scripting languages such as Python, Go, Bash, or PowerShell
Expertise in any major cloud platform (AWS, GCP, or Azure)
Proficient in optimizing monitoring and logging tools like DataDog, Dynatrace, Prometheus, Grafana, Zabbix, or ELK
Capability to manage cloud infrastructure using tools like Terraform and command-line interfaces (gcloud, az, aws)
Competency in configuration management using Ansible
Background in CI/CD toolchains such as Jenkins (Groovy SDK, Jenkinsfile), GitLab-CI, or Azure DevOps
Understanding of containerization technologies such as Docker and Kubernetes
Exceptional troubleshooting and problem-solving abilities, including reconstructing incident conditions and flows based on root cause analysis
B2-level English proficiency, both in speaking and writing
nice to have
Familiarity with multiple cloud-native monitoring tools
Demonstrated experience leading cross-functional team collaborations
We’re seeking a skilled DevOps/SRE with extensive expertise in designing, implementing, and maintaining observability platforms to ensure system reliability, performance, and scalability. As a vital member of our SRE team, you will promote the adoption of observability best practices, fostering proactive monitoring, swift incident resolution, and continuous enhancements to our software products and infrastructure. This role emphasizes creating and refining observability solutions—including metrics, logs, and traces—to provide actionable insights into system health and performance. You'll also advance automation for deployment pipelines, oversee applications across various environments, and ensure our systems meet rigorous reliability and availability expectations. Collaboration will be essential as you engage closely with development teams to integrate observability into the software lifecycle, equipping them with the tools and practices for efficient debugging and iteration.
responsibilities
Architect and implement observability platforms using tools like Prometheus, Grafana, and OpenTelemetry to support our Next.js frontend and accompanying systems
Design and maintain automated deployment pipelines focused on reliability, observability, and zero-downtime updates across multiple environments
Collaborate with development teams to integrate observability into local workflows for accelerated debugging and iteration
Optimize infrastructure and tools for scalability, fault tolerance, and performance with the aim of reducing mean time to detection (MTTD) and resolution (MTTR)
Mentor team members in SRE practices, including observability-driven development, incident management, and post-mortem analyses
requirements
Proficiency in scripting languages like Python for automation and observability tools
Expertise in observability frameworks (e.g., Prometheus, Grafana, Loki, Jaeger) and logging solutions (e.g., ELK stack, Fluentd)
Background in containerization technologies (e.g., Docker) and orchestration platforms (e.g., Kubernetes, AWS ECS)
Knowledge of infrastructure as code tools (e.g., Terraform, Ansible) to provision and manage observable systems
Familiarity with version control systems, especially Git, and integrating observability into CI/CD pipelines (e.g., Jenkins, GitHub Actions)
Capability to define and measure service-level indicators (SLIs), objectives (SLOs), and error budgets to ensure system reliability
Competency in fostering collaboration and communication, with a strong commitment to nurturing a blameless culture of improvement
nice to have
Proficiency in Polish language
Proficiency in programming languages as applied to SRE, DevOps, or observability contexts
Familiarity with cloud platforms, such as AWS, with a focus on observability services (e.g., CloudWatch, X-Ray)
Understanding of distributed systems, chaos engineering, or security practices in observable environments
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. In this critical role, you will collaborate closely with software developers and operations teams to ensure high reliability, scalability, and efficiency of our systems, with a strong focus on meeting and exceeding customer expectations. Your expertise will be crucial in deploying, maintaining, and automating our infrastructure and application environments to ensure seamless user experiences. Your proactive involvement will be key to enhancing system reliability, optimizing resource utilization, and ensuring continuous improvement in our operational practices. Your responsibilities will include defining and tracking Service Level Objectives (SLOs), managing error budgets, and reducing toil through automation. You will play a pivotal role in driving the success of technology initiatives, maximizing their impact across the organization, and ensuring that solutions consistently meet the high standards our customers expect.
responsibilities
Collaborate with development, security, quality, and operations teams to implement SRE practices and ensure system reliability
Define and support the required level of reliability, availability, and performance for services and applications
Design and deliver Cloud-based solutions tailored to client needs
Troubleshoot, mitigate, and help fix infrastructure and application issues in a timely manner
Implement a monitoring system for the infrastructure and application reliability
Communicate technical concepts clearly to both engineering teams and management stakeholders
requirements
Bachelor’s degree in Computer Science, Engineering, or a related field
Proven experience with any major cloud platform (AWS, GCP, or Azure)
Experience implementing SRE practices such as SLOs/SLIs, error budgets, postmortems, toil reduction, capacity planning, and incident management
Proficiency in Python or another scripting/programming language
Strong background in monitoring tools
Proficiency in CI/CD tools, infrastructure as code, and configuration management
Solid knowledge of container and orchestration technologies (Docker, Kubernetes)
nice to have
Expertise in deploying and managing LLMs, including technologies like RAG
Certification in Kubernetes, AWS/GCP/Azure, or similar technologies
Proven experience in DevOps
Knowledge of managing and optimizing AI/ML models in production environments, including basic deployment, monitoring, and maintenance
We are seeking a highly skilled Lead Site Reliability Engineer to join our team in driving system reliability, scalability, and performance in complex cloud and containerized environments. This is a unique opportunity to lead critical infrastructure initiatives, foster operational excellence, and collaborate across teams to achieve business objectives.
responsibilities
Design comprehensive monitoring and logging systems using tools like DataDog, Dynatrace, Prometheus, Grafana, Zabbix, and ELK to ensure robust observability
Define and manage SLIs and SLOs to measure and enhance system performance, reliability, and scalability
Lead root cause analysis during incident responses, ensure detailed postmortem evaluations, and develop long-term preventive strategies
Implement infrastructure as code (IaC) using Terraform and cloud CLI (AWS, Azure, GCP) for streamlined management and consistency
Automate workflows and CI/CD pipelines leveraging tools such as Jenkins (Groovy SDK), GitLab CI, and Azure DevOps
Manage containerized environments with expertise in Docker and Kubernetes orchestration for seamless application deployment
Collaborate with engineering and DevOps teams to standardize observability practices and proactively address issues before they escalate
Lead and facilitate post-incident reviews and operational drills to identify areas for improvement and increase system resilience
Provide optional on-call support for rapid issue resolution and the maintenance of system stability
requirements
Residence in Ukraine, with remote work eligibility limited to candidates based within the country
Advanced proficiency in scripting automations with Python, Go, Bash, or PowerShell
Strong knowledge of monitoring systems and tools like Prometheus, Grafana, DataDog, Dynatrace, Zabbix, or ELK
Experience with cloud platforms (AWS, Azure, or GCP) and expertise in IaC with Terraform
Solid understanding of configuration management systems like Ansible
Background in automating CI/CD pipelines and delivery lifecycles using Jenkins, GitLab CI, and Azure DevOps
Practical experience deploying and orchestrating applications in Docker and Kubernetes environments
Exceptional problem-solving capability for incident reconstruction and identifying root causes
Proven track record in leading post-incident reviews and operational improvement exercises
Strong collaboration skills to work effectively with engineering teams and stakeholders to maintain reliability and performance
English level B2 or higher
nice to have
Knowledge of advanced security and compliance strategies in observable environments
Familiarity with chaos engineering approaches for resilience and fault tolerance testing
Experience integrating observability into development workflows to accelerate issue resolution
Familiarity with additional cloud monitoring services like AWS CloudWatch, Azure Monitor, or GCP Operations Suite
We are searching for a Senior Service Integration Engineer to collaborate on a long-term assignment (minimum 12 months) within the Store and Supply Chain domain. This role is integral to improving the integration and visibility of distributed IT systems in a complex hybrid cloud and on-premise environment, enabling seamless data flows and robust end-to-end system control for our clients.
responsibilities
Design and implement service integrations across cloud and on-premise environments
Improve data integration and flow across distributed systems
Work with API management solutions and REST-based services (e.g., ServiceNow)
Integrate process management tools to support system alignment
Contribute to the integration of SaaS platforms (e.g., ServiceNow)
Drive observability and monitoring efforts (e.g., Grafana) to improve system visibility
Help establish a clear, end-to-end view of environments and their dependencies
requirements
3+ years of relevant experience
Solid understanding of API management and REST services
Kotlin experience (Java is acceptable)
Experience working with distributed data sources and complex system landscapes
Experience with SaaS platforms (e.g., ServiceNow or similar)
Knowledge of cloud technologies (preferably Azure, but not limited to it)
Strong observability mindset, with experience using tools like Grafana or Prometheus
Ability to work independently and bring structure to complex environments
Proactive and solution-oriented mindset
Excellent command of written and spoken English (B2+ level)
nice to have
Experience with ServiceNow integrations
Exposure to both cloud and on-premise architecture
Knowledge of IT infrastructure and systems interplay
SRE mindset with a strong focus on reliability and performance optimization
We are seeking a highly skilled and motivated Lead Site Reliability Engineer to oversee the reliability, scalability, and security of our cloud-native identity and profile management platform, enabling personalized experiences across various digital touchpoints.
responsibilities
Ensure system reliability, availability, and performance
Automate infrastructure and operational processes using IaC tools like Terraform and CDK (TypeScript)
Develop and maintain CI/CD pipelines using Jenkins and GitHub Actions
Set up and enhance observability with Prometheus, Grafana, and OpenSearch
Define and monitor SLOs, SLIs, and Error Budgets
Lead incident response, perform root cause analysis, and drive post-mortem reviews
Support Kubernetes deployments and manage Helm charts
Drive scalability and capacity planning efforts
Optimize cloud infrastructure costs while maintaining performance
Ensure security and compliance across systems
Provide documentation and mentorship to foster team growth
Participate in a 24/7 on-call support rotation, estimated at one week per month
requirements
5+ years of experience in Site Reliability Engineering or related fields
Knowledge of AWS, including Serverless (Lambda, Step Function, EventBridge), IAM, CloudWatch, and Networking Services
Proficiency in IaC tools such as Terraform and CDK (TypeScript)
Expertise in CI/CD tools like Jenkins and GitHub Actions
Competency in monitoring and observability tools such as Prometheus, Grafana, and OpenSearch
Background in Kubernetes and Helm for container orchestration
Capability to lead incident management and operational excellence
Strong communication skills and fluency in English (B2+)
nice to have
Familiarity with GitHub Actions for building CI/CD pipelines
Understanding of Helm chart management
Skills in TypeScript