Senior Site Reliability Engineer (Remote)
What’s in store for you:
You’ll solve complex challenges and maintain our own infrastructure, which carries 60PB+ of data traffic each month. Its scale and maturity in numbers:
- 6PB+ Ceph storage
- 60PB+ monthly data traffic through our systems
- 300k+ service requests/sec processed
- 500k+ Kafka messages/sec streamed
Your day-to-day:
Own and evolve Webshare's production infrastructure - lead the migration from Docker Swarm to Kubernetes (or hybrid K8s + Ansible).
Maintain high availability across hundreds of servers and ~50 services.
Drive observability in cooperation with the development team.
Establish and enforce IaC practices, CI/CD pipeline reliability, and change management processes.
Participate in the on-call rotation alongside backend developers.
Respond to and lead incident resolution, run post-mortems, and drive systematic remediation.
Contribute platform tooling that improves developer experience and reduces infrastructure toil.
Keep backend engineers informed and capable - no silos, shared infrastructure ownership.
Your skills & experiences:
Have built and operated highly available infrastructure at a comparable scale - hundreds of servers, dozens of services, real production load.
Hands-on K8s in self-hosted / bare-metal environments.
Confident with Infrastructure as Code.
Have owned CI/CD pipelines end-to-end (GitLab CI or equivalent).
Have been on call in a production environment.
Proactive: you surface problems before being asked and keep the team informed without prompting.
Scripting and development skills.
Nice to have:
Led at least one major infrastructure migration - planned, executed, and stabilised it.
Python and/or Go familiarity - backend is Python, edge services are Go.
Exposure to proxy and networking-heavy infrastructure.
Previous experience in a small team where developers shared infrastructure responsibility.
Familiarity with edge clusters or split compute/edge architectures.
Please note that only selected candidates will be contacted for further steps.
We would greatly appreciate it if you shared your LinkedIn profile with your application.
Up for the challenge? Let’s talk!