Site Reliability Engineer (SRE)
About the Client
Our client is a premier, global investment management firm operating at the intersection of finance and technology. Known for their sophisticated, data-intensive systems, they build and maintain high-performance platforms that process massive volumes of market and operational data.
To support their expanding footprint, they are looking for a senior-level Site Reliability Engineer (SRE) who will take ownership of shaping, standardizing, and scaling their SRE frameworks and reliability culture from the ground up.
The Role
In this role, you will serve as a foundational force for SRE practices, partnering directly with Cloud, Infrastructure, and Software Engineering squads. You will work across a hybrid infrastructure (combining advanced AWS cloud environments and physical on-premises servers) to guarantee the scalability, resilience, and maximum uptime of critical, high-frequency transactional platforms.
Core Responsibilities
SRE Evangelism: Design, implement, and champion core reliability principles, helping technology teams adopt sustainable scaling practices.
Observability Architecture: Implement, scale, and maintain end-to-end monitoring, telemetry, and distributed tracing systems built on Prometheus, Grafana, Loki, Tempo, and the OpenTelemetry (OTel) framework.
Kubernetes Optimization: Establish best-practice configurations for containerized workloads, ensuring applications running on Kubernetes are highly resilient, cost-effective, and performant.
Incident Management & Culture: Participate in a balanced, shared on-call rotation (averaging one week per month).
Automation & Engineering: Build custom tooling and CI/CD pipelines to automate routine tasks, system health checks, and rapid disaster recovery workflows.
SLO/SLA Definition: Partner with product and engineering teams to define, monitor, and enforce Service Level Objectives (SLOs) and Error Budgets.
What We Look For
Experience: 5+ years of hands-on experience in a dedicated SRE, DevOps, or Infrastructure Engineering role supporting complex, distributed production systems.
Education: A Bachelor’s degree in Computer Science, Computer Engineering, or a related technical discipline (or equivalent practical experience).
Observability Expertise: Deep subject-matter knowledge of modern monitoring stacks, specifically Grafana, Prometheus, Loki, Tempo, and OpenTelemetry (OTel).
Orchestration & Containers: Strong, production-grade expertise in containerization (Docker) and orchestration (Kubernetes).
Hybrid Infrastructure: Experience working in hybrid environments, managing both cloud services (AWS preferred) and physical on-premises hardware.
Scripting/Coding: Proficiency in writing clean, maintainable code in at least one scripting or programming language (e.g., Python, Bash, or Go) to build reliable automation.
Methodologies: Solid grounding in CI/CD concepts, infrastructure-as-code (IaC), and agile development processes.
Soft Skills: Excellent verbal and written communication skills, with a proven ability to convey complex infrastructure and reliability concepts to both technical and non-technical stakeholders.
What We Offer
Stable Employment: Full-time employment contract (Umowa o Pracę - UoP).
Tax Optimization: Eligibility for tax-deductible costs for creative work (KUP - Koszty Uzyskania Przychodu).
Financial Reward: Highly competitive base salary accompanied by a generous annual performance bonus.
Comprehensive Health: Premium private medical care package that fully includes dental coverage (stomatologia).
Wellness & Lifestyle: MultiSport card to keep you active and healthy.
Daily Perks: Pre-funded lunch card for your daily meals.
Tech Stack at a Glance
Cloud & Virtualization: AWS, Kubernetes, Docker, On-Premises Hypervisors
Observability: Prometheus, Grafana, Loki, Tempo, OpenTelemetry (OTel)
Languages: Python, Go, Bash
CI/CD & Automation: Git-based pipelines, Configuration Management, IaC