Senior Site Reliability Engineer

DevOps

Towarowa 28, Warszawa

DCG

Full-time

B2B

Senior

Remote

Job description

As a recruitment company, DCG understands that every business is powered by experienced professionals. Our management style and partnership approach enable us to meet your needs and provide continuous support. Due to our ongoing growth and the large number of recruitment projects we undertake for our partners, we are currently looking for a Senior Site Reliability Engineer.

Responsibilities:

Building and maintaining a central operational "control tower" for AI applications and pipelines
Designing and implementing monitoring, alerts, and dashboards (signals, thresholds, routing, runbooks)
Incident response: triage, coordination, root cause analysis, post-mortems, and preventive measures
Standardization of pipeline telemetry (success/failure, latency, throughput, bottlenecks)
CI/CD optimization – release quality, automated testing, reliability gates
Collaboration with engineering teams to reduce the number of recurring incidents

Requirements:

Proactive and self-driven – identifies problems, risks, and opportunities for improvement on their own; doesn't wait for detailed instructions
Engaged owner mindset – treats system stability as their end-to-end responsibility
Hands-on engineer – regularly works with clusters, pipelines, monitoring, and code
AI-native – uses AI tools extensively on a daily basis (Copilot, LLMs, automation, analytics, debugging, documentation) and understands how AI impacts system design and maintenance
Comfortable working in a dynamic environment with processes that are not yet fully mature
Experience with Azure DevOps (Boards, Repos, Pipelines)
Strong knowledge of Kubernetes, including troubleshooting, scaling, and production operations
Proficiency in Datadog (metrics, logs, dashboards, alerting)
Experience with Azure Portal for environment operations and configuration
Strong knowledge of CI/CD practices, including pipeline optimization, testing, and quality gates
5+ years of experience as an SRE / Production / Platform Engineer
Proven experience in production environments
Strong knowledge of incident management and root cause analysis (RCA)
Ability to build practical, rather than theoretical, monitoring systems
Very good command of English, both spoken and written

Nice to have:

Experience with Grafana
Experience with AI/LLM pipelines and their observability
Building multi-app monitoring platforms
Working in scaled Kubernetes environments (AKS or similar)

Offer:

Private medical care
Co-financed sports card
Training & learning opportunities
Ongoing support from a dedicated consultant
Employee referral program

Tech stack

English
Datadog (advanced)
CI/CD (advanced)
AI (advanced)
Grafana (advanced)
Kubernetes (advanced)
Copilot (regular)

About the company

DCG

DCG is a space where business needs and people's ambitions meet. We know the value of a well-matched collaboration, which is why we help candidates find an environment where they can spread their wings, and we help companies...
