Senior Site Reliability Engineer
As a recruitment company, DCG understands that every business is powered by experienced professionals. Our management style and partnership approach enable us to meet your needs and provide continuous support. Due to our ongoing growth and the large number of recruitment projects we undertake for our partners, we are currently looking for a Senior Site Reliability Engineer.
Responsibilities:
Building and maintaining a central operational "control tower" for AI applications and pipelines
Designing and implementing monitoring, alerts, and dashboards (signals, thresholds, routing, runbooks)
Incident response: triage, coordination, root cause analysis, post-mortems, and preventive measures
Standardization of pipeline telemetry (success/failure, latency, throughput, bottlenecks)
CI/CD optimization – release quality, automated testing, reliability gates
Collaboration with engineering teams to reduce the number of recurring incidents
Requirements:
Proactive and self-driven – identifies problems, risks, and opportunities for improvement on their own; doesn't wait for detailed instructions
Engaged owner mindset – treats system stability as their end-to-end responsibility
Hands-on engineer – regularly works with clusters, pipelines, monitoring, and code
AI-native – uses AI tools extensively on a daily basis (Copilot, LLMs, automation, analytics, debugging, documentation) and understands how AI impacts system design and maintenance
Comfortable working in a dynamic environment with processes that are not yet fully mature
Experience with Azure DevOps (Boards, Repos, Pipelines)
Strong knowledge of Kubernetes, including troubleshooting, scaling, and production operations
Proficiency in Datadog (metrics, logs, dashboards, alerting)
Experience with Azure Portal for environment operations and configuration
Strong knowledge of CI/CD practices, including pipeline optimization, testing, and quality gates
5+ years of experience as an SRE / Production / Platform Engineer
Proven experience in production environments
Strong knowledge of incident management and root cause analysis (RCA)
Ability to build practical, rather than theoretical, monitoring systems
Very good command of English, both spoken and written
Nice to have:
Experience with Grafana
Experience with AI/LLM pipelines and their observability
Building multi-app monitoring platforms
Working in scaled Kubernetes environments (AKS or similar)
Offer:
Private medical care
Co-financed sports card
Training & learning opportunities
Ongoing support from a dedicated consultant
Employee referral program

DCG
DCG is a space where business needs and people's ambitions meet. We know the value of a well-matched collaboration, which is why we help candidates find an environment where they can spread their wings, and companies - ...