Engineering Manager, SRE - Observability

Other

Marii Konopnickiej 29, Kraków

Zendesk

Full-time

Permanent

Manager / C-level

Hybrid

8 188.50 - 10 099.15 USD

Gross per month - Permanent

Job description

As an Engineering Manager specializing in Observability, you will lead and scale a highly skilled team responsible for architecting, building, and evolving enterprise-grade monitoring, alerting, and incident response systems. Leveraging your deep expertise with observability tools such as Datadog, Grafana, Loki, and others, you will drive our transformation from reactive firefighting to proactive reliability engineering at scale. Your mission is to empower engineering teams by providing the right visibility and tooling to ensure system health, availability, and performance.

You will collaborate closely with Product Management and Technical Leads to define and execute a strategic roadmap that addresses the challenges of monitoring complex, large-scale distributed systems in a cloud-native environment. This role demands a hands-on engineering leader who understands the nuances of telemetry data, visualization, alerting reliability, and cost-efficient observability architectures in enterprise settings.

What You’ll Be Doing

Recruit, mentor, and retain top engineering talent specialized in observability and reliability engineering.
Directly contribute to the design and implementation of observability solutions alongside your team, maintaining a high bar for technical excellence.
Own and evolve the end-to-end observability stack and operational processes, including metrics, traces, logs, dashboards, and alerting.
Partner with SRE, DevOps, and platform teams to integrate and extend observability tooling across diverse services running at large scale.
Lead roadmap planning for observability infrastructure and tooling in partnership with Product and Engineering leadership.
Establish best practices for instrumentation, data collection, alerting thresholds, and incident response workflows to elevate the organization's reliability posture.
Identify gaps and weaknesses in monitoring coverage and performance; proactively drive improvements and automation.
Collaborate cross-functionally with teams across the enterprise to influence observability adoption, standardization, and innovation.
Foster a culture of continuous learning, high team engagement, and technical craftsmanship within your team.
Communicate technical strategy, progress, risks, and impact effectively with stakeholders at all levels.

What You Bring to the Role

Deep hands-on experience with commercial and open-source observability tools, including Datadog, Grafana, Loki, and related telemetry technologies.
Proven track record managing observability or SRE teams within large, complex enterprise environments.
Strong understanding of distributed systems, cloud-native architectures (Kubernetes, AWS), and how observability fits into scalable operations.
Ability to provide technical leadership while actively contributing to engineering solutions and troubleshooting.
Expertise in designing scalable, reliable telemetry pipelines and intelligent alerting to reduce alert noise and incident toil.
Demonstrated skill in building and improving observability platforms that serve multiple engineering teams and business units.
Effective communicator and collaborator, able to bridge engineering, product, and business stakeholders.
Commitment to developing team members through coaching, feedback, and career growth opportunities.
Experience driving cultural change in organizations toward proactive reliability engineering and data-driven decision-making.

Required

3+ years of people management experience leading engineering teams.
Deep domain expertise in observability, with hands-on experience in tools such as Datadog, Grafana, and Loki.
Significant experience working in or managing engineering teams within large-scale enterprise companies.
Proven ability to hire, mentor, and retain high-performing engineers.
Strong collaboration skills to influence cross-functional teams in large engineering organizations.
Experience with distributed systems and cloud environments (AWS, Kubernetes).

Preferred

Background leading observability-focused teams.
Hands-on experience operating telemetry systems for large-scale Kubernetes and AWS workloads.
Passion for innovation, continuous learning, and championing a growth mindset.
Experience managing geographically distributed teams.

Our Tech Environment

Primarily AWS cloud infrastructure with Kubernetes orchestration.
Codebase spans Ruby, Go, and Python.
Data storage and streaming include Amazon Aurora (MySQL), S3, and Kafka.
Observability responsibilities include balancing operational maintenance, tooling innovation, and incident support.

Tech stack

English

Observability: master

People Management: regular

Site Reliability Engineer (SRE): regular

Published: 30.12.2025

About the company

Zendesk

Zendesk is redefining customer and employee experience. Our AI-powered solutions help over 100,000 companies build better relationships and grow. We push boundaries of what’s possible and create tech that brings people c...
