Site Reliability Engineer (AI Infrastructure)
Key Responsibilities:
Building and maintaining observability for AI workloads (telemetry, dashboards, alerts, and SLO/SLI tracking), and driving improvements when targets are missed.
Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.
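The SLO/SLI tracking responsibility above typically involves computing error budgets and burn rates from request counts. A minimal sketch in Python, one of the languages the posting names; all function names, numbers, and thresholds here are illustrative, not taken from the role description:

```python
# Minimal sketch of request-based SLO error-budget tracking.
# Names and numbers are illustrative examples only.

def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for the window.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    """
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad > 0 else 1.0
    return max(0.0, 1.0 - actual_bad / allowed_bad)

def burn_rate(slo_target: float, good_events: int, total_events: int) -> float:
    """How fast the budget is burning: 1.0 means spending exactly on pace."""
    if total_events == 0:
        return 0.0
    observed_error_rate = (total_events - good_events) / total_events
    return observed_error_rate / (1.0 - slo_target)

# Example: 99.9% SLO over 1,000,000 requests with 500 failures.
# The budget allows ~1,000 failures, so roughly half is spent.
print(error_budget_remaining(0.999, 999_500, 1_000_000))  # ~0.5
print(burn_rate(0.999, 999_500, 1_000_000))               # ~0.5
```

In practice these figures usually come from Prometheus counters rather than raw integers, and a burn rate sustained above 1.0 is what pages the on-call.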
Requirements:
Extensive operational experience in SRE, infrastructure, or platform engineering, managing large-scale distributed systems.
Expertise in Kubernetes and container orchestration at scale.
Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
Proficiency in Python or Go for automation, CI/CD pipelines, and deployment safety, plus experience with infrastructure-as-code tools such as Terraform.
Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
Ability to drive issues to resolution independently while keeping stakeholders informed and owning the outcome.
Willingness to own reliability outcomes, build automation and monitoring, and collaborate effectively with engineering teams that are new to SRE practices.
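The deployment-safety work this role calls for often reduces to an automated gate: compare a canary release against the baseline and roll back on regression. A hedged sketch in Python; the thresholds, class, and function names are hypothetical, not part of the posting:

```python
# Illustrative deployment safety gate: flag a rollback when the canary's
# error rate regresses beyond a tolerance versus the baseline release.
# All names and thresholds are hypothetical examples.

from dataclasses import dataclass

@dataclass
class ReleaseStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def should_rollback(baseline: ReleaseStats, canary: ReleaseStats,
                    tolerance: float = 0.005, min_requests: int = 1000) -> bool:
    """Return True when the canary should be rolled back.

    tolerance: absolute error-rate regression allowed over baseline.
    min_requests: below this sample size, withhold judgment.
    """
    if canary.requests < min_requests:
        return False  # not enough traffic yet to make a call
    return canary.error_rate > baseline.error_rate + tolerance

# Usage: baseline at 0.1% errors, canary at 2% -> roll back.
print(should_rollback(ReleaseStats(100_000, 100), ReleaseStats(5_000, 100)))  # True
```

A real gate would also apply a statistical test rather than a fixed tolerance, and would trigger the rollback automation rather than merely reporting a boolean.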