Principal Site Reliability Engineer (AI Platform Architecture)
Key Responsibilities:
Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.
Requirements:
Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
A system-level approach to designing reliability into innovative platforms, combined with the ability to build strong partnerships with product engineering teams.