Principal Site Reliability Engineer (AI Platform Architecture)

AI/ML

Olsztyn +9 locations

Link Group

Full-time

B2B

Mid

Remote

8 059 - 10 005 USD net per month - B2B

Job description

Key Responsibilities:

Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.
Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.
Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.
Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.
Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.
Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.

Requirements:

Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.
Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.
Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.
Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.
Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.
A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.

Tech stack

Machine Learning: advanced
AI: advanced
Go: advanced
Kubernetes: advanced
Python: advanced
SRE: advanced

