Principal Site Reliability Engineer (AI Platform Architecture)

8 059 - 10 005 USD net per month - B2B
AI/ML

Białystok +9 Locations

Link Group

Full-time
B2B
Mid
Remote

Job description

Key Responsibilities:

  • Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.

  • Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.

  • Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.

  • Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.

  • Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.

  • Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.

Requirements:

  • Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.

  • Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.

  • Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.

  • Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.

  • Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.

  • A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.

Tech stack

  • Go (advanced)

  • Kubernetes (advanced)

  • Python (advanced)

  • SRE (advanced)

  • AI (advanced)

  • Machine Learning (advanced)

30 days left (until 17.05.2026)