Principal Site Reliability Engineer (AI Platform Architecture)

7 860 - 9 757 USD net per month - B2B
AI/ML

Warszawa +9 locations

Link Group

Full-time
B2B
Mid
Remote

Job description

Key Responsibilities:

  • Defining the reliability architecture for AI compute services, including SLO frameworks, fault tolerance patterns, and advanced capacity planning models.

  • Driving hands-on development of automation and tooling that scales the SRE team's impact and eliminates operational toil.

  • Designing a comprehensive observability strategy, leveraging existing platforms to build specialized telemetry and GPU-specific monitoring for AI workloads.

  • Architecting deployment safety standards, including progressive rollouts, canary analysis, and automated rollback processes.

  • Embedding reliability into the development lifecycle by influencing product engineering architecture and high-level design decisions.

  • Mentoring and elevating the SRE team through design reviews, code reviews, and hands-on problem-solving.

Requirements:

  • Extensive experience in SRE or platform engineering, with a proven track record of impact at a principal or staff level.

  • Deep expertise in Kubernetes, specifically in managing autoscaling, resource scheduling, and orchestration for compute-intensive workloads.

  • Advanced programming expertise in Python or Go, with experience building production-grade automation and platform services.

  • Proven ability to influence cross-team technical decisions and elevate technical standards across engineering departments.

  • Experience or strong technical interest in AI/ML infrastructure, model deployment, and GPU workload optimization.

  • A system-level approach to designing reliability into innovative platforms while building strong partnerships with product engineering teams.

Tech stack

  • Machine Learning: advanced

  • AI: advanced

  • Go: advanced

  • Kubernetes: advanced

  • Python: advanced

  • SRE: advanced

Applications close on 16.07.2026.