Site Reliability Engineer (AI Infrastructure)

6 775 - 8 131 USD net per month - B2B
AI/ML

Warszawa (+9 locations)

Link Group

Full-time
B2B
Mid
Remote

Job description

Key Responsibilities:

  • Building and maintaining observability for AI workloads (telemetry, dashboards, alerts, and SLO/SLI tracking) and driving improvements when targets are missed.

  • Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.

  • Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.

  • Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.

  • Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.

  • Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.

Requirements:

  • Expertise in SRE, infrastructure, or platform engineering, with extensive operational experience managing large-scale distributed systems.

  • Expertise in Kubernetes and large-scale containerization systems.

  • Experience defining SLOs and working with observability tools such as Prometheus, Grafana, and distributed tracing.

  • Proficiency in Python or Go for automation, CI/CD pipelines, and deployment safety, plus experience with infrastructure-as-code tools such as Terraform.

  • Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.

  • Ability to resolve issues independently while maintaining accountability throughout the process.

  • Accountability for reliability, including developing automation and monitoring and collaborating effectively with engineering teams unfamiliar with SRE practices.

Tech stack

  • CI/CD: advanced

  • Go: advanced

  • Kubernetes: advanced

  • Python: advanced

  • SRE: advanced

25 days left to apply (until 16.07.2026)