Site Reliability Engineer (AI Infrastructure)
Key Responsibilities:
Building and maintaining observability for AI workloads (telemetry, dashboards, alerts, and SLO/SLI tracking), and driving improvements when targets are missed.
Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.
Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.
Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.
Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.
Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.
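The SLO/SLI tracking responsibility above typically involves computing error budgets and burn rates from request counts. A minimal sketch in Python, one of the languages the posting names; all function names, numbers, and thresholds here are illustrative, not taken from the role description:

```python
# Minimal sketch of request-based SLO error-budget tracking.
# Names and numbers are illustrative examples only.

def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for the window.

    slo_target: e.g. 0.999 for a 99.9% availability SLO.
    """
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad > 0 else 1.0
    return max(0.0, 1.0 - actual_bad / allowed_bad)

def burn_rate(slo_target: float, good_events: int, total_events: int) -> float:
    """How fast the budget is burning: 1.0 means spending exactly on pace."""
    if total_events == 0:
        return 0.0
    observed_error_rate = (total_events - good_events) / total_events
    return observed_error_rate / (1.0 - slo_target)

# Example: 99.9% SLO over 1,000,000 requests with 500 failures.
# The budget allows ~1,000 failures, so roughly half is spent.
print(error_budget_remaining(0.999, 999_500, 1_000_000))  # ~0.5
print(burn_rate(0.999, 999_500, 1_000_000))               # ~0.5
```

In practice these figures usually come from Prometheus counters rather than raw integers, and a burn rate sustained above 1.0 is what pages the on-call.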
Requirements:
Extensive operational experience in SRE, infrastructure, or platform engineering, managing large-scale distributed systems.
Expertise in Kubernetes and container orchestration at scale.
Experience defining SLOs and working with observability tools like Prometheus, Grafana, and distributed tracing to enhance system monitoring.
Proficiency in Python or Go for automation, CI/CD pipelines, and deployment safety, plus experience with infrastructure-as-code tools such as Terraform.
Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.
Ability to drive issues to resolution independently while keeping stakeholders informed and owning the outcome.
Willingness to own reliability outcomes, build automation and monitoring, and collaborate effectively with engineering teams that are new to SRE practices.
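The deployment-safety work this role calls for often reduces to an automated gate: compare a canary release against the baseline and roll back on regression. A hedged sketch in Python; the thresholds, class, and function names are hypothetical, not part of the posting:

```python
# Illustrative deployment safety gate: flag a rollback when the canary's
# error rate regresses beyond a tolerance versus the baseline release.
# All names and thresholds are hypothetical examples.

from dataclasses import dataclass

@dataclass
class ReleaseStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def should_rollback(baseline: ReleaseStats, canary: ReleaseStats,
                    tolerance: float = 0.005, min_requests: int = 1000) -> bool:
    """Return True when the canary should be rolled back.

    tolerance: absolute error-rate regression allowed over baseline.
    min_requests: below this sample size, withhold judgment.
    """
    if canary.requests < min_requests:
        return False  # not enough traffic yet to make a call
    return canary.error_rate > baseline.error_rate + tolerance

# Usage: baseline at 0.1% errors, canary at 2% -> roll back.
print(should_rollback(ReleaseStats(100_000, 100), ReleaseStats(5_000, 100)))  # True
```

A real gate would also apply a statistical test rather than a fixed tolerance, and would trigger the rollback automation rather than merely reporting a boolean.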