Site Reliability Engineer (AI Infrastructure)

6 948 - 8 337 USD net per month - B2B
AI/ML

Gdańsk +9 locations

Link Group

Full-time
B2B
Mid
Remote

Job description

Key Responsibilities:

  • Building and maintaining observability for AI workloads (telemetry, dashboards, alerts, and SLO/SLI tracking), and driving improvements when targets are missed.

  • Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response.

  • Integrating AI workloads into existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems.

  • Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation.

  • Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases.

  • Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure.

Requirements:

  • Expertise in SRE, infrastructure, or platform engineering, with extensive operational experience managing large-scale distributed systems.

  • Expertise in Kubernetes and large-scale containerization systems.

  • Experience defining SLOs and working with observability tools such as Prometheus and Grafana, as well as distributed tracing.

  • Proficiency in Python or Go for automation, CI/CD pipelines, and deployment safety, along with infrastructure-as-code tools such as Terraform.

  • Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads.

  • Ability to resolve issues independently while maintaining accountability throughout the process.

  • Ownership of reliability: developing automation and monitoring, and collaborating effectively with engineering teams unfamiliar with SRE practices.

Tech stack

  • CI/CD: advanced
  • SRE: advanced
  • Kubernetes: advanced
  • Python: advanced
  • Go: advanced
