Lead DevOps Engineer

DevOps

Ogrodowa 8, Łódź

ALTER GPU CENTER

Full-time
B2B
Senior
Remote

Job description

About the role

We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering practices supporting large-scale GPU infrastructure used for AI training and inference workloads.

This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments.

Responsibilities

  • Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms

  • Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components

  • Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling

  • Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments

  • Define and track SLIs/SLOs, improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements

  • Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations

  • Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning

  • Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation

  • Identify operational inefficiencies and reduce repetitive manual work through automation

  • Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

  • 8+ years of experience in DevOps, SRE, Platform Engineering, or a similar area

  • At least 3 years of experience in a technical lead, lead engineer, or team leadership role

  • Strong practical experience with infrastructure automation in large-scale or complex production environments

  • Very good knowledge of Terraform, Ansible, Pulumi, Crossplane, or similar Infrastructure as Code tools

  • Experience with GitOps, configuration management, and CI/CD practices

  • Hands-on experience with Kubernetes

  • Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing

  • Good scripting or programming skills in Python, Go, or Bash

  • Experience with bare-metal provisioning, infrastructure automation, or data center environments

  • Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry

  • Good understanding of distributed systems reliability and production incident management

  • Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage

  • Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders

  • English at a communicative level or higher is required, as you will be working in an international team

Nice to have

  • Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations

  • Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers

  • Experience integrating telemetry from power, cooling, or environmental systems

  • Experience building internal platforms or self-service tools for engineering or research teams

  • Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments

What we offer

  • Benefits package

  • Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads

  • Real impact on the scalability, reliability, and operational standards of next-generation compute environments

  • Collaboration with experienced engineers across infrastructure, platform, and AI domains

  • A dynamic environment with space for ownership, technical leadership, and professional growth

Tech stack

  • English: B2

  • DevOps: master

  • CI/CD: master

  • Terraform: advanced

  • Kubernetes: advanced

  • Python: regular

  • Go: regular

  • Bash: regular

  • Prometheus: regular

  • Grafana: regular

  • Leadership: regular

By applying, I consent to the processing of my personal data for the purpose of conducting the recruitment process. Please be informed that the data controller is ALTER GPU CENTER (hereinafter "controller"). You have the right to request access to yo...