DevOps Engineer

DevOps

Ogrodowa 8, Łódź

ALTER GPU CENTER

Employment type: Full-time
Contract: B2B
Seniority: Senior
Work mode: Remote

Job description

About the role

We are looking for a DevOps Engineer to build and maintain the automation, deployment, and reliability standards for large-scale GPU infrastructure that runs AI training and inference workloads.

In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to infrastructure, improving operational processes, and building reliable automation in a complex technical environment.

Responsibilities

  • Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components

  • Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling

  • Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments

  • Support reliability initiatives by defining and tracking SLIs/SLOs, automating incident response, and contributing to post-incident analysis

  • Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning

  • Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations

  • Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation

  • Identify repetitive manual work and replace it with efficient automation

  • Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations

Requirements

  • 4–7 years of experience in DevOps, SRE, Platform Engineering, or a similar role

  • Strong practical experience with infrastructure automation in complex production environments

  • Good hands-on knowledge of Terraform, Ansible, or similar Infrastructure as Code tools

  • Experience building and maintaining CI/CD pipelines and working with GitOps practices

  • Good understanding of infrastructure security, vulnerability management, and security best practices

  • Experience with security tools such as Snyk, CrowdStrike, or similar solutions

  • Practical experience with Kubernetes

  • Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing

  • Good scripting or programming skills in Python, Go, or Bash

  • Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations

  • Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry

  • Ability to work independently, prioritize tasks, and communicate effectively with technical teams

  • Communicative English (B2 or higher) is required, as you will work in an international team

Nice to have

  • Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations

  • Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers

  • Experience integrating telemetry from power, cooling, or environmental systems

  • Experience building internal platforms or self-service tools for engineering teams

  • Understanding of compliance and audit requirements in security-sensitive environments

What we offer

  • Benefits package

  • Opportunity to work on advanced infrastructure supporting large-scale AI workloads

  • Real impact on the reliability and scalability of next-generation compute environments

  • Collaboration with experienced engineers across infrastructure, platform, and AI domains

  • A fast-moving environment with space for ownership, technical input, and professional growth

About the company

We are building large-scale GPU infrastructure designed for AI training, inference, and high-performance compute workloads. Our focus is on reliability, scalability, and operational efficiency for demanding production environments.


Tech stack

  • English: B2

  • CI/CD: advanced

  • DevOps: advanced

  • Terraform: regular

  • Ansible: regular

  • Kubernetes: regular

  • Python: regular

  • Go: regular

  • Prometheus: regular

  • Grafana: regular
