DevOps Engineer
About the role
We are looking for a DevOps Engineer to help build and operate automation, deployment, and reliability standards for large-scale GPU infrastructure used for AI training and inference workloads.
In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to the infrastructure, improving operational processes, and building reliable automation in a complex technical environment.
Responsibilities
Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components
Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling
Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments
Support reliability initiatives by defining and tracking SLIs/SLOs, automating incident response, and contributing to post-incident analysis
Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning
Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations
Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation
Identify repetitive manual work and replace it with efficient automation
Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations
Requirements
4–7 years of experience in DevOps, SRE, Platform Engineering, or a similar role
Strong practical experience with infrastructure automation in complex production environments
Good hands-on knowledge of Terraform, Ansible, or similar Infrastructure as Code tools
Experience building and maintaining CI/CD pipelines and working with GitOps practices
Good understanding of infrastructure security, vulnerability management, and security best practices
Experience with security tools such as Snyk, CrowdStrike, or similar solutions
Practical experience with Kubernetes
Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing
Good scripting or programming skills in Python, Go, or Bash
Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations
Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry
Ability to work independently, prioritize tasks, and communicate effectively with technical teams
English proficiency at a communicative level or higher, as you will be working in an international team
Nice to have
Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations
Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers
Experience integrating telemetry from power, cooling, or environmental systems
Experience building internal platforms or self-service tools for engineering teams
Understanding of compliance and audit requirements in security-sensitive environments
What we offer
Benefits package
Opportunity to work on advanced infrastructure supporting large-scale AI workloads
Real impact on the reliability and scalability of next-generation compute environments
Collaboration with experienced engineers across infrastructure, platform, and AI domains
A fast-moving environment with space for ownership, technical input, and professional growth
About the company
We are building large-scale GPU infrastructure designed for AI training, inference, and high-performance compute workloads. Our focus is on reliability, scalability, and operational efficiency for demanding production environments.