DevOps Engineer
About the role
We are looking for a DevOps Engineer to help build and operate automation, deployment, and reliability standards for large-scale GPU infrastructure used for AI training and inference workloads.
In this role, you will work on software-defined infrastructure supporting GPU clusters, high-performance networking, storage platforms, and internal AI services. This is a hands-on position for someone who is comfortable working close to the infrastructure, improving operational processes, and building reliable automation in a complex technical environment.
Responsibilities
Design, implement, and maintain Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components
Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling
Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments
Support reliability initiatives by defining and tracking SLIs/SLOs, automating incident response, and contributing to post-incident analysis
Automate operational tasks such as cluster scaling, firmware and BIOS updates, hardware validation, diagnostics, and capacity planning
Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations
Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation
Identify repetitive manual work and replace it with efficient automation
Evaluate new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations
Requirements
4–7 years of experience in DevOps, SRE, Platform Engineering, or a similar role
Strong practical experience with infrastructure automation in complex production environments
Good hands-on knowledge of Terraform, Ansible, or similar Infrastructure as Code tools
Experience building and maintaining CI/CD pipelines and working with GitOps practices
Good understanding of infrastructure security, vulnerability management, and security best practices
Experience with security tools such as Snyk, CrowdStrike, or similar solutions
Practical experience with Kubernetes
Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing
Good scripting or programming skills in Python, Go, or Bash
Experience with bare-metal provisioning, low-level infrastructure automation, or data center operations
Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry
Ability to work independently, prioritize tasks, and communicate effectively with technical teams
English proficiency at a communicative level or higher, as you will be working in an international team
Nice to have
Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations
Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers
Experience integrating telemetry from power, cooling, or environmental systems
Experience building internal platforms or self-service tools for engineering teams
Understanding of compliance and audit requirements in security-sensitive environments
What we offer
Benefits package
Opportunity to work on advanced infrastructure supporting large-scale AI workloads
Real impact on the reliability and scalability of next-generation compute environments
Collaboration with experienced engineers across infrastructure, platform, and AI domains
A fast-moving environment with space for ownership, technical input, and professional growth
About the company
We are building large-scale GPU infrastructure designed for AI training, inference, and high-performance compute workloads. Our focus is on reliability, scalability, and operational efficiency for demanding production environments.