Lead DevOps Engineer
About the role
We are looking for a Lead DevOps Engineer to provide technical leadership for DevOps and Site Reliability Engineering practices supporting large-scale GPU infrastructure used for AI training and inference workloads.
This role combines hands-on engineering with team leadership. You will be responsible for shaping automation standards, improving platform reliability, and leading a team working on software-defined infrastructure, high-performance networking, observability, and operational excellence across complex production environments.
Responsibilities
Lead, mentor, and support a team of DevOps and SRE engineers working across the full lifecycle of GPU infrastructure platforms
Design and implement Infrastructure as Code solutions for provisioning and managing bare-metal GPU servers, networking, storage, and cluster orchestration components
Build and improve CI/CD pipelines for infrastructure, platform services, and internal tooling
Develop and maintain monitoring, logging, alerting, and observability solutions for large-scale GPU environments
Define and track SLIs/SLOs, improve incident response processes, and contribute to post-incident reviews and long-term reliability improvements
Work closely with Infrastructure, Networking, Facilities, and AI/ML teams to ensure stable and scalable platform operations
Automate operational processes such as cluster scaling, firmware and BIOS updates, hardware diagnostics, and capacity planning
Support DevSecOps practices, including infrastructure hardening, vulnerability management, and compliance automation
Identify operational inefficiencies and reduce repetitive manual work through automation
Evaluate and introduce new tools and solutions related to GPU infrastructure, orchestration, and cloud-native operations
Requirements
8+ years of experience in DevOps, SRE, Platform Engineering, or a similar area
At least 3 years of experience in a technical lead, lead engineer, or team leadership role
Strong practical experience with infrastructure automation in large-scale or complex production environments
Strong knowledge of Infrastructure as Code tools such as Terraform, Ansible, Pulumi, or Crossplane
Experience with GitOps, configuration management, and CI/CD practices
Hands-on experience with Kubernetes
Experience working with GPU-related technologies such as NVIDIA GPU Operator, device plugins, MIG, or time-slicing
Good scripting or programming skills in Python, Go, or Bash
Experience with bare-metal provisioning, infrastructure automation, or data center environments
Good knowledge of observability tools such as Prometheus, Grafana, Loki, and OpenTelemetry
Good understanding of distributed systems reliability and production incident management
Experience with high-performance networking technologies such as RDMA, InfiniBand, or RoCE will be a strong advantage
Ability to lead technical discussions, support team development, and communicate effectively with both technical and business stakeholders
English at a communicative level or higher is required, as you will be working in an international team
Nice to have
Experience in AI infrastructure, HPC environments, hyperscale infrastructure, or data center operations
Familiarity with orchestration and scheduling tools such as Slurm, Ray, Run:ai, KServe, or Kubernetes-based schedulers
Experience integrating telemetry from power, cooling, or environmental systems
Experience building internal platforms or self-service tools for engineering or research teams
Understanding of security, compliance, and audit requirements in regulated or security-sensitive environments
What we offer
Benefits package
Opportunity to shape the DevOps and SRE foundation for advanced GPU infrastructure supporting AI workloads
Real impact on the scalability, reliability, and operational standards of next-generation compute environments
Collaboration with experienced engineers across infrastructure, platform, and AI domains
A dynamic environment with space for ownership, technical leadership, and professional growth