Lead Linux System Administrator
About the role
We are looking for a Lead Linux System Administrator to take technical ownership of the Linux environment supporting large-scale GPU infrastructure used for AI training and inference workloads.
This role combines hands-on system administration with team leadership. You will be responsible for the stability, performance, security, and day-to-day management of Linux-based GPU servers, while also supporting and mentoring a team of administrators working in a complex production environment.
Responsibilities
Lead, mentor, and support a team of Linux System Administrators responsible for GPU infrastructure operations
Manage the full Linux server lifecycle, including provisioning, patching, configuration management, hardening, and performance tuning
Maintain and optimize the NVIDIA GPU software stack, including drivers, CUDA, cuDNN, NCCL, and GPU management tools such as DCGM and nvidia-smi
Support and manage MIG and GPU time-slicing configurations where needed
Develop and maintain automation for bare-metal provisioning, OS image management, and server configuration using tools such as Ansible and Terraform, along with custom scripting
Tune Linux systems for demanding workloads, including kernel parameters, local storage, parallel file systems, networking, and scheduler settings
Troubleshoot complex issues across hardware, drivers, the operating system, and cluster-level services
Work closely with DevOps/SRE, Site Operations, and AI/ML teams to ensure smooth integration between OS-level infrastructure and higher-level orchestration platforms
Support security hardening, vulnerability management, patch compliance, and operational standards across the server fleet
Participate in on-call support and contribute to continuous improvements in reliability, performance, and operational efficiency
Requirements
7+ years of hands-on experience in Linux system administration in production environments
At least 3 years of experience in a technical lead, lead administrator, or people leadership role
Strong expertise in administering Linux systems at scale
Hands-on experience with NVIDIA GPUs in Linux environments, including drivers, CUDA ecosystem components, and GPU management tools
Strong experience with Ansible or other configuration management tools
Good scripting skills in Python and/or Bash
Experience with Infrastructure as Code and infrastructure automation
Good understanding of high-performance computing, storage systems, and high-speed networking technologies such as InfiniBand or RoCE
Experience supporting AI/ML or HPC workloads
Ability to troubleshoot complex production issues and work effectively in a high-availability environment
English at a communicative level or better is required, as you will be working in an international team
Nice to have
Experience with cluster management and orchestration tools such as Slurm, Kubernetes, or Run:ai
Familiarity with bare-metal provisioning tools and large server fleet management
Experience in AI infrastructure companies, hyperscalers, or HPC/research environments
Knowledge of Linux performance tuning for GPU-accelerated workloads
A university degree in Computer Science, Engineering, or a related field
What we offer
Benefits package
Opportunity to lead Linux infrastructure supporting advanced AI workloads at scale
Work with modern GPU hardware and software stacks in a technically demanding environment
Collaboration with experienced engineers across infrastructure, platform, and AI teams
A dynamic workplace with room for ownership, technical influence, and professional growth