Linux System Administrator
About the role
We are looking for a Linux System Administrator to support the Linux environment behind large-scale GPU infrastructure used for AI training and inference workloads.
This is a hands-on role focused on the deployment, maintenance, performance tuning, and reliability of Linux-based GPU servers. You will work closely with infrastructure and platform teams to keep the environment stable, secure, and ready for demanding production workloads.
Responsibilities
Install, configure, patch, and maintain Linux operating systems across GPU-based server environments
Manage and support the NVIDIA GPU software stack, including drivers, CUDA, cuDNN, NCCL, DCGM, and MIG/time-slicing configurations
Perform system performance tuning, kernel optimization, storage configuration, and networking setup for AI/HPC workloads
Develop and maintain automation scripts and operational tooling using Python, Bash, or similar technologies
Monitor system health, investigate alerts, and troubleshoot issues across hardware, drivers, operating systems, and cluster services
Support bare-metal provisioning and integration with orchestration platforms such as Slurm or Kubernetes
Work closely with Site Operations, DevOps/SRE, and AI/ML teams to support stable GPU cluster operations and infrastructure growth
Participate in on-call support, incident response, root cause analysis, and post-incident improvement activities
Support security hardening, patch compliance, vulnerability management, and operational standards across the server fleet
Requirements
4–8 years of hands-on experience in Linux system administration in production environments
Good knowledge of enterprise Linux environments, such as Ubuntu, Debian, Red Hat Enterprise Linux, or Rocky Linux
Experience with Linux administration at scale
Practical experience with configuration management, scripting, and infrastructure automation
Good scripting skills in Python and/or Bash
Good understanding of performance tuning, storage systems, and high-speed networking technologies such as RDMA, InfiniBand, or RoCE
Experience working with NVIDIA GPUs in Linux environments, including drivers, CUDA components, and GPU monitoring tools, is a strong advantage
Ability to troubleshoot complex technical issues in production environments
English proficiency at a communicative level or higher, as you will be working in an international team
Nice to have
Experience in AI/ML, HPC, or large-scale data center environments
Experience with bare-metal provisioning and fleet management
Familiarity with Slurm, Kubernetes, or similar orchestration tools
Knowledge of observability tools such as Prometheus and Grafana
Familiarity with DCIM platforms
Higher education in Computer Science, Engineering, or a related field
What we offer
Benefits package
Opportunity to work on Linux infrastructure supporting advanced AI workloads
Exposure to modern GPU hardware and high-performance computing technologies
Collaboration with experienced engineers across infrastructure, platform, and AI teams
A dynamic environment with room for ownership, learning, and professional growth