DevOps Engineer
Position Overview
Important: Travel & On-Call Requirements
This role requires readiness for long-distance international travel to customer sites. The systems are deployed globally and, when issues cannot be resolved remotely, on-site interventions may be necessary, including deployments, upgrades, and complex troubleshooting activities.
Additionally, the position includes participation in a rotational on-call / standby schedule, ensuring operational continuity and the ability to respond to critical incidents outside of standard working hours.
We are looking for an experienced DevOps Engineer to join a team responsible for the maintenance and further development of a complex automation system deployed on-premises at customer sites. The system is based on Linux (Ubuntu) and a containerized Kubernetes architecture.
The platform consists of multiple cooperating application and infrastructure components, including:
backend services
GPU-based computing components (CUDA)
communication layer
storage
networking components
The environment is characterized by high operational complexity and strong dependencies between system layers (OS, Kubernetes, applications, networking, storage). Systems are deployed across multiple locations worldwide and often operate in environments with limited local IT support, which requires high reliability and well-defined operational procedures.
The DevOps role goes beyond reactive incident handling. A key objective of the project is to systematically reduce the need for on-site interventions by developing automated monitoring, diagnostics, and recovery mechanisms.
Responsibilities
Incident Handling and System Maintenance
Diagnosing and resolving issues related to:
Kubernetes clusters
containers (Docker)
Linux (Ubuntu) operating system
networking
storage (including NFS)
Analyzing logs and service health across application and infrastructure layers
Restoring full system functionality in production environments
Performing system deployments and upgrades at customer sites
Participating in on-site interventions when issues cannot be resolved remotely
Automation, Observability, and System Resilience
Designing and developing automated troubleshooting mechanisms
Detecting infrastructure and application-level issues early
Automatically validating the health of key system components:
OS
Kubernetes
containers
storage
networking
Building health checks and observability solutions (metrics, alerts, dashboards)
Creating and maintaining:
runbooks
standard recovery procedures
automated self-healing mechanisms
Documenting common incidents, root causes, and resolution methods
Collaboration and Architecture Improvement
Close cooperation with development and architecture teams
Contributing to architecture simplification and standardization
Improving overall system stability and reliability
Supporting long-term efforts to reduce operational overhead and manual interventions
Technical Requirements
Strong experience with Linux (Ubuntu) system administration and troubleshooting
Hands-on experience with Kubernetes, including cluster troubleshooting and container analysis
Practical knowledge of Docker
Solid understanding of networking and diagnosing network-related issues
Experience with NFS / storage troubleshooting
Operational knowledge of GPU / CUDA environments (compatibility, stability)
Experience working with:
RabbitMQ
PostgreSQL
Additional Requirements
Willingness to participate in an on-call / standby rotation
Readiness for business travel, including on-site customer visits
Ability to work independently in complex, distributed environments
Strong analytical and problem-solving skills
We offer
Salary: 20–28k PLN B2B base + action fee
Flexible working hours
Remote work options
Medical care program
MultiSport
Integration events
A contract of employment or self-employment (B2B), depending on your preference

xBerry Sp. z o.o.
xBerry is a team of engineers and technology creators who design innovative solutions for business. We build technologies in the areas of machine learning, IoT, robotics, and embedded systems, and much more...