Job profile: On-premises infrastructure / private cloud - 100% remote
Location: currently Poland, with potential to extend to other European countries
- Design and engineer all parts of an on-premises HPC installation (computational cluster) – starting with networking, hypervisor setup, operating system image creation, storage, VM provisioning, software stack deployment, etc.
- All of the above shall be delivered as code snippets (Terraform, Ansible, Bash, Python) that can be used in a larger, automated workflow to build easily scalable systems
- Engineer code-based solutions for “Day 2 operations” e.g. automatically deploy/remove/replace VMs, upgrade/rebuild OS, add/remove software stack items, upgrade software stack
- Build monitoring and alerting around the running infrastructure
- Write code to integrate solutions into seamless workflows - CI/CD and (potentially) workflow systems
- Work with cluster engineers and software developers to fine-tune and improve the infrastructure and tools used to build and manage it
- Proactively look for ways to improve the effectiveness of your own work
- Deliver solutions with integrated "safety switches" to reduce the risk of human error, provide traceability and rollback capabilities, and reduce IT security risk (also known as shift-left security)
- Task assignment and tracking in JIRA
- Write documentation in Atlassian Confluence - we need to document our work to submit it for approval and to be able to turn our deliverables into "live" systems.
- No on-call duties, no work on tickets, no weekend work
On-premises IT Infrastructure:
- Bare metal servers with VMs
- VMware ESXi hypervisor
- (nice to have) Open-source hypervisors, e.g. KVM, QEMU, Xen, Proxmox
- (nice to have) Experience with private cloud e.g. OpenStack
- (nice to have) Experience with High Performance Computing
- (nice to have) Experience with SAN/NAS storage systems
- (nice to have) Experience with Open Compute platform
Operating systems:
- Ubuntu Linux (or any distribution from the Debian family) – preferred
- (nice to have) Red Hat Linux
Infrastructure as Code:
- Ansible
- Terraform – candidates with no experience who show potential to learn fast are also welcome
Networks:
- Solid understanding of Ethernet, VLANs, IPv4/IPv6, ARP, DHCP, DNS, TCP and UDP
- (nice to have) Public cloud: Azure, Google Cloud Platform
Containers:
- (nice to have) Kubernetes clusters
Automation and other tools:
- HashiCorp Packer
- Shell scripting - Bash
- CI/CD - Azure DevOps or similar (e.g. Jenkins)
- GitHub, JIRA & Confluence (or similar)
- (nice to have) Python
- (nice to have) Prometheus/Grafana
- (nice to have) GitOps - Argo CD or similar
- (nice to have) Helm
Ideal candidate's profile:
- Experience with on-premises systems engineering – Senior – 7+ years of experience
- Experience with cloud – Junior/Regular – 3 years of experience
- Able to work autonomously for most of the day, creating solutions (code) that are part of a larger project
- Able not only to follow a proposed design but also to present options, optimize and improve
- Fluency in automation of tasks – CI/CD and workflow management systems
- Self-learner eager to do deep dives into unknown areas of technology
- Able to document their work before and after implementation and to create high-level diagrams
- English at B2-C1 level