Senior ML Platform Reliability & Infrastructure Engineer

7 947 - 8 882 USD net per month - B2B
6 400 - 6 956 USD gross per month - Permanent
AI/ML

Fabryczna 6, Wrocław

Holisticon Connect

Full-time
Permanent, B2B
Senior
Remote

Job description

Holisticon Connect is a division within NEXER GROUP, a custom software development company. We started in Poland in 2017 and are now a team of over 140 people. We work with world-renowned brands from Scandinavia, the UK, and Western Europe. Our goal is to grow stronger in competence rather than in numbers. If you like what we do, check out our offer. Maybe we will have the pleasure of meeting you! 😊


We are looking for a Senior ML Platform Reliability & Infrastructure Engineer to join a highly advanced drug discovery platform team working at the intersection of machine learning, large-scale data systems, and computational science. The team builds the core infrastructure that enables AI-driven drug design — processing massive biological and chemical datasets and running large-scale model training on high-performance computing systems.

The mission is to transform cutting-edge research models into reliable, scalable, production-grade systems that directly support the discovery of new medicines. This is a highly impactful environment where engineering excellence meets real-world scientific innovation in the life sciences domain.

 

We offer a choice of employment form: B2B or an employment contract (UoP).

  • UoP: 23 000 – 25 000 PLN gross/month

  • B2B: 170 – 190 PLN net/hour + VAT

Responsibilities:

  • Profile and optimise inference latency and throughput for model-serving runtimes handling high-volume prediction requests behind a routing/gateway layer.

  • Design and implement comprehensive observability across the platform by adding distributed tracing, effective logging, Grafana dashboards, alerting policies, and SLO/SLI frameworks using Prometheus, Loki, and OpenTelemetry.

  • Harden Kubernetes workloads running on GKE by tuning GPU/CPU resource allocation and improving resource scaling.

  • Improve the resilience of asynchronous job pipelines built on Argo Workflows, Dapr pub/sub, and Redis, including retry strategies, dead-letter handling, and backpressure mechanisms.

  • Collaborate with ML engineers and scientists to reduce friction in the model lifecycle from training and registration through to production serving.

You might be the perfect match if you are/have:

  • Distributed systems engineering - 5+ years designing, operating, or scaling multi-service architectures in production. Strong intuition for failure modes, cascading faults, and capacity planning.

  • Kubernetes - deep, hands-on experience with GKE or equivalent managed Kubernetes. Comfortable with networking, RBAC, resource management, custom controllers, and debugging pod-level issues (OOMKilled, CrashLoopBackOff, scheduling failures).

  • Observability - proven track record building monitoring stacks using Prometheus, Grafana, Loki, and OpenTelemetry. Able to define meaningful SLIs, configure alerting that reduces noise, and build dashboards that accelerate incident response.

  • Python - strong proficiency. The majority of the platform is Python, including FastAPI services, async GraphQL APIs, ML serving runtimes, and data pipelines.

  • Infrastructure as Code - experience with Terraform managing cloud resources at scale on GCP or AWS.

  • Cloud platforms - working knowledge of GCP services: GKE, Cloud SQL, Secret Manager, IAM.

Moreover, we appreciate skills in these areas:

  • ML infrastructure - exposure to model serving frameworks (MLServer/Seldon, Ray/Anyscale, or similar), training pipelines, and model registry patterns (W&B or similar).

  • Message-oriented architectures - experience with Dapr, Redis Streams, or comparable pub/sub and event-driven patterns.

  • Workflow orchestration - hands-on with Argo Workflows, Prefect, Airflow, or similar DAG-based pipeline engines.

  • GraphQL APIs - familiarity with Apollo Server or Strawberry GraphQL in production settings.

  • Incident response - comfortable leading debugging sessions across distributed components, correlating logs, traces, and metrics to identify root cause under time pressure.

By joining us, you gain the following: 

  • Opportunity to work on exciting, international projects in cutting-edge industries like Automotive, Biotech, IoT; 

  • Possibility to develop in cloud technologies;

  • Becoming part of a team that believes that the next step to a promising future is to put your heart into it and make it happen; 

  • Respect for your private life so you don't have to work overtime or on weekends; 

  • Team Events budget to socialize outside of work; 

  • Company Events to celebrate smaller and bigger successes (Summer Party, Programmer's Day, and trips abroad: so far we've been to South Africa, Åre, and Barcelona).


Perks and benefits:
 

  • Fully remote work or in our office in Wrocław; 

  • Benefits such as Luxmed, Multisport, and life insurance with Nationale Nederlanden;

  • Attractive referral system (9.5k for senior, 6k for mid, 2.5k for junior);

  • Personal Training Budget with additional paid hours; 

  • Passion Day - an extra day off for your hobby to spend as you please; 

  • Flexible working hours with no micro-management. Our core hours are 9-15; the rest of the working time is up to you;

  • We provide high-quality work equipment + 2 additional monitors and accessories. 
     

If you apply for this position and match our expectations, then: 

1) You will be invited to an HR Screening with our IT Recruiter. 

2) You will have a technical interview. 

3) You will meet with a client. 

Submit your application online in one easy step! Apply now! 

Tech stack

  • Python - master

  • Kubernetes - advanced

  • Terraform - advanced

  • Grafana - advanced

  • Prometheus - advanced

  • AWS - advanced

  • OpenTelemetry - regular

  • GKE - regular

  • RBAC - regular

  • GCP Services - regular

Applications are open until 01.05.2026.