RL Environment Engineer (LLM Training Environments and Performance)

15 500 - 25 000 USD net per month - B2B
AI/ML

San Carlos St 333, Bucharest (+2 locations)

Preference Model via XOR Inc.

Full-time
B2B
Senior
Remote

Job description

About the company

XOR is hiring exclusively on behalf of our partner Preference Model.


Preference Model is building the next generation of training data to power the future of AI. Today's models are powerful but fall short of their potential across diverse use cases, because many of the tasks we want them to perform lie outside their training data distribution. Preference Model creates reinforcement learning environments that encapsulate real-world use cases, enabling AI systems to practice, adapt, and learn from feedback grounded in reality. We seek to bring the real world into distribution for the models.

Our founding team has previous experience on Anthropic’s data team building data infrastructure, tokenizers, and datasets behind the Claude model. We are partnering with leading AI labs to push AI closer to achieving its transformative potential.

The company has closed a large Seed round from Tier-1 VCs in Silicon Valley and is working with top AI labs, which inform its priorities and timelines.


XOR runs the end-to-end hiring process for this role (screening, take-home, and coordination with the Preference Model team). Please apply through this posting to be considered.


What you’ll do

You will design and build realistic engineering tasks and environments that train and evaluate LLMs. Depending on your strengths, you may focus more on production ML systems or more on performance and low-level optimization - both are valuable here.


Responsibilities

  • Build MLE/SWE-style RL environments and tasks with strong engineering quality (not notebooks).

  • Target a specific model and match a defined difficulty distribution.

  • Iterate fast - edit and improve tasks within 24 hours of receiving feedback.

  • Ship with minimal supervision - strong ownership is key.
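
To make the first two responsibilities more concrete, here is a minimal sketch of a task with a programmatic check, plus a helper that compares measured pass rates against a difficulty distribution. Everything in it is an assumption made for illustration: the `Task` class, the function names, and the bucket thresholds are hypothetical and do not describe Preference Model's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """A hypothetical task: a prompt plus a programmatic verifier."""
    name: str
    prompt: str
    verify: Callable[[str], bool]  # True if a submission passes


def pass_rate(task: Task, submissions: List[str]) -> float:
    """Fraction of sampled model submissions that pass the verifier."""
    if not submissions:
        raise ValueError("need at least one submission")
    return sum(task.verify(s) for s in submissions) / len(submissions)


def difficulty_bucket(rate: float) -> str:
    """Map a pass rate to a coarse bucket (thresholds are illustrative)."""
    if rate >= 0.7:
        return "easy"
    if rate >= 0.3:
        return "medium"
    return "hard"


def difficulty_histogram(rates: Dict[str, float]) -> Dict[str, float]:
    """Share of tasks per bucket, for comparison with a target distribution."""
    if not rates:
        raise ValueError("need at least one task")
    counts = {"easy": 0, "medium": 0, "hard": 0}
    for rate in rates.values():
        counts[difficulty_bucket(rate)] += 1
    return {bucket: n / len(rates) for bucket, n in counts.items()}


# Example: a toy task whose verifier accepts only the string "4".
toy = Task(name="sum", prompt="Return 2 + 2", verify=lambda s: s.strip() == "4")
toy_rate = pass_rate(toy, ["4", "5", " 4 ", "four"])
```

In practice the submissions would come from sampling the target model several times per task, and the resulting histogram would be compared with whatever difficulty distribution has been defined for that model.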


Must-haves (for everyone)

  • Strong Python (engineering-quality).

  • Production mindset - debugging, reliability, iteration speed.

  • Hands-on LLM/GenAI work in production (shipping and operating real systems).

  • Docker and end-to-end ownership (build, fix, scale pipelines).

  • At least 3 hours of overlap with PST and advanced English (C1/C2).

  • You can meet throughput expectations and respond quickly to feedback.


Nice-to-haves (either track is great)

Track A - ML systems and LLM tooling (higher-level systems)

  • Evaluation harnesses, MLOps/CI/CD, monitoring, scalable pipelines, data tooling.

  • Experience designing tasks and environments for evaluations or RL-like feedback loops (nice to have, not required).

Track B - Performance and low-level optimization (kernel and inference track)

  • GPU/CPU performance fundamentals - memory hierarchies, threading/synchronization, cache/coalescing.

  • CUDA/HIP/ROCm kernel optimization, PyTorch custom ops/extensions, compiler/JIT stacks (Triton, XLA, TorchInductor, LLVM/MLIR/TVM).

  • Mixed/low precision kernels (FP16/BF16/FP8/INT8) and performance trade-offs.


Important note

You do not need prior “RL Environments” job experience. If you’re a strong ML systems engineer or a strong performance and low-level engineer who can build rigorous tasks and tooling, you can be a great fit. Exposure to RL, bandits, or agentic systems is a plus, not a hard requirement.


Not a fit if

  • You’re primarily a prompt engineer without strong ML and engineering foundations.

  • You’re research-only with little or no production ownership.

  • You only build in notebooks or rely heavily on managed AutoML tools.


Working conditions

  • Remote contractor role, full-time (40 hours per week), flexible schedule.

  • Bonuses per delivered task, in addition to base pay.

  • Potential path to FTE and relocation (subject to performance and mutual fit).


Compensation

  • $90-$130 USD/hour base pay (equivalent to roughly $15,500-$22,500 per month), depending on seniority and take-home assignment quality.

  • Monthly performance bonuses in addition to the base pay.
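
As a rough cross-check of the hourly-to-monthly conversion, the sketch below computes a monthly equivalent from the hourly rate. It assumes the 40 hours per week stated under working conditions, 52 paid weeks per year, and an even spread over 12 months. The 52-week figure and the function name are assumptions, not terms from the posting.

```python
HOURS_PER_WEEK = 40    # from the working conditions above
WEEKS_PER_YEAR = 52    # assumption: no unpaid weeks
MONTHS_PER_YEAR = 12


def monthly_equivalent(hourly_rate_usd: float) -> float:
    """Approximate monthly pay for a given hourly rate, excluding bonuses."""
    return hourly_rate_usd * HOURS_PER_WEEK * WEEKS_PER_YEAR / MONTHS_PER_YEAR


low = monthly_equivalent(90)    # 15600.0
high = monthly_equivalent(130)  # about 22533.33
```

Under these assumptions the base pay works out to about $15,600-$22,500 per month before bonuses.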


Process

1) Apply via the job board

Please submit your CV and add a short note on which track fits you best:

Track A - ML systems and LLM tooling (higher-level systems): you build production LLM/ML systems, including evaluation harnesses, data and tooling, MLOps/CI/CD, monitoring, scalable pipelines, reliability and debugging.

Track B - Performance and low-level optimization (kernel and inference track): you focus on performance and systems, including GPU/CPU optimization, CUDA or kernel work, PyTorch extensions/custom ops, compiler/JIT stacks (for example Triton, TorchInductor, LLVM/MLIR), inference efficiency and profiling.


2) Short take-home assignment (form)

  • After you apply, XOR will share a short take-home: a form containing a small task.

  • The Preference Model technical team will review your submission.

  • In parallel, you can schedule a short call with XOR to learn more about the role and the company and ask questions.

3) Team lead interview

  • If the take-home looks strong, we will schedule a technical interview with the Preference Model team.

  • The final decision is made after the interview.


Note on take-home compensation

  • Time spent on the take-home can be compensated if you receive an offer.


Tech stack

  • English - C1

  • Python - master

  • LLMs - master

  • GenAI - master

  • Docker - advanced

  • Linux - advanced

  • PyTorch - regular

  • MLOps - regular

  • CI/CD - regular

  • CUDA - junior

Published: 29.01.2026
