CEO on Inc & Fast Co — Live AI Webinar May 18
Register now

Why frontier labs outsource RL environments: the domain coverage problem

Domain coverage is the binding constraint in RL training. Learn why frontier labs outsource RL environments and what to ask when evaluating RL vendors.

Table of contents

Key Points

Every frontier lab has the same problem. The demand for high-quality RL environments is growing faster than any in-house team can cover. Not because the engineering is too hard — the algorithms, the infrastructure, the compute are all tractable. Because the domain expertise isn't there. And no amount of hiring solves it.

Why do AI labs outsource RL environments? Frontier labs outsource RL environments because domain coverage — not compute or algorithms — is the binding constraint in post-training. Building a production-grade RL environment for a new workflow requires deep expertise in that domain, not just ML engineering. Your in-house team can build coding and math environments. It cannot staff its way to coverage across enterprise software, healthcare, legal, financial operations, and the dozens of other domains where agentic AI is being deployed.

The supply problem you can't hire your way out of

The economics of RL environment construction are unlike almost anything else in the AI training stack. Compute scales with spend. GPUs scale with procurement. Algorithms improve through research iteration that in-house teams can run internally. Domain expertise doesn't scale the same way.

The current supply chain for RL environments comprises a few hundred engineers across the entire industry. Anthropic, OpenAI, and the other frontier labs have significant in-house capacity — and it isn't enough. The Information reported that Anthropic discussed spending over a billion dollars annually on RL environments, a signal of how central they've become to the training stack. That spend exists precisely because in-house teams can't cover the breadth of domains labs need.

The bottleneck isn't technical ambition. It's that every new domain you want to improve through reinforcement learning requires someone who actually understands the work — not just how to simulate it, but what correct looks like, where the edge cases live, and how to build a verifier that experts would trust. Startups have emerged to fill this gap, and the better AI companies have built supply chains that go deeper than engineering talent. The ones that haven't are constrained to the domains their internal reinforcement learning teams already know. Reinforcement learning has become the central axis of model capability growth, and the case for moving beyond pre-training is now well established across the frontier lab community.

Why domain expertise is the real constraint

Domain expertise is the ceiling of environment quality — not your algorithms, not your infrastructure, and not the size of your engineering team. Consider what building a production-grade RL environment for an enterprise workflow actually requires. You want to train AI agents that can navigate Salesforce, process insurance underwriting documents, or manage fund accounting operations in the real world. The engineering scaffolding — sandbox infrastructure, APIs, rollout pipelines — is the easy part. Any competent software engineering team can build it.

The hard part is encoding what correct looks like. What does a good fund accounting output actually contain? Where do experienced controllers disagree, and how should those disagreements be resolved in the reward function? What failure modes does the agent need to learn to avoid, and how do you simulate the long-horizon, multi-step tool use sequences that reflect real workflow complexity?

These are not questions an ML engineer can answer from documentation. They require someone who has done the work — a senior professional in the relevant domain who can describe the decision-making process in enough detail that it can be translated into training data, trajectories, and verifiable reward signals. A language model can approximate the surface of a domain. A production-grade RL training environment requires domain expertise that can't be approximated.

The labs that recognize this have stopped trying to solve it through hiring. The profiles don't overlap: the person best qualified to build a healthcare prior authorization environment is not the same person who builds agentic AI environments for coding tasks. Scaling domain coverage means scaling access to domain experts — a fundamentally different supply chain problem than scaling compute or algorithms.

What a production-grade environment requires from domain experts

A sellable, production-grade RL environment is not just a task spec and a pass/fail verifier. It requires expert input at every stage of construction.

Tasks need a difficulty distribution that produces a useful training signal — a minimum pass rate of around 2–3% on rollouts, calibrated against actual model behavior, not written from intuition. Getting there requires domain experts who understand the work well enough to design tasks at the right level of complexity, and who can iterate on that calibration as the training agents improve.

Reward functions need to capture genuine correctness in the domain. Verifiers need to be validated against expert judgment — not written from first principles and deployed, which is where most RL environments fail before training even begins. Evals need to measure what the environment was actually designed to train, not what's easy to automate. Each of these requires sustained expert engagement, not a one-time consultation. The high-quality environments that produce real benchmark improvements are the ones built with experts who stayed involved through calibration, not ones built from a brief.

This is the standard that separates RL environments that advance model capability from ones that produce training runs with clean metrics and flat real-world performance.

The Gym Architect model — a different supply chain

The traditional model for RL environment construction assumes the builder is an ML engineer who is also a domain expert. That model works for coding environments, where the builder population is large and the domain is the same one they work in. It breaks everywhere else.

The Gym Architect model inverts the assumption. Instead of finding ML engineers with domain expertise, it finds domain experts and gives them the infrastructure to build RL environments without needing ML knowledge. A senior healthcare administrator, a financial analyst with twenty years in fund accounting, an experienced logistics operator — these are the people who hold the domain knowledge that defines the ceiling of environment quality. The right platform extracts that knowledge through structured interrogation, translates it into runnable infrastructure, and routes technical questions back to the expert in plain language.

This is how you get coverage across healthcare, legal, financial services, enterprise software, and the other workflow domains where agentic AI is being deployed at scale. Not by hiring ML engineers who happen to know those domains — there aren't enough of them — but by building the interface that lets domain experts contribute directly to RL training without touching a codebase.

The supply chain implications are significant. Open-source frameworks on GitHub have commoditized the sandbox and orchestration layer. What they haven't commoditized is access to vetted, senior professionals across dozens of domains, compensated at rates that reflect the value of their expertise, engaged in a structured process that produces scalable, production-grade environments rather than one-off deliverables. That's the moat — not the algorithms, not the infrastructure, but the expert network and the process that connects it to RL environment construction at scale.

The exclusivity question

For lab buyers, the moat question has a specific commercial dimension. An RL environment that any competitor can license is a training input, not an advantage. An exclusive environment in a strategically important domain is a capability moat — one that compounds as the AI models trained against it improve and the environment evolves to match.

Exclusive environments command four to five times the price of non-exclusive arrangements, and the labs paying that premium understand why. If OpenAI and a competitor are both training on the same fund accounting environment, neither has a benchmark advantage in that domain. If you have an exclusive environment built by the senior professionals who define best practice in that workflow, the training signal is categorically different — and the next generation of model improvements in that domain flows to one place.

The exclusivity decision is a function of strategic priority. For domains where you have a clear commercial interest in being the best — where agentic AI deployment in that workflow is a revenue driver — exclusive environments are a defensible investment. For domains where coverage matters more than advantage, non-exclusive catalogue environments are more cost-efficient. Most lab training roadmaps include both, which is why the vendors that offer a range of exclusive and catalogue environments across genuine domain breadth are the ones that end up embedded in lab training roadmaps rather than handling one-off contracts.

Training a GPT-class LLM on generic datasets produces a capable model. Training it in a high-quality, exclusive RL environment built by the people who actually do the work produces a reinforcement learning feedback loop that compounds — and a capability in that domain that isn't easy to replicate.

What to ask when evaluating vendors

The RL environments market has expanded fast, and quality variance is significant. A few questions cut through the noise quickly.

How is domain expertise sourced? The ceiling of environment quality is the ceiling of the domain expertise behind it. If a vendor can't describe a specific process for sourcing, vetting, and engaging senior professionals in the target domain, the environment was built from approximations. Ask to see the expert profiles, not just the task specs.

What's the calibration process for verifiers? Automated evals that weren't validated against expert judgment train your AI models toward the wrong target. Ask specifically how the verifier was calibrated, and what the process looks like when expert reviewers and automated scoring disagree.

What does the environment do when the model saturates it? A static environment has a ceiling. A well-designed one evolves as training agents improve — adding difficulty, introducing new simulation environments, updating tasks to reflect distribution shift in the real world. Ask about the update and optimization process before you commit to a training roadmap — the failure points in an RL pipeline compound quickly when the environment stops producing a useful training signal.

Domain coverage is the binding constraint in post-training reinforcement learning. Compute scales. Algorithms improve. In-house teams cover what they can cover. The domains they can't — the real-world workflows where the next wave of agentic AI value is concentrated — require a different supply chain. One built around domain expertise, not engineering headcount. And one where the moat isn't the infrastructure. It's the people who know the work well enough to make the environment worth training against.

Ready to build RL environments your training roadmap can actually depend on? See how Invisible builds RL environments or get started today.

FAQs

Invisible solution feature: RL environments

Expert-built RL environments

Informed by real workflows and verified with domain experts. Ready for agents that need to do more than pass a test.
A flowchart showing incoming coding tasks, accounting data, banking transactions, and legal contracts flowing into a Real Work Agent and outputting documents for a reward.