
In 2026, training will still be anchored in human data and judgement, and that will continue to be the case for the next decade. The most capable models will be trained on carefully collected human signals about what “good” looks like in real workflows, real decisions, and real conversations. Human data will define the objectives, the red lines, the tone, and the trade-offs. Synthetic data will be used to automate large portions of the annotation pipeline and to generate thousands of variations, without replacing the underlying human corpus that gives the system context and prevents drift.
Synthetic data will become a tool to expand, stress, and scale, providing cheaper and faster training pipelines for model development. This makes sense when you already know what good looks like from real production data, but you don’t see certain edge cases often enough. Consider training an agent that handles payment disputes. You use historical human-labeled cases to define the patterns and policies, coupled with synthetic variants of rare but high-risk scenarios—multi-currency chargebacks, overlapping fraud indicators, and obscure cross-border flows. The human data anchors the behavior; synthetic lets you stress-test and fill out the long tail without waiting years for enough real examples.
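The pattern above can be sketched in a few lines. This is an illustrative toy, not a real pipeline: the case records, templates, and the `synth_variants` helper are all hypothetical, and in practice the synthetic variants would come from a generative model rather than random perturbation.

```python
import random

# Hypothetical human-labeled dispute cases (illustrative records only)
human_cases = [
    {"type": "chargeback", "currency": "USD", "label": "refund"},
    {"type": "chargeback", "currency": "EUR", "label": "escalate"},
]

# Rare, high-risk scenarios seen too infrequently in production logs
rare_templates = [
    {"type": "multi_currency_chargeback", "currencies": ["USD", "JPY"]},
    {"type": "cross_border_flow", "route": "acquirer->intermediary->issuer"},
]

def synth_variants(template, n=3, seed=0):
    """Generate n perturbed variants of a rare template.
    Labels still come from human review, not from the generator."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        v = dict(template)
        v["amount"] = round(rng.uniform(10, 5000), 2)
        v["synthetic"] = True  # tag synthetic rows for auditability
        variants.append(v)
    return variants

# Human data anchors the distribution; synthetic fills the long tail.
training_pool = list(human_cases)
for t in rare_templates:
    training_pool.extend(synth_variants(t))

print(len(training_pool))  # 2 human + 6 synthetic = 8
```

Tagging every synthetic row makes it easy to audit the human/synthetic mix later and to down-weight or exclude synthetic examples in evaluation.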
“In 2026 I see an increasing use of synthetic data, or partially synthetic data, where the model is good enough to do a plausible or not very good version of the thing, where in the past you would need humans. And now you can use a pretty weak open source model to take a first pass at it and then have the human tidy it up.” – Marek Duda
With an open-source model, you can generate a large pool of synthetic examples and let humans do the down-selection. We saw an extreme version of this with DeepSeek, which effectively used one frontier system to produce training signal for another. That kind of “model-on-model” bootstrapping illustrates what synthetic data can do at scale, but it hits a ceiling. You end up with a knock-off version that mirrors the source model's capabilities but is fundamentally capped by the original system's limitations and blind spots. Human data, in the form of feedback, supervised signals, and preference modeling, is the critical ingredient enabling advanced capabilities. In 2026, we’ll see models take a first pass with humans correcting the output, whereas previously humans were required to create the initial, high-quality example.
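The generate-then-down-select workflow can be sketched as a simple pipeline. Everything here is a stand-in: `weak_model_draft` is a placeholder for a cheap open-source model, and the accept/edit functions are toy proxies for human review.

```python
def weak_model_draft(prompt):
    """Stand-in for a weak open-source model's first pass (hypothetical)."""
    return f"DRAFT: {prompt} ..."

def human_review(draft, accept_fn, edit_fn):
    """Humans down-select and tidy up rather than write from scratch."""
    if not accept_fn(draft):
        return None           # rejected: never enters the corpus
    return edit_fn(draft)     # accepted: human corrects the output

prompts = ["explain a chargeback", "summarize a fraud flag", "???"]
corpus = []
for p in prompts:
    draft = weak_model_draft(p)
    final = human_review(
        draft,
        accept_fn=lambda d: "???" not in d,          # toy quality gate
        edit_fn=lambda d: d.replace("DRAFT: ", ""),  # toy human edit
    )
    if final is not None:
        corpus.append(final)

print(corpus)  # only the drafts that passed human review, tidied up
```

The division of labor is the point: the model does the cheap volume work, and scarce human attention is spent only on filtering and correction.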
But understanding where human judgment remains non-negotiable is the difference between competitive advantage and costly missteps. The challenge is availability. True domain experts are scarce and already oversubscribed in their fields. Evaluating advanced model outputs, with their nuanced reasoning, multi-step workflows, and real-world consequences, demands far more sophistication than simple labeling work. And the time required to collect high-quality signals at the volume needed for frontier training doesn't compress easily, even with better tooling. This is where Reinforcement Learning from Human Feedback (RLHF) has become essential—humans define the reward signals, verify outputs, and intervene precisely where synthetic data cannot capture edge cases, ambiguity, or real-world context.
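To make “humans define the reward signals” concrete, here is a minimal sketch of the standard preference-modeling objective used in RLHF reward-model training (a Bradley–Terry model over pairwise human choices). The reward values are made-up toy numbers, not outputs of any real model.

```python
import math

def preference_prob(reward_a, reward_b):
    """Bradley-Terry: probability a human prefers output A over B,
    given scalar reward estimates for each output."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Toy human preference pairs: (reward of preferred, reward of rejected).
# In real RLHF, humans supply the choices; a reward model supplies the scores.
pairs = [(2.0, 0.5), (1.2, -0.3), (0.1, 0.0)]

# Reward-model training minimizes the negative log-likelihood of the
# human's choices; the human labels are the irreplaceable ingredient.
loss = -sum(math.log(preference_prob(a, b)) for a, b in pairs) / len(pairs)
print(round(loss, 3))
```

A lower loss means the reward scores agree more strongly with the human preferences; note the third pair, where the scores barely separate the two outputs, contributes most of the loss.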
“As we get closer to AGI, the amount of high-quality human data needed in narrow or specialized domains keeps rising, and it’s getting harder and harder to meet that bar.” – Jenny Bright
The safest rule of thumb for enterprises is simple: anchor on humans. For high-stakes use cases—regulated decisions, customer-facing agents, anything touching money, health, or kids—your primary training and eval data should still be first-party logs and expert labels, with synthetic used sparingly for stress tests, rare events, and “what if?” scenarios. Synthetic data is useful around the edges: hardening RL environments, filling out long tails, and speeding up experiments on internal tools. But the systems you actually ship into production should be tuned, checked, and signed off against human data and human judgement. That’s where trust, accountability, and real competitive advantage still live.