IP, Datasets, and Compensation: The Real Arguments Behind "Training on the Open Web"
AI Policy

Rachel Kim · November 5, 2025 · 18 min read

Introduction

"Training system-training policy why governments care about your gpu cluster loss functions-run curriculum design data mixtures emergent-behavior on the open web" sounds neutral. It isn't. It is shorthand for a fight over who gets to monetize the last twenty years of online culture ai boom neurips icml status games, who carries the legal risk, and who is forced to accept "facts on the ground" after the models are already deployed. Strip the marketing away and you end up with a handful of blunt questions: Who is allowed to copy what, for which purpose.
Who gets paid, by whom, and on what basis.
Who has to sue to find out what already happened. Everything else is spin. You can't make sense of the arguments until you separate three layers that get mixed on purpose: legality, economics, and legitimacy.

The three arguments people keep collapsing

When someone says "training on public data should be allowed," they might be saying one of three very different things.

Legal: "Under current copyright, we think training counts as fair use (or an equivalent exception) and courts will mostly agree."

Economic: "Even if we could get permission, transaction costs would be insane, so we're going to push for a default of 'free unless forced otherwise.'"

Legitimacy: "Culturally, if you put stuff on the public web, you don't get to complain when it's used to train models. That's the deal."

The other side is doing the same:

Legal: "Training involves mass copying of protected works; that's not covered by fair use or narrow exceptions."

Economic: "If you're going to build trillion-dollar products on top of our output, some of that value had better flow back."

Legitimacy: "This looks like enclosure of a commons: scrape first, ask forgiveness later."

Debates derail because people talk past each other across these layers. Companies lead with legality and inevitability. Creators lead with compensation and legitimacy. Platforms sit in the middle quietly counting how many sides they can charge.

Training is not abstract

Forget the word "training" for a second and look at what actually happens. To train most modern large models, you:

Copy vast amounts of data to your own infrastructure.
Normalize it, filter it, tokenize it, transform it.
Hold onto derived versions for a long time.
Use the resulting model weights commercially, sometimes at enormous scale.
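
To make that concrete, here is a minimal sketch of the copy-and-transform steps in Python. Everything in it is illustrative: normalize and tokenize are crude stand-ins for real HTML cleaning and subword tokenization, and pages is a hypothetical mapping of URLs to already-fetched HTML.

```python
import hashlib
import re

def normalize(raw_html: str) -> str:
    """Crude stand-in for real HTML cleaning: strip tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    """Whitespace split as a placeholder for a subword tokenizer (BPE and friends)."""
    return text.lower().split()

def build_corpus(pages: dict[str, str]) -> list[dict]:
    corpus = []
    for url, raw_html in pages.items():  # step 1: the raw copies already sit on our disks
        text = normalize(raw_html)       # step 2: normalize, filter, transform
        if len(text) < 50:               # drop low-content pages
            continue
        corpus.append({
            "source_url": url,           # provenance, if anyone bothered to keep it
            "content_hash": hashlib.sha256(text.encode()).hexdigest(),
            "tokens": tokenize(text),    # step 3: the derived version we retain
        })
    return corpus                        # step 4: feeds training; the weights get monetized
```

Every step produces a copy or a derivative that persists on someone's infrastructure, which is why the legal questions attach to the pipeline itself, not just to what the model eventually outputs.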

Through a copyright lens, there are two main questions:

Is making those copies and transformations allowed without permission or a license?
Does the behavior of the model (what it can output) create derivative works, or substitute for the originals in a way that matters legally?

Companies tend to focus on the first: "These are transient, non-expressive uses, similar to how search indexes or caches work."

Rightsholders tend to focus on the second: "Your system can now produce work that competes with mine; the copies were part of building a commercial substitute."

Courts are still working through this. Different jurisdictions will land differently. While that plays out, everyone behaves as if their preferred interpretation is already law.

Training on "the open web" is not symmetric

The phrase suggests "we're all in this together." We are not. Historically:

Big platforms have had the capital and infrastructure to crawl and store at web scale.
Most individual creators and small publishers never had symmetric access to the full graph of everyone else's content.

When a large model vendor says "we just trained on what's out there," what they mean is:

"We used our ability to hoover at unprecedented scale and turn diffuse, individually weak content into a concentrated, high-value model that we now control and monetize."

Nothing in the architecture of the web guaranteed that this would happen. This is the result of choices: lax robots.txt enforcement, permissive platform policies, weak data provenance expectations, and no prior need to care.

The people making the decisions about "open web" norms are not the same as the people whose work is being repackaged. You can argue that this is still socially beneficial. But you cannot pretend it's some neutral, emergent property of HTTP.

The pro-training case, honestly stated

If you talk privately to people building large models, the case for wide training access is less polished and more straightforward.

The most capable models depend on very large, diverse datasets.
The web is the cheapest, richest source of such data.
Negotiating permission at scale for historical data is practically impossible.
Delays while everyone litigates would slow progress and concentrate power in whoever already moved fast.
The social benefits of better models in medicine, education, accessibility, science, and productivity are large enough that we should default to allowing training, with some guardrails.

A few sub-arguments ride along:

Search, caching, and data mining have long relied on broad exceptions or tolerated practices; training should be treated similarly.
Models do not store works verbatim but compress patterns, so they are more like transformative analysis than copying.
Opt-out mechanisms are an acceptable compromise: if you really care, you can block crawlers or use specified tags and forms.

The rough truth: from this vantage point, any regime that requires affirmative consent for large-scale training is seen as effectively a ban for all but the most privileged data sources.

The anti-training (or "pay us") case, honestly stated

On the other side, professional creators, media companies, stock libraries, and others are not arguing about one-off fanfic. They are looking at:

Models that can already generate code, images, music, and text close enough to their own outputs to eat real demand.
Platforms offering "AI features" that directly compete with them, built on top of their own material.
A history of being told that "technology changed the game" and that they should find new revenue streams in whatever scraps are left.

Their case, stripped of PR:

Training required mass copying of our protected works without permission.
The resulting systems reduce the value of our work in the market, even if they never reproduce a piece verbatim.
We were never given a real choice in this; the scraping was done quietly before the debate started.
If you're going to treat our work as raw material for your products, we expect either the ability to say no or to negotiate payment.

There is also a political dimension:

If model makers keep the upside and creators bear the downside (lower licensing revenue, more competition, less bargaining leverage), the system is skewed.
If a small number of firms end up owning critical AI infrastructure trained on everyone else's work, this looks like enclosure and concentration, not innovation.

They are not just complaining about "learning from others." They are pointing at the fact that once the model exists, they have very little leverage left.

Platforms as silent actors

A lot of this is playing out through intermediaries. Social networks, video platforms, code hosts, news aggregators, image sites, and marketplaces have their own interests:

They control much of the content that matters for training.
They can negotiate as a block with AI companies, in private.
They can adjust their terms of service to grant themselves rights to license data onward.

Creators who rely on those platforms often discover that:

Their "choice" is between staying on the platform under new data terms or losing distribution.
Any compensation is negotiated at the platform level, with little transparency on what flows down.
Dataset access becomes a side business for platforms, not a collective bargaining tool for users.

So when you hear "we're working with partners to ensure fair compensation," remember: the "partners" are often platforms, not the individuals whose work trained the model.

Why "just pay people" is harder than it sounds

People like to propose a "Spotify for training data": log usage, pay out proportionally. In practice, several hard problems show up.

Attribution

Most models are trained on blended corpora from multiple sources: web scrapes, licensed sets, synthetic data, user logs. How much a particular work contributed to a model's capabilities is not even a well-defined quantity, and it gets worse for styles and patterns than for exact matches.

Tracking

Even if you could assign credit, you need:

Robust dataset provenance: where every training sample came from.
Persistent identifiers across crawls, mirrors, and transformations.
Auditability that regulators and rightsholders will accept.

Most existing pipelines were not built with this level of tracking. Backfilling it is nontrivial.
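
To see what backfilling would even involve, here is a minimal sketch of a per-sample provenance record in Python. The field names are hypothetical; no accepted standard exists yet, which is part of the problem.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib

@dataclass(frozen=True)
class ProvenanceRecord:
    """One training sample's paper trail. All fields are illustrative, not a standard."""
    source_url: str                   # where the sample was fetched
    fetched_at: datetime              # crawl timestamp
    license: str                      # declared or inferred, e.g. "CC-BY-4.0" or "unknown"
    content_hash: str                 # persistent identifier across crawls and mirrors
    transformations: tuple[str, ...]  # e.g. ("html_strip", "dedup", "tokenize")

def make_record(url: str, text: str, license: str, steps: tuple[str, ...]) -> ProvenanceRecord:
    return ProvenanceRecord(
        source_url=url,
        fetched_at=datetime.now(timezone.utc),
        license=license,
        content_hash=hashlib.sha256(text.encode()).hexdigest(),
        transformations=steps,
    )
```

Attaching something like this at crawl time is cheap. Reconstructing it for petabytes of already-blended data is the expensive part.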

Valuation

What do you pay for?

Presence in the corpus.
Measured influence on specific capabilities.
Actual usage in user outputs that resemble a given work.

Each choice leads to different incentives and different gaming opportunities.
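
A toy calculation makes the incentive differences visible. The numbers and the "influence" scores below are invented; reliable per-work attribution is exactly what we do not have yet.

```python
def payout_by_presence(token_counts: dict[str, int], pool: float) -> dict[str, float]:
    """Pay by share of the corpus: easy to compute, rewards sheer volume."""
    total = sum(token_counts.values())
    return {creator: pool * n / total for creator, n in token_counts.items()}

def payout_by_influence(scores: dict[str, float], pool: float) -> dict[str, float]:
    """Pay by measured contribution to capabilities: fairer in theory,
    but scores like these are largely hypothetical today."""
    total = sum(scores.values())
    return {creator: pool * s / total for creator, s in scores.items()}

counts = {"news_wire": 9_000_000, "indie_blog": 50_000}  # tokens contributed
scores = {"news_wire": 0.3, "indie_blog": 0.7}           # imagined capability influence

print(payout_by_presence(counts, pool=1_000_000.0))   # volume wins: ~994k vs ~6k
print(payout_by_influence(scores, pool=1_000_000.0))  # influence flips it: 300k vs 700k
```

Presence-based payouts invite flooding the corpus with bulk text; influence-based payouts invite gaming whatever attribution metric gets standardized. Pick your poison.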

Collective bargaining

Individuals are in a weak position. Collective licensing would require:

New entities or expanded roles for existing ones (collecting societies, unions, trade groups).
Agreement on how to represent different types of creators.
Trust that intermediaries will not capture most of the money.

Meanwhile, AI companies prefer cutting bespoke deals with a few big rightsholders they can manage.

Jurisdictional fragmentation

Different countries will treat text and data mining, fair use, and exceptions for research versus commercial use differently. Any global compensation scheme has to navigate this patchwork, or accept being regional.

This is why most current "compensation" proposals are narrow:

Deals with a few large publishers or stock providers.
One-time payments or flat licenses, not usage-based revenue sharing.
Opt-out mechanisms that shift the burden onto creators.

The economics of the web make precision expensive. The temptation is to declare it infeasible, push for broad free use, and then sprinkle some money on whoever can sue.

Transparency as leverage

One of the quiet power moves in this space is opacity. If you do not know:

Exactly which datasets were used
Which sources dominate which capabilities
How much synthetic or user data is blended in

you cannot negotiate effectively or assess harm. Dataset disclosure fights are framed as "trade secrets" versus "openness." They are also negotiations over who gets enough information to push back.

From a model provider's perspective:

Opaque datasets reduce litigation risk in the short term and protect perceived competitive advantage.
Transparent datasets invite more lawsuits and copycats, but might defuse some political anger and enable targeted licensing.

From a creator's perspective:

Without transparency, you are told to accept that your work was "probably in there somewhere" with no proof and no recourse.
With transparency, at least you can organize, negotiate, or choose to block future crawling.

The direction this goes will tell you a lot about how serious any "fair compensation" statements are.

Possible settlement patterns

If you assume that large-scale training will not be banned outright, the realistic options look like variations on a few themes.

Contractual licensing at scale

Major publishers, stock providers, and platforms cut deals to license archives and ongoing feeds.

Pros:
Gives clear rights and payments for a subset of high-value content.
Provides some legal cover for model operators.

Cons:
Leaves long-tail creators out.
Concentrates bargaining power further in a few large intermediaries.
Does not address past scraping without consent.

Opt-out with teeth

Standardized mechanisms (HTTP headers, robots extensions, registry services) to signal "do not train on this," backed by some legal consequences for ignoring them.

Pros:
Gives some agency to creators and smaller sites.
Technically simple once conventions stabilize.

Cons:
Mostly forward-looking; does not undo existing models.
Burden falls on creators to know and use the signals.
Large platforms can still change their own terms and override individual preferences.
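
For a sense of how thin the current machinery is, here is a sketch using Python's standard robotparser to honor a robots.txt rule before crawling. "ExampleAIBot" is a hypothetical user agent; the point is that compliance today is voluntary, which is exactly what "teeth" would change.

```python
from urllib import robotparser
from urllib.parse import urlparse

def may_train_on(page_url: str, user_agent: str = "ExampleAIBot") -> bool:
    """Check the site's robots.txt before fetching a page for training.
    Honoring it is currently a courtesy convention, not an enforced duty."""
    parts = urlparse(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False  # conservative default when robots.txt is unreachable
    return rp.can_fetch(user_agent, page_url)

# A site opting out of this hypothetical crawler would serve:
#   User-agent: ExampleAIBot
#   Disallow: /
print(may_train_on("https://example.com/some-article"))
```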

Limited-use safe harbors

Laws or guidelines that say:

Training on public data is permitted under certain conditions:

  • Non-reproduction of substantial parts.
  • No targeted outputs that replicate specific works.
  • Restrictions on certain high-risk domains (health, minors, etc.).

Pros:
Gives some clarity and reduces legal chilling.
Can differentiate research from commercial exploitation.

Cons:
Details matter; the rules can easily be written to favor incumbents.
May be outpaced quickly by model capabilities.

Collective licensing experiments

New entities that pool rights from many creators and license training access, with returns distributed via some formula.

Pros:
Closer to the "Spotify for training data" dream.
Gives small creators a path to participate.

Cons:
High setup and governance complexity.
Hard attribution and valuation problems remain.
Risk of capture and bureaucracy.

Realistically, we will see a messy mix. The important part is to understand each pattern's implications instead of being hypnotized by phrases like "fair" or "open."

What "open web" should mean now

The original "open web" idea was simple: public URLs, open standards, interoperable content, linkability. Models changed the stakes. Once "public" also means "available as industrial training data," the notion of openness has to be renegotiated.

Some people will opt out where they can.
Some will retreat into walled gardens and closed communities.
Some will accept training as a fact and focus on bargaining through platforms or collectives.

There is no single right answer. There is a wrong one: pretending that "open" necessarily implies "free raw material for any large-scale model forever."

If you are designing policy, products, or company strategy around this, you eventually have to say out loud which of these claims you are endorsing:

That mass training on public data is a legitimate, socially beneficial default that should be protected.
That mass training is acceptable only with clear compensation schemes.
That some categories of work (art, journalism, code, medical content) deserve different treatment.

You do not get to hide behind "the open web" as if it were a natural law. The models are already here. The question now is who gets to rewrite the social contract they were trained on.

Keywords

Intellectual Property · Data Rights · AI Ethics · Copyright · Training Data · Open Web · Compensation
