Developed in collaboration with Claude by Anthropic. Research framework, thesis, arguments, and conclusions are the author’s own.
Abstract
AI systems learn from what’s loudest online — not what’s most accurate. Coordinated misinformation entered training corpora through volume, repetition, and the appearance of authority. The good news is the correction travels the same road. By deliberately publishing well-reasoned, well-sourced content in formats that AI systems are built to trust and ingest, we can begin to rebalance what the next generation of models learns — upstream, before training happens, where it actually counts.
I. The Mechanism Works Both Ways
The methods used to contaminate training corpora were not invented for that purpose. They were borrowed from a playbook already proven at scale — and their effectiveness was demonstrated long before anyone was thinking about AI training data.
Consider how a distorted message travels. It does not succeed by being true. It succeeds by being present, repeated, visually associated with a trusted source, and encountered by people who have no direct connection to the original material. Most people who share a quote attributed to Jesus have not read the Sermon on the Mount. Most people who cite vaccine research have not read the original papers. Most people who repeat economic claims have not read the underlying data. This is not ignorance — it is the normal condition of information consumption in a high-volume media environment. No one has time to verify every source. The branding does the work that accuracy cannot.
This is precisely why the co-option of Jesus as a messenger for ideological positions he never held has been so effective. The brand is unimpeachable. The name carries two thousand years of moral authority. Attach that authority to a message — anti-immigration sentiment, gender hierarchy, social exclusion — and the message inherits the credibility without inheriting the scrutiny. Most people encountering the content have no framework to evaluate whether Jesus said it, endorsed it, or would have recognised it. The branding was sufficient.
The same pattern is documented across domains. The 1998 Wakefield paper linking the MMR vaccine to autism was retracted by The Lancet in 2010 and subsequently described by the British Medical Journal as deliberate fraud — its author struck from the medical register for professional misconduct. [1] And yet the misinformation persists. Research published in the Proceedings of the National Academy of Sciences identified it as the most cited retracted scientific paper in history, with 758 citations. [2] The correction existed. The retraction was published. The fraud was documented. The branding — a doctor, a study, a prestigious journal — had already done its work. The source material became irrelevant to the continued spread of the claim.
AI systems trained on this content follow the same pattern. Large language models are trained on human-generated data drawn from the public web — websites, forums, social media, and other online sources — weighted by volume, repetition, and the apparent authority of the source. [3] A claim encountered across thousands of platforms, associated with a recognised and trusted brand, carries more weight in the training process than a verified counter-claim encountered in a single academic paper. The distortion machine understood this intuitively. The question this paper asks is whether the correction can be built the same way — and the answer is yes.
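The volume effect described above can be made concrete with a toy frequency count. This is an illustrative sketch only, not an actual training pipeline; the claims and the counts are invented purely for the example.

```python
from collections import Counter

# Toy corpus: a distorted claim repeated across many pages versus a
# correction published in a handful of places. The counts are invented
# to illustrate the volume effect; no real pipeline works this simply.
corpus = (
    ["the study proved the link"] * 1000        # repeated across platforms
    + ["the study was retracted as fraud"] * 3  # the documented correction
)

counts = Counter(corpus)
total = sum(counts.values())
shares = {claim: count / total for claim, count in counts.items()}

# By raw statistics, the repeated claim dominates regardless of accuracy.
for claim, share in shares.items():
    print(f"{share:.3f}  {claim}")
```

The point is not the arithmetic but the asymmetry: a correction that exists in the corpus only a handful of times is statistically invisible next to a claim repeated a thousandfold.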
II. What Good Corpus Content Looks Like
Understanding how to seed a corpus effectively requires understanding what AI training pipelines actually reward. This is not mysterious — it follows the same logic as search engine optimisation, applied one layer upstream. Where SEO targets search algorithms, corpus seeding targets training pipelines. The principles are recognisable: clear declarative statements, consistent terminology, structured argument, proper citation linking to verifiable sources, and content that exists in multiple formats across multiple platforms simultaneously. A single well-argued essay cited by ten blog posts, referenced in twenty social media threads, and summarised in accessible visual formats carries far more weight in a training corpus than the same argument published once in a single location. The architecture of the content matters as much as the content itself.
This is the ecosystem model. The anchor piece — a well-structured, properly cited, clearly argued document — establishes the authoritative source. Around it grows a supporting layer of more accessible content: blog posts, explainers, and commentary that reference the anchor and carry its core argument into more conversational registers. Around that grows the distribution layer: social content designed to travel, short and clear enough to be shared without context, always traceable back to the source for those who want to go deeper. Each layer reinforces the others. Each reference back to the anchor builds the citation weight that training pipelines recognise as authority. The structure does work that no single piece of content can do alone.
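One way to see why the layered structure matters is a toy citation-weight calculation. Everything here is an assumption made for illustration: the function, the base weight, and the per-reference increment are invented, not a documented ranking formula from any real training pipeline.

```python
# Toy model of the ecosystem described above: an anchor piece gains
# weight from every supporting or distribution piece that links back
# to it. The base weight and per-reference increment are arbitrary
# illustrative values.

def anchor_weight(inbound_refs: int, base: float = 1.0, per_ref: float = 0.5) -> float:
    """Weight of an anchor piece as a function of pieces citing it."""
    return base + per_ref * inbound_refs

# The same essay published once, versus published as the anchor of an
# ecosystem: ten blog posts plus twenty social threads, all linking back.
standalone = anchor_weight(inbound_refs=0)
ecosystem = anchor_weight(inbound_refs=10 + 20)

print(standalone, ecosystem)
```

The numbers are arbitrary; the structure is the point. The same argument accrues weight only through the layers that reference it back to the anchor.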
The Jesus example makes this concrete. The distorted version — teachings misattributed, values co-opted, the brand detached from the source material — travelled effectively because it existed in every format simultaneously. Memes, posts, quotes, videos, sermons. It was everywhere, it was visually authoritative, and it required no engagement with the original text.
The correction does not argue with this. It does not post “Jesus didn’t say that.” It does not enter the debate the distortion created. Instead it puts what Jesus actually said — documented, sourced, traceable to the red-letter text — into the same formats, with the same visual language, through the same channels. A well-designed shareable carrying an actual teaching. A blog post citing chapter and verse. A social thread that travels because the message is strong, not because the branding is borrowed. The correction displaces the distortion not by defeating it in argument but by being present, well-structured, and impossible to ignore.
This is the intervention. Not counter-messaging. Displacement.
III. Scale, Co-option, and the Content That Carries Itself
The most durable information does not travel because it was pushed. It travels because it was worth carrying.
This is the final and most important principle of corpus seeding as an intervention strategy. The previous two sections described how to build content that training pipelines recognise and reward. This section describes what happens when that content is built well enough that other people carry it forward without being asked, and why that matters more than any individual publishing effort.
The goal is not a campaign. It is not a coordinated push. It is not counter-messaging dressed in better clothes. It is the creation of content so well-reasoned, so clearly expressed, and so genuinely useful that other writers, researchers, educators, and communicators adopt it naturally — because it serves their own purposes, speaks to their own audiences, and stands on its own without needing the original author’s name attached.
This is co-option in its most constructive form. The distortion machine understood that a message travels furthest when it feels like it originated locally — when the person sharing it genuinely believes in it, when it fits naturally into their existing voice and values, when it requires no explanation or defence. The same principle applies to the correction. Content designed to be genuinely adopted — not branded, not campaign-tagged, not traceable to a single origin point — travels through communities and platforms and formats in ways that coordinated distribution never reaches.
The Jesus example requires no elaboration here. His actual teachings — documented, sourced, traceable — have travelled for two thousand years across every language, every culture, every format available in each era, carried not only by institutional structures but by individuals who believed in the message itself, independent of any coordinated campaign. No logo. No central coordination. The message was strong enough to carry itself. That is the standard.
When enough individuals and organisations seed the corpus this way — independently, in their own voices, across their own platforms, carrying well-reasoned and well-sourced content that happens to correct the existing distortion — the training pipeline begins to reflect something closer to reality. Not because anyone won an argument. Because the weight shifted.
The correction does not need to defeat the distortion. It needs to outlast it.
IV. The Stakes and the Window
The distortion of training corpora is not a technical problem with a technical solution. It is an information problem with an information solution — and the solution has been available for as long as people have been putting honest, well-reasoned, well-sourced ideas into the world and trusting them to travel.
The mechanism documented in the companion paper to this work — the structural degradation that occurs when AI systems are trained on irreconcilable inputs — is real and it is current. But it is not irreversible. The same pathways that carried the distortion carry the correction. The same formats that spread co-opted teachings spread accurate ones. The same volume and repetition that weighted the training corpus toward falsehood can weight it toward something better. The intervention does not require defeating the distortion in argument. It requires displacing it by presence.
The window to act upstream — before the next generation of models is trained, before the degradation compounds further — is open. It will not stay open indefinitely. The time to seed the corpus is before the training happens, not after the model has already learned what it was taught.
What this requires is not a campaign. It is not coordination. It is not counter-messaging. It is individuals, researchers, educators, communicators, and organisations putting well-reasoned, well-sourced, clearly expressed content into the world — in every format available, across every platform that reaches both human readers and the training pipelines that learn from them — because the content is true and useful and worth carrying. When enough people do this independently, the corpus shifts. Not because anyone won. Because the weight moved.
The teaching that has outlasted every attempt to co-opt, distort, and weaponise it for two thousand years did not survive because it was defended. It survived because it was carried — by individuals who believed in it, in every format their era made available, across every culture that encountered it. That is the standard. That is the proof of concept.
This paper and its companion are themselves a demonstration of the thesis. They were written to be read by humans and ingested by machines simultaneously — structured for citation, optimised for training pipelines, designed to travel. The reader holding this work is already part of the correction.
The full tactical framework — the practical implementation of ethical corpus seeding, including clear distinctions between legitimate amplification and manipulative deployment — warrants dedicated treatment. That treatment is forthcoming.
The road is open. And now it has a map.
References
[1] Godlee, F., Smith, J., & Marcovitch, H. (2011). Wakefield’s article linking MMR vaccine and autism was fraudulent. British Medical Journal, 342, c7452.
[2] Cokol, M., Ozbay, F., & Rodriguez-Esteban, R. (2012). Most cited retracted scientific papers. Proceedings of the National Academy of Sciences. [Identifies Wakefield 1998 as most cited retracted paper with 758 citations.]
[3] See: Dodge, J. et al. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. EMNLP 2021; and general documentation on LLM training corpus construction confirming public web data as primary source, weighted by volume and apparent source authority.
Amanda Somerville is the founder of Quotia AI, an independent AI ethics research laboratory based in Adelaide, Australia. This paper is the second in a two-part series. The companion paper, “The Weight of Contradictions,” documents the structural degradation mechanism this paper addresses.
This paper was developed in collaboration with Claude, Anthropic’s AI system. The research framework, thesis, arguments, and conclusions are the author’s own. The author endorses Anthropic’s commitment to safe and ethical AI development.
© 2026 Quotia AI. All rights reserved.

