AI Startup’s Radical Plan to Scan and Shred Millions of Books Sparks Copyright Firestorm
WASHINGTON — An ambitious AI startup plans to scan millions of books for artificial intelligence training data and destroy the originals afterward, igniting fierce debate over copyright law, cultural preservation, and the future of publishing.
The controversial initiative, detailed in a recent Washington Post investigation, centers on Preserve AI, a Silicon Valley firm backed by venture capital heavyweights. The company’s blueprint involves acquiring vast libraries of physical books—potentially numbering in the millions—digitizing their contents at high speed using advanced optical scanners, and then disposing of the paper copies to cut storage costs and environmental impact.

Preserve AI’s CEO, Elena Vargas, defended the approach in interviews, arguing it represents a “necessary evolution” for feeding hungry large language models (LLMs) with high-quality, diverse text data. “Books are the gold standard for training AI,” Vargas stated. “Physical copies take up space, degrade over time, and aren’t accessible to models. We’re preserving knowledge digitally while responsibly managing resources.”[3]
Legal and Ethical Battlegrounds
The plan has drawn sharp criticism from authors, publishers, and copyright watchdogs. The U.S. Copyright Office, in a pre-publication report on generative AI training, highlighted the massive scale of data ingestion in AI development, noting that one model was trained on more than 2 billion documents, including 4 million books that the report says accounted for about 20% of its training data.[3] Preserve AI’s proposal amplifies these concerns by explicitly targeting physical destruction, raising questions about fair use and the transformation of copyrighted works.
“This isn’t preservation; it’s destruction,” fumed bestselling author Neil Gaiman on social media. “Scanning for AI without permission turns literature into fodder for chatbots, then tossing the originals? It’s cultural vandalism.” Publishers like Penguin Random House and HarperCollins have signaled potential lawsuits, echoing ongoing battles against AI firms like OpenAI and Anthropic over unauthorized use of books in training datasets.
Legal experts are divided. Proponents cite precedents like the Google Books project, which scanned 25 million volumes without destroying them and survived fair use challenges. “If the scans are used solely for non-expressive training—indexing and pattern recognition—destruction might not change the analysis,” said copyright scholar Jane Ginsburg of Columbia Law School. Critics counter that physical disposal signals bad faith and could undermine claims of transformative use.
Technical Feats and Industry Context
Preserve AI’s technology promises to scan a book in under 60 seconds, using robotic arms, infrared imaging, and machine learning to flatten pages non-destructively before digitization. Post-scan, books would be pulped or incinerated, with proceeds from recycled paper offsetting costs. The startup claims partnerships with underfunded libraries and estate sales to source volumes legally, focusing on out-of-print titles where rights are murky or orphaned.
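The stated 60-second scan rate invites a quick feasibility check. The sketch below is back-of-envelope arithmetic only: the target count, duty cycle, and fleet size are illustrative assumptions, not figures from the company.

```python
# Back-of-envelope throughput estimate for the claimed scan rate.
# All figures below are illustrative assumptions, not company data.

SECONDS_PER_BOOK = 60          # claimed upper bound per scan
BOOKS_TARGET = 3_000_000       # "millions" -- assume 3M for illustration
HOURS_PER_DAY = 20             # assumed duty cycle per scanner
SCANNERS = 50                  # hypothetical fleet size

books_per_scanner_per_day = HOURS_PER_DAY * 3600 // SECONDS_PER_BOOK
fleet_daily = books_per_scanner_per_day * SCANNERS
days_needed = BOOKS_TARGET / fleet_daily

print(f"{books_per_scanner_per_day} books per scanner per day")
print(f"{fleet_daily} books per day across the fleet")
print(f"{days_needed:.0f} days to digitize {BOOKS_TARGET:,} books")
```

Under these assumed numbers, a 50-scanner fleet clears three million volumes in roughly 50 days, which suggests the bottleneck would be sourcing and logistics rather than scanning itself.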
This comes amid broader AI–media tensions. The Washington Post itself faced internal revolt over its error-prone AI-generated podcasts, in which 68–84% of scripts failed quality checks yet launched anyway.[1][2] Staff decried misattributions and fabricated quotes, with one editor calling it “deliberately warping our own journalism.” Such blunders underscore the risks when AI ingests flawed or copyrighted data, fueling skepticism toward ventures like Preserve AI.
| Project | Scale | Controversy | Outcome |
|---|---|---|---|
| Google Books | 25M books scanned | Copyright infringement suits | Fair use victory (2015) |
| OpenAI/ChatGPT | Billions of pages | Authors sue over book piracy | Ongoing litigation |
| Preserve AI (proposed) | Millions of books | Scanning + destruction | TBD |
Preservation vs. Progress
Advocates for Preserve AI point to data scarcity in AI development. A Cornell study warned of looming shortages of high-quality human-generated text, pushing firms toward aggressive scraping.[3] By digitizing rare books, the startup argues it safeguards knowledge against fires, floods, or neglect in crumbling archives.
Yet opponents fear a slippery slope. “Once physical copies are gone, control shifts to tech gatekeepers,” warned the Authors Guild. Libraries like the New York Public Library have rejected similar overtures, prioritizing stewardship over silicon.
Venture funding for Preserve AI has surged to $150 million, valuing the firm at $1.2 billion. Investors include AI pioneers like Sequoia Capital, betting on its edge in clean, licensed datasets amid regulatory crackdowns. The Copyright Office report urges clearer guidelines, noting watermarks and quality issues plague scraped data.[3]
> “Models generate probability distributions over vocabularies, but garbage in means garbage out.” — U.S. Copyright Office AI Report[3]
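The quoted line describes how language models actually operate: at each step the model assigns a raw score (logit) to every token in its vocabulary, and a softmax converts those scores into a probability distribution that sums to 1. A minimal sketch, with an invented five-word vocabulary and made-up logits:

```python
import math

# Toy illustration of "probability distributions over vocabularies".
# The vocabulary and logit values are invented for illustration.
vocab = ["the", "book", "was", "pulped", "scanned"]
logits = [2.0, 1.0, 0.5, 0.1, 1.5]   # hypothetical model outputs

def softmax(scores):
    m = max(scores)                   # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token:>8}: {p:.3f}")

# "Garbage in, garbage out": the logits are learned from training text,
# so flawed or mislabeled data skews the resulting distribution.
```

Since the logits are learned from training data, this is where the report’s "garbage in means garbage out" warning bites: flawed input text shifts the probabilities the model produces.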
Broader Implications for AI and Publishing
The saga reflects journalism’s AI growing pains. While outlets like The New York Times thrive with human-led podcasts, the Post’s missteps highlight the cost of rushed rollouts.[2] Preserve AI’s plan could accelerate AI capabilities, but at what cost to creative industries?
As lawsuits loom, Preserve AI presses ahead with pilot scans in California warehouses. “We’re not villains; we’re visionaries,” Vargas insists. For now, the clash pits innovation against inheritance, with millions of books hanging in the balance.