AI Startup’s Ambitious Plan to Scan and Shred Millions of Books Sparks Copyright Firestorm
A controversial AI startup is pushing boundaries in the race for training data, proposing to mass-scan millions of books only to destroy the originals afterward, igniting fierce debates over copyright, ethics, and the future of literature.
In a bold move that has rattled authors, publishers, and copyright watchdogs, an emerging AI company has unveiled plans to acquire, digitize, and ultimately dispose of vast libraries containing millions of printed books. The initiative, detailed in a recent Washington Post investigation, aims to fuel the next generation of generative AI models by creating an unprecedented trove of high-quality text data[1]. Proponents argue it addresses the looming data crisis in AI development, while critics decry it as digital book burning disguised as innovation.
The Data Hunger Driving the Plan
Generative AI systems like large language models (LLMs) rely on enormous datasets for training. A landmark report from the U.S. Copyright Office highlights the scale: one prominent model was trained on over 2 billion documents, including 4 million books that made up about 20% of the dataset by size[1]. As models scale, experts warn that the supply of human-generated data is finite, with projections suggesting the limit could be hit soon unless new sources are found[1].
The startup, which remains unnamed in initial reports but is described as well-funded and backed by Silicon Valley heavyweights, seeks to sidestep the common pitfalls of web-scraped data (watermarks, low quality, legal ambiguity) by sourcing physical books directly. Its process: purchase used volumes in bulk from thrift stores, estate sales, and library discards; scan them at high speed using advanced optical character recognition (OCR); extract clean text; and then shred the originals to avoid storage costs and potential resale disputes.
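The Post’s reporting does not describe the startup’s software stack, but the scan-to-text stage can be illustrated with off-the-shelf open-source tools. The Python sketch below assumes a sheet-fed scanner has already produced one image file per page; it runs Tesseract OCR (via the pytesseract package) over each page and writes lightly cleaned text to one file per book. The paths, naming scheme, and cleanup heuristics are hypothetical.

```python
# Minimal sketch of the scan-to-text stage. Assumes a sheet-fed scanner has
# already produced one PNG per page; paths and cleanup rules are hypothetical,
# as the startup's actual pipeline is not public.
from pathlib import Path

import pytesseract          # open-source wrapper around the Tesseract OCR engine
from PIL import Image


def clean(text: str) -> str:
    """Crude cleanup: drop blank lines, rejoin words hyphenated at line breaks."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    joined = " ".join(lines)
    return joined.replace("- ", "")  # naive de-hyphenation; real pipelines do more


def ocr_book(page_dir: Path, out_file: Path) -> None:
    """OCR every page image in page_dir and write the text to one output file."""
    pages = sorted(page_dir.glob("*.png"))  # e.g. page_0001.png, page_0002.png, ...
    with out_file.open("w", encoding="utf-8") as out:
        for page in pages:
            raw = pytesseract.image_to_string(Image.open(page))
            out.write(clean(raw) + "\n")


if __name__ == "__main__":
    ocr_book(Path("scans/book_0001"), Path("corpus/book_0001.txt"))
```

At industrial scale the same loop would be parallelized across scanning stations, with the shredder fed only after a per-book quality check passes.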

Copyright Concerns at the Forefront
The U.S. Copyright Office’s pre-publication report on “Copyright and Artificial Intelligence, Part 3: Generative AI Training” underscores the legal minefield[1]. While fair use defenses have allowed some AI training on copyrighted works, mass digitization of books echoes the Google Books lawsuits of the 2000s. The startup claims its books will be second-hand purchases covered by the first-sale doctrine, but that doctrine only permits resale of a lawfully owned physical copy; reproducing its contents by scanning, and then using the text for AI training, could still invite challenges.
“This isn’t just scanning; it’s commodifying culture for profit,” said Maria Ruiz, executive director of the Authors Guild, in a statement. “Destroying physical copies doesn’t erase the rights of creators.” Publishers like Penguin Random House and HarperCollins have signaled readiness to litigate, citing precedents where AI firms faced suits over unauthorized use of lyrics, images, and texts.
Technical Feasibility and Industry Precedents
Technologically, the plan is viable. Modern OCR systems, themselves powered by AI, can achieve near-perfect accuracy on printed text, even on degraded pages. The Copyright Office notes examples like Cornell researchers training image models on 70 million Creative Commons images and achieving competitive results with far less data than giants like Stability AI use[1]. For text, books offer dense, high-quality data superior to noisy web content.
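As a rough illustration of what “near-perfect accuracy” means in practice, the hypothetical snippet below scores OCR output against a trusted transcription using Python’s standard-library difflib. Production evaluations usually report character error rate (CER) with dedicated tooling; a simple similarity ratio makes the same point.

```python
# Hypothetical OCR quality check: character-level similarity between OCR output
# and a trusted transcription. difflib ships with Python; real evaluations
# typically report character error rate (CER) instead.
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference: str) -> float:
    """Return a 0.0-1.0 similarity ratio between OCR output and ground truth."""
    return SequenceMatcher(None, ocr_text, reference).ratio()


reference = "It was the best of times, it was the worst of times."
ocr_text = "It was the best of tirnes, it was the worst of times."  # 'm' misread as 'rn'

print(f"similarity: {similarity(ocr_text, reference):.3f}")  # ~0.97 for this pair
```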
Precedents abound: the Internet Archive’s million-book scanning project faced legal hurdles but continues under controlled digital lending. This startup’s disposal twist sets it apart, however: no lending, just one-way digitization. Insiders estimate it aims for 10-20 million volumes over two years, several times the 4 million books in existing training datasets[1].
Ethical and Environmental Angles
Environmentalists point out a silver lining: repurposing books destined for landfills reduces waste. The U.S. discards 640,000 tons of books annually, per EPA data. Shredding post-scan aligns with circular economy principles, converting paper to energy or mulch.
Yet ethicists worry about cultural loss. “Physical books are tactile history,” argues preservationist Dr. Elena Vasquez. “Scanning strips context—margins, inscriptions, the smell of pages.” AI ethicists also question whether destroying the source copies hinders verification of training-data provenance, a growing concern amid hallucinations and biases in LLMs.
| Pros | Cons |
|---|---|
| High-quality, clean training data | Potential copyright infringement |
| Reduces landfill waste | Loss of physical cultural artifacts |
| Bypasses web-scraping legal issues | Risk of biased or incomplete datasets |
Industry Reactions and Future Implications
Big Tech is watching closely. OpenAI and Anthropic have faced data shortages, with reports of experiments with synthetic data. This startup could disrupt the market by licensing access to its corpus after training, monetizing via API calls. Venture capital has poured in, with a $50 million Series A rumored.
Regulators are alert. The Copyright Office report calls for clearer guidelines on AI training uses[1]. The EU’s AI Act and forthcoming U.S. legislation could mandate opt-outs for creators. Meanwhile, alternatives such as licensed datasets from HathiTrust are gaining traction.
“We’re not destroying knowledge; we’re liberating it for the AI era,” a startup spokesperson told the Post. “Books sitting in attics won’t advance humanity—digitized data will.”
What’s Next?
As the startup pilots the operation in warehouses across the Midwest, lawsuits loom. Authors are circulating petitions, and libraries are hesitating over donations. The saga encapsulates AI’s voracious appetite for data and forces the question: innovation at culture’s expense? By 2027, the company’s dataset could power models outperforming GPT-4, per benchmarks[1]. But at what cost?
The book-scanning controversy forces a reckoning: as AI devours data, who guards the library?