For more than a decade, the nonprofit Common Crawl “has been scraping billions of webpages to build a massive archive of the internet,” notes The Atlantic, and it has made that archive freely available for research.
“In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.
“In the process, my reporting has found, Common Crawl has opened a back door for AI companies to…
