
Reddit to Block Internet Archive as AI Companies Have Scraped Data From Wayback Machine

Reddit has announced plans to significantly restrict the Internet Archive’s Wayback Machine from indexing its platform, citing concerns that AI companies have been exploiting the archival service to circumvent Reddit’s data protection policies. 

The move represents another escalation in Reddit’s ongoing battle to control access to its user-generated content amid the AI training data boom.


Key Takeaways

1. The Wayback Machine will only be able to archive Reddit's homepage, not individual posts or comments.
2. AI companies were using archived data to bypass Reddit's direct access restrictions.
3. Reddit prefers paid licensing deals over free data access.

Blocking Wayback Machine Access

Starting today, Reddit will begin what it calls “ramping up” restrictions, blocking the Wayback Machine from accessing post detail pages, comment threads, and user profiles.

The Internet Archive will only retain the ability to index Reddit’s homepage, effectively limiting historical records to snapshots of trending headlines and popular posts on given dates.

“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” Reddit spokesperson Tim Rathschmidt explained.

The company says it has identified specific instances in which AI training companies pulled Reddit data from archived copies. Those copies are served by the Wayback Machine rather than by Reddit, so they fall outside the platform’s robots.txt rules, API rate limiting, and crawler blocking mechanisms.

Reddit’s technical implementation will likely involve updating its robots.txt file with specific User-Agent strings targeting Internet Archive crawlers, while potentially implementing server-side blocking based on IP ranges associated with the Wayback Machine’s infrastructure. 
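As a rough illustration of the two mechanisms described above, the sketch below checks URLs against a robots.txt rule set and tests a client address against a blocked network range. Everything specific in it is an assumption made for the example: the `archive.org_bot` user-agent token, the `/r/` and `/user/` paths, and the placeholder network are not taken from Reddit's actual configuration.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the user-agent token and paths are illustrative
# assumptions, not a copy of Reddit's actual robots.txt.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /r/
Disallow: /user/
"""

# Placeholder network from the documentation-only range (RFC 5737),
# standing in for whatever ranges an operator might choose to block.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def ip_is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside a blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


rules = load_rules(ROBOTS_TXT)
agent = "archive.org_bot"

# Print whether the hypothetical crawler may fetch each example URL.
for url in (
    "https://www.reddit.com/",
    "https://www.reddit.com/r/example/comments/abc123/",
    "https://www.reddit.com/user/example",
):
    print(url, rules.can_fetch(agent, url))

# Print whether an example client address would be refused server-side.
print("192.0.2.10 blocked:", ip_is_blocked("192.0.2.10"))
```

With these assumed rules, the homepage is permitted while the thread and profile URLs are refused. Because robots.txt is honored voluntarily by crawlers, a server-side check such as `ip_is_blocked` is the part that would actually enforce a block.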

This approach mirrors the platform’s recent strategy of blocking search engine crawlers unless companies enter paid licensing agreements.

This restriction forms part of Reddit’s comprehensive approach to monetizing its data assets in the AI era. 

The platform has entered into significant deals with Google and OpenAI for official data access, while simultaneously pursuing legal action against companies like Anthropic for allegedly continuing to scrape content after claiming to have stopped.


Reddit’s 2023 API pricing changes, which effectively shuttered popular third-party applications, were justified using similar reasoning about preventing unauthorized AI training.

The company has implemented rate limiting, authentication requirements, and usage monitoring across its technical infrastructure to maintain control over data access.

Mark Graham, director of the Wayback Machine, acknowledged ongoing discussions with Reddit about the matter, suggesting potential technical solutions may be explored. 

However, Reddit’s position appears firm: until the Internet Archive can guarantee compliance with platform policies on user privacy and respect for content deletions, access will remain severely limited.

This development highlights the growing tension between open web archival principles and commercial data control in the AI training landscape.


The post Reddit to Block Internet Archive as AI Companies Have Scraped Data From Wayback Machine appeared first on Cyber Security News.

