Researchers have developed a novel method to automatically identify which web scrapers supply data to which large language models (LLMs). The technique hosts dynamic websites that serve a unique "canary token" to each visiting scraper. Researchers then prompt the LLMs: if a model consistently generates output containing a given token, they infer that the scraper which received that token is supplying data to that model. Experiments across 22 production LLM systems demonstrated that the approach reliably identifies previously unknown scraper-LLM connections, giving unprivileged third parties insight into data sourcing and a potential means of controlling unwanted scraping.
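The canary-token idea described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the HMAC-based token derivation, the `SECRET_KEY` value, the scraper identifiers, the `min_fraction` threshold, and both function names are hypothetical choices made only for illustration, and the summary does not say how scrapers are told apart or how "consistently" is measured.

```python
import hashlib
import hmac

# Hypothetical secret used to derive tokens; not taken from the paper.
SECRET_KEY = b"hypothetical-secret"


def canary_token(scraper_id: str, key: bytes = SECRET_KEY) -> str:
    """Derive a unique, reproducible token for one scraper identity."""
    digest = hmac.new(key, scraper_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "canary-" + digest[:16]


def attribute_scrapers(responses, scraper_ids, min_fraction=0.5, key=SECRET_KEY):
    """Return the scraper ids whose token appears in at least
    min_fraction of the collected LLM responses (an assumed criterion)."""
    if not responses:
        return []
    matches = []
    for scraper_id in scraper_ids:
        token = canary_token(scraper_id, key)
        hits = sum(token in response for response in responses)
        if hits / len(responses) >= min_fraction:
            matches.append(scraper_id)
    return matches
```

In this sketch, the dynamic website would embed `canary_token(scraper_id)` in the page served to each scraper, and `attribute_scrapers` would later be run over the responses collected from one LLM to see which scraper's token the model reproduces.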
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Provides a method for identifying data sources for LLMs, potentially enabling better control over web scraping and data provenance.
RANK_REASON The cluster contains an academic paper detailing a new research method.