How to identify OpenAI's crawler bot to stop it slurping websites for training data


Aww, c’mon, let us scrape your pages, we’ve got billions at stake
OpenAI, the maker of machine learning models trained on public web data, has published the specifications for its web crawler so that publishers and site owners can opt out of having their content scraped.
The newly released technical document describes how to identify OpenAI's web crawler, GPTBot, by its user agent token and string, which the company's software sends in the User-Agent HTTP request header when asking a server for a web page.
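For illustration, here's a minimal server-side sketch of that identification step, assuming the documented token "GPTBot" appears somewhere in the incoming User-Agent header (the full string OpenAI publishes also carries a browser-style prefix, so a substring check is the practical test):

```python
def is_gptbot(user_agent: str) -> bool:
    # The documented user agent token is "GPTBot"; matching on the
    # token rather than the full string tolerates version changes.
    return "GPTBot" in user_agent

# A header value of the kind OpenAI documents:
ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
      "compatible; GPTBot/1.0; +https://openai.com/gptbot")
print(is_gptbot(ua))  # True
```

Note that user agent strings are trivially spoofed, so this identifies well-behaved crawlers only; blocking by OpenAI's published IP ranges is the stronger check.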
Web publishers can thus add an entry into their web server’s robots.txt file to tell the crawler how it should behave, assuming GPTBot was designed to heed the Robots Exclusion Protocol – not all bots do so. For example, the following set of robots.txt key/value pairs would instruct GPTBot to stay out of the root directory and everything else on the site.
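Per OpenAI's published documentation, that directive pair is:

```
User-agent: GPTBot
Disallow: /
```

Swapping `Disallow: /` for a narrower path would block the crawler from only that portion of the site.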
However, OpenAI insists that allowing its bot to collect site data can improve the quality of the AI models the biz builds, and that scraping can be done without gathering sensitive information – a practice over which OpenAI and Microsoft were recently sued.
"Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies," the ML super-lab's documentation reads.
"Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."
And who wouldn’t want to save OpenAI the time and expense of making its models more capable and less risky?
Even so, OpenAI’s acknowledgement that it trains its large language models on the public internet has coincided with efforts by organizations to limit automated access to information via the web.