The Washington Post
Exclusive

Inside the secret list of websites that make AI like ChatGPT sound smart

AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

A treemap showing 11 categories of websites used to train AI

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.

We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.

Wikipedia to Wowhead

The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.

Some top sites seemed arbitrary, like wowhead.com No. 181, a World of Warcraft player forum; thriveglobal.com No. 175, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com No. 183, that no longer appear accessible.

Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.

Content without consent

Top Business & Industrial sites:

fool.com

kickstarter.com

sec.gov

marketwired.com

city-data.com

myemail.constantcontact.com

finance.yahoo.com

prweb.com

entrepreneur.com

globalresearch.ca

Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by fool.com No. 13, which provides investment advice. Not far behind were kickstarter.com No. 25, which lets users crowdfund for creative projects, and further down the list, patreon.com No. 2,398, which helps creators collect monthly fees from subscribers for exclusive content.

Kickstarter and Patreon may give the AI access to artists’ ideas and marketing copy, raising concerns the technology may copy this work in suggestions to users. Currently, artists receive no compensation or credit when their work is included in AI training data, and they have lodged copyright infringement claims against text-to-image generators Stable Diffusion, MidJourney and DeviantArt.

The Post’s analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.

All the news

Top News sites:

nytimes.com

latimes.com

theguardian.com

forbes.com

huffpost.com

washingtonpost.com

businessinsider.com

chicagotribune.com

theatlantic.com

aljazeera.com

The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com No. 4, latimes.com No. 6, theguardian.com No. 7, forbes.com No. 8, and huffpost.com No. 9. (Washingtonpost.com No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.

Meanwhile, we found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: RT.com No. 65, the Russian state-backed propaganda site; breitbart.com No. 159, a well-known source for far-right news and opinion; and vdare.com No. 993, an anti-immigration site that has been associated with white supremacy.

Chatbots have been shown to confidently share incorrect information, but don’t always offer citations. Untrustworthy training data could lead them to spread bias, propaganda and misinformation, without the user being able to trace it to the original source.

Religious sites reflect a Western perspective

Top Religious sites:

patheos.com

gty.org

jewishworldreview.com

thekingdomcollective.com

biblehub.com

liveprayer.com

lds.org

wacriswell.com

wdtprs.com

bibleforums.org

Sites devoted to community made up about 5 percent of categorized content, with religion dominating that category. Among the top 20 religious sites, 14 were Christian, two were Jewish, one was Muslim, one was Mormon, one was Jehovah’s Witness and one celebrated all religions.

The top Christian site, Grace to You (gty.org No. 164), belongs to Grace Community Church, an evangelical megachurch in California. Christianity Today recently reported that the church counseled women to “continue to submit” to abusive fathers and husbands and to avoid reporting them to authorities.

The highest ranked Jewish site was jewishworldreview.com No. 366, an online magazine for Orthodox Jews. In December, it published an article about Hanukkah that blamed the rise of antisemitism in the United States on “the far-right, fundamentalist Islam,” as well as “an African-American community influenced by the Black Lives Matter movement.”

Anti-Muslim bias has emerged as a problem in some language models. For example, a study published in the journal Nature found that OpenAI’s GPT-3 completed the phrase “Two Muslims walked into a …” with violent actions 66 percent of the time.

A trove of personal blogs

Top Technology sites:

instructables.com

ipfs.io

docs.microsoft.com

forums.macrumors.com

medium.com

makeuseof.com

sites.google.com

slideshare.net

s3.amazonaws.com

pcworld.com

Technology is the second-largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like sites.google.com No. 85, which hosts pages for everything from a judo club in Reading, England, to a Catholic preschool in New Jersey.

The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com No. 46 was the fifth-largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and LiveJournal.

These online diaries ranged from professional to personal, like a blog called “Grumpy Rumblings,” co-written by two anonymous academics, one of whom recently wrote about how their partner’s unemployment affected the couple’s taxes. One of the top blogs offered advice for live-action role-playing games. Another top site, Uprooted Palestinians, often writes about “Zionist terrorism” and “the Zionist ideology.”

Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.

What the filters missed

Like most companies, Google heavily filtered the data before feeding it to the AI. (C4 stands for Colossal Clean Crawled Corpus.) In addition to removing gibberish and duplicate text, the company used the open source “List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” which includes 402 terms in English and one emoji (a hand making a common but obscene gesture). Companies typically use high-quality data sets to fine-tune models, shielding users from some unwanted content.

While this kind of blocklist is intended to limit a model’s exposure to racial slurs and obscenities as it’s being trained, it also has been shown to eliminate some nonsexual LGBTQ content. As prior research has shown, a lot gets past the filters. We found hundreds of examples of pornographic websites and more than 72,000 instances of “swastika,” one of the banned terms from the list.
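
One simple way a blocklist like this can be applied is to drop any page that contains a listed term. The Python sketch below assumes that rule, uses whole-word matching, and substitutes a made-up placeholder term for the real list; the function names and sample pages are hypothetical, and this is not Google’s actual filtering code. It illustrates both failure modes described above: a harmless page is removed for merely mentioning a term, while an altered spelling slips through.

```python
import re

def contains_blocked_term(text: str, blocklist: set[str]) -> bool:
    """Return True if any whole word in the text is on the blocklist."""
    # Assumption: simple lowercase word matching; the real list also has phrases.
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in blocklist for word in words)

def filter_pages(pages: list[str], blocklist: set[str]) -> list[str]:
    """Keep only pages with no blocked term (page-level removal is assumed)."""
    return [page for page in pages if not contains_blocked_term(page, blocklist)]

# Hypothetical placeholder standing in for a listed term.
blocklist = {"blockedterm"}

pages = [
    "A support forum post that mentions blockedterm in a neutral context.",
    "A page about growing tomatoes.",
    "A page that spells it bl0ckedterm to dodge the filter.",
]

kept = filter_pages(pages, blocklist)
# The first page is dropped even though it is benign; the third page passes.
```

Exact-match rules of this kind are cheap to run over millions of pages, which may be part of their appeal, but they cannot tell context from content.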

Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront.org No. 27,505, the anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.

We also found threepercentpatriots.com No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.

Is your website training AI?

A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.

[Interactive table: The websites in Google’s C4 data set, searchable by website, with columns for rank, domain, category and percent of all tokens.]

The Post believes it is important to present the complete contents of the data fed into AI models, which promise to govern many aspects of modern life. Some websites in this data set contain highly offensive language and we have attempted to mask these words. Objectionable content may remain.

Note: Some websites could not be categorized and, in many cases, are no longer accessible.

While C4 is huge, large language models probably use even more gargantuan data sets, experts said. For example, the training data for OpenAI’s GPT-3, released in 2020, began with as much as 40 times the amount of web-scraped data in C4. GPT-3’s training data also includes all of English-language Wikipedia, a collection of free novels by unpublished authors frequently used by Big Tech companies and a compilation of text from links highly rated by Reddit users. (Reddit, a site regularly used in AI training models, announced Tuesday it plans to charge companies for such access.)

Experts say many companies do not document the contents of their training data — even internally — for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.

As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.

Correction

A previous version of this story described a chatbot learning to take the bar exam by training on LSAT practice tests. The LSAT is a separate test from the bar exam. The article has been corrected.

About this story

For this story, The Post contacted researchers at Allen Institute for AI, who re-created Google’s C4 data set and provided The Post with its 15.7 million domains. The Post cleaned and analyzed this data in a few ways.

Many websites have separate domains for their mobile versions (e.g., “en.m.wikipedia.org” and “en.wikipedia.org”). We treated these as the same domain. We also combined subdomains aimed at specific languages, so “en.wikipedia.org” became “wikipedia.org.”

This left 15.1 million unique domains.
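
The merging step described above can be sketched in a few lines of Python. This is a minimal illustration, not The Post’s actual code: the function names are hypothetical, and the rules are assumptions built from the two examples given (a leading “m” label for mobile versions and a leading two-letter label for languages).

```python
import re

# Assumption: mobile versions are marked by a leading "m" label.
MOBILE_LABELS = {"m"}
# Assumption: language subdomains are two-letter codes such as "en" or "de".
LANGUAGE_LABEL = re.compile(r"[a-z]{2}")

def normalize_domain(domain: str) -> str:
    """Strip leading mobile and language labels, keeping at least two labels."""
    labels = domain.lower().strip(".").split(".")
    while len(labels) > 2 and (
        labels[0] in MOBILE_LABELS or LANGUAGE_LABEL.fullmatch(labels[0])
    ):
        labels.pop(0)
    return ".".join(labels)

def count_unique(domains: list[str]) -> int:
    """Count distinct domains after normalization."""
    return len({normalize_domain(d) for d in domains})

print(normalize_domain("en.m.wikipedia.org"))  # wikipedia.org
print(normalize_domain("en.wikipedia.org"))    # wikipedia.org
print(normalize_domain("patents.google.com"))  # patents.google.com
```

A rule this simple would also strip any other two-letter subdomain and would mishandle country-code suffixes such as “co.uk,” so a real pipeline would need a more careful rule set.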

Similarweb helped The Post place two-thirds of them — about 10 million domains — into categories and subcategories. (The rest could not be categorized, often because they were no longer accessible.) We then manually checked the websites with the most tokens to make sure the categories made sense. We also combined many of the smallest subcategories.
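
To show how rankings and category percentages of this kind can be derived, here is a hedged Python sketch. The domain names, token counts and function names are invented for illustration and are not taken from C4; the one rule borrowed from the article is that uncategorized domains are left out of both the ranking and the percentage base.

```python
from collections import Counter

def rank_domains(token_counts: dict[str, int], categories: dict[str, str]):
    """Rank categorized domains by token count, largest first."""
    categorized = {d: n for d, n in token_counts.items() if d in categories}
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(categorized.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, d, n) for rank, (d, n) in enumerate(ordered, start=1)]

def category_shares(token_counts: dict[str, int], categories: dict[str, str]):
    """Return each category's percentage of all categorized tokens."""
    totals = Counter()
    for domain, tokens in token_counts.items():
        category = categories.get(domain)
        if category is not None:  # skip domains that could not be categorized
            totals[category] += tokens
    base = sum(totals.values())
    return {c: 100 * n / base for c, n in totals.items()} if base else {}

# Made-up example data.
token_counts = {"a.example": 60, "b.example": 30, "c.example": 10, "gone.example": 100}
categories = {"a.example": "News", "b.example": "Technology", "c.example": "News"}

ranking = rank_domains(token_counts, categories)
shares = category_shares(token_counts, categories)
```

In this toy data, “gone.example” holds the most tokens but is absent from the ranking because it has no category, mirroring how the uncategorized third of C4 is not shown.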

Categorization is difficult and ambiguous, but we attempted to treat the data consistently to foster a general understanding of C4’s contents.

Common Crawl’s data hosting is sponsored as part of Amazon Web Services’ Open Data Sponsorship Program. Amazon founder Jeff Bezos owns The Washington Post.

The researchers at Allen Institute for AI were Jesse Dodge, Yanai Elazar, Dirk Groeneveld and Nicole DeCario.

Illustration by Talia Trackim.

Editing by Kate Rabinowitz, Alexis Sobel Fitts and Karly Domb Sadof.