Inside the secret list of websites that make AI like ChatGPT sound smart (2023)

AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.

Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.

A treemap showing 11 categories of websites used to train AI

To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)

The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.

We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
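As a rough illustration of this ranking step, the sketch below counts tokens per domain and sorts the domains by that count. It is not The Post's actual pipeline: whitespace splitting is an assumed stand-in for a real tokenizer, and the function name `rank_domains_by_tokens` and the sample pages are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_tokens(pages):
    """Return (domain, token_count, share) tuples, largest first.

    `pages` is an iterable of (url, text) pairs. Whitespace splitting is an
    assumption standing in for a real tokenizer.
    """
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc.lower()
        counts[domain] += len(text.split())
    total = sum(counts.values())
    return [
        (domain, n, n / total if total else 0.0)
        for domain, n in counts.most_common()
    ]


# Hypothetical sample data, for illustration only.
sample_pages = [
    ("https://a.example/page1", "one two three four"),
    ("https://b.example/post", "one two"),
    ("https://a.example/page2", "five six"),
]

ranking = rank_domains_by_tokens(sample_pages)
for rank, (domain, n, share) in enumerate(ranking, start=1):
    print(f"No. {rank}: {domain} ({n} tokens, {share:.1%} of all tokens)")
```

Grouping by domain before sorting is what lets a site with many short pages outrank a site with a few long ones.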

Wikipedia to Wowhead

The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were No. 1, which contains text from patents issued around the world; No. 2, the free online encyclopedia; and No. 3, a subscription-only digital library. Also high on the list: No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.

Some top sites seemed arbitrary, like No. 181, a World of Warcraft player forum; No. 175, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including No. 183, that no longer appear accessible.

Others raised significant privacy concerns. Two sites in the top 100, No. 40 and No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.

Content without consent

Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by No. 13, which provides investment advice. Not far behind were No. 25, which lets users crowdfund for creative projects, and further down the list, No. 2,398, which helps creators collect monthly fees from subscribers for exclusive content.

Kickstarter and Patreon may give the AI access to artists’ ideas and marketing copy, raising concerns the technology may copy this work in suggestions to users. Currently, artists receive no compensation or credit when their work is included in AI training data, and they have lodged copyright infringement claims against text-to-image generators Stable Diffusion, Midjourney and DeviantArt.

The Post’s analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.

All the news

The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: No. 4, No. 6, No. 7, No. 8, and No. 9. (No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.

Meanwhile, we found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: No. 65, the Russian state-backed propaganda site; No. 159, a well-known source for far-right news and opinion; and No. 993, an anti-immigration site that has been associated with white supremacy.

Chatbots have been shown to confidently share incorrect information, but they don’t always offer citations. Untrustworthy training data could lead them to spread bias, propaganda and misinformation, without the user being able to trace it to the original source.

Religious sites reflect a Western perspective

Sites devoted to community made up about 5 percent of categorized content, with religion dominating that category. Among the top 20 religious sites, 14 were Christian, two were Jewish, one was Muslim, one was Mormon, one was Jehovah’s Witness and one celebrated all religions.

The top Christian site, Grace to You (No. 164), belongs to Grace Community Church, an evangelical megachurch in California. Christianity Today recently reported that the church counseled women to “continue to submit” to abusive fathers and husbands and to avoid reporting them to authorities.

The highest ranked Jewish site was No. 366, an online magazine for Orthodox Jews. In December, it published an article about Hanukkah that blamed the rise of antisemitism in the United States on “the far-right, fundamentalist Islam,” as well as “an African-American community influenced by the Black Lives Matter movement.”

Anti-Muslim bias has emerged as a problem in some language models. For example, a study published in the journal Nature Machine Intelligence found that OpenAI’s GPT-3 completed the phrase “Two Muslims walked into a …” with violent actions 66 percent of the time.

A trove of personal blogs

Technology is the second-largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like No. 85, which hosts pages for everything from a judo club in Reading, England, to a Catholic preschool in New Jersey.

The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform No. 46 was the fifth-largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and LiveJournal.

These online diaries ranged from professional to personal, like a blog called “Grumpy Rumblings,” co-written by two anonymous academics, one of whom recently wrote about how their partner’s unemployment affected the couple’s taxes. One of the top blogs offered advice for live-action role-playing games. Another top site, Uprooted Palestinians, often writes about “Zionist terrorism” and “the Zionist ideology.”

Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.

What the filters missed

Like most companies, Google heavily filtered the data before feeding it to the AI. (C4 stands for Colossal Clean Crawled Corpus.) In addition to removing gibberish and duplicate text, the company used the open source “List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” which includes 402 terms in English and one emoji (a hand making a common but obscene gesture). Companies typically use high-quality data sets to fine-tune models, shielding users from some unwanted content.
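To make the filtering idea concrete, here is a minimal sketch of a blocklist and duplicate filter. It is a simplification under stated assumptions, not Google's actual procedure: it drops whole documents, matches exact duplicates only after whitespace and case normalization, and uses placeholder terms in place of the real word list. The function name `clean_corpus` is hypothetical.

```python
import re


def clean_corpus(documents, blocklist):
    """Drop duplicate documents and any document containing a blocked term."""
    pattern = None
    if blocklist:
        # Match whole words only, ignoring case.
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in blocklist) + r")\b",
            re.IGNORECASE,
        )
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.split()).lower()
        if normalized in seen:
            continue  # duplicate text
        seen.add(normalized)
        if pattern is not None and pattern.search(normalized):
            continue  # contains a blocked term
        kept.append(doc)
    return kept


# Placeholder terms stand in for the real word list.
BLOCKLIST = ["badword", "worseword"]

docs = [
    "A page about gardening.",
    "A page  about gardening.",     # duplicate once whitespace is normalized
    "This page contains BadWord.",  # blocked term, any casing
    "A page about cooking.",
]
cleaned = clean_corpus(docs, BLOCKLIST)
print(cleaned)
```

A filter like this judges only surface strings, so it cannot tell an offensive use of a term from an innocuous one.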

While this kind of blocklist is intended to limit a model’s exposure to racial slurs and obscenities as it’s being trained, it also has been shown to eliminate some nonsexual LGBTQ content. As prior research has shown, a lot gets past the filters. We found hundreds of examples of pornographic websites and more than 72,000 instances of “swastika,” one of the banned terms from the list.

Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site No. 27,505, the anti-trans site No. 378,986, and No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.

We also found No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.

Is your website training AI?

A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit Common Crawl, a popular resource for AI models. Common Crawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.

The websites in Google’s C4 dataset

[Searchable table of websites. Columns: Rank, Domain, Category, Percent of all tokens]

The Post believes it is important to present the complete contents of the data fed into AI models, which promise to govern many aspects of modern life. Some websites in this data set contain highly offensive language and we have attempted to mask these words. Objectionable content may remain.

Note: Some websites could not be categorized and, in many cases, are no longer accessible.

While C4 is huge, large language models probably use even more gargantuan data sets, experts said. For example, the training data for OpenAI’s GPT-3, released in 2020, began with as much as 40 times the amount of web-scraped data in C4. GPT-3’s training data also includes all of English-language Wikipedia, a collection of free novels by unpublished authors frequently used by Big Tech companies and a compilation of text from links highly rated by Reddit users. (Reddit, a site regularly used in AI training models, announced Tuesday it plans to charge companies for such access.)

Experts say many companies do not document the contents of their training data — even internally — for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.

As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.


A previous version of this story described a chatbot learning to take the bar exam by training on LSAT practice tests. The LSAT is a separate test from the bar exam. The article has been corrected.

About this story

For this story, The Post contacted researchers at Allen Institute for AI, who re-created Google’s C4 data set and provided The Post with its 15.7 million domains. The Post cleaned and analyzed this data in a few ways.

Many websites have separate domains for their mobile versions (e.g., “” and “”). We treated these as the same domain. We also combined subdomains aimed at specific languages, so “” became “”.

This left 15.1 million unique domains.
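The domain merging described above can be sketched as follows. The prefix sets, the example domains and the function name `normalize_domain` are all assumptions made for illustration; the actual rules The Post applied are not given in the text.

```python
# Assumed subdomain labels; the real lists used in the analysis are not known.
MOBILE_PREFIXES = {"m", "mobile"}
LANGUAGE_PREFIXES = {"en", "es", "fr", "de"}


def normalize_domain(domain):
    """Strip leading mobile or language labels from a domain name."""
    labels = domain.lower().strip(".").split(".")
    strippable = MOBILE_PREFIXES | LANGUAGE_PREFIXES
    # Always keep at least two labels so "m.co" is not reduced to "co".
    while len(labels) > 2 and labels[0] in strippable:
        labels = labels[1:]
    return ".".join(labels)


# Hypothetical domains, for illustration only.
domains = [
    "example.com",
    "m.example.com",
    "en.example.org",
    "example.org",
    "news.example.net",
]
unique = sorted({normalize_domain(d) for d in domains})
print(unique)
```

This sketch does not consult a public suffix list, so a name such as "en.co.uk" would be handled incorrectly; a production version would need that check.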

Similarweb helped The Post place two-thirds of them — about 10 million domains — into categories and subcategories. (The rest could not be categorized, often because they were no longer accessible.) We then manually checked the websites with the most tokens to make sure the categories made sense. We also combined many of the smallest subcategories.

Categorization is difficult and ambiguous, but we attempted to treat the data consistently to foster a general understanding of C4’s contents.

Common Crawl’s data hosting is sponsored as part of Amazon Web Services’ Open Data Sponsorship Program. Amazon founder Jeff Bezos owns The Washington Post.

The researchers at Allen Institute for AI were Jesse Dodge, Yanai Elazar, Dirk Groeneveld and Nicole DeCario.

Illustration by Talia Trackim.

Editing by Kate Rabinowitz, Alexis Sobel Fitts and Karly Domb Sadof.


What is the AI website that answers questions? ›

iAsk.Ai (i Ask AI) is an advanced free AI search engine that enables users to Ask AI any question, and receive an Instant, Accurate, and Factual Answer without ever storing individual searches.

What is the most intelligent AI to talk? ›

The best overall AI chatbot is ChatGPT due to its exceptional performance, versatility, and free availability.

What is the new AI that can answer anything? ›

There's a new artificial intelligence-powered chatbot known as ChatGPT that can answer questions, generate essays and even write scientific papers from a short prompt.

What is the AI chatbot everyone is using? ›

ChatGPT, OpenAI's text-generating AI chatbot, has taken the world by storm. It's able to write essays, code and more given short text prompts, hyper-charging productivity. But it also has a more…nefarious side.

What is ChatGPT used for? ›

ChatGPT is an AI chatbot that uses natural language processing to create humanlike conversational dialogue. The language model can respond to questions and compose various written content, including articles, social media posts, essays, code and emails.

Will ChatGPT give the same answer twice? ›

Third, if you ask ChatGPT the same question twice, you might not get precisely the same answer—there's an element of randomness in its responses.
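One common source of such randomness in language models is sampling the next word from a probability distribution instead of always taking the top choice. The sketch below illustrates that general idea with a softmax and a temperature setting. It is not a description of ChatGPT's internals, and the scores, words and function name `sample_next_word` are hypothetical.

```python
import math
import random


def sample_next_word(scores, temperature=1.0, rng=random):
    """Sample one word from a dict of word -> score using softmax weights."""
    words = list(scores)
    top = max(scores.values())
    # Subtract the top score before exponentiating for numerical stability.
    weights = [math.exp((scores[w] - top) / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical scores for words that might follow "The sky is".
scores = {"blue": 2.0, "gray": 1.5, "cloudy": 1.0}
rng = random.Random(0)

varied = [sample_next_word(scores, temperature=1.0, rng=rng) for _ in range(200)]
steady = [sample_next_word(scores, temperature=0.01, rng=rng) for _ in range(200)]
print(sorted(set(varied)))
print(sorted(set(steady)))
```

At a temperature near zero the highest-scoring word is chosen almost every time, while higher temperatures spread the choices across more words, which is why repeated runs can differ.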

What is the most advanced AI right now? ›

The most advanced AI technology to date is deep learning, a technique where scientists train machines by feeding them different kinds of data. Over time, the machine makes decisions, solves problems, and performs other kinds of tasks on its own based on the data set given to it.

What is the most advanced AI device? ›

GPT-3 was released in 2020 and was the largest language model at the time of its release. It has 175 billion parameters, more than 100 times as many as its predecessor, GPT-2.

Is there a real AI I can talk to? ›

SimSimi. SimSimi is a popular emotional conversation chatbot with over 350 million users worldwide. What makes it stand out is that it can talk in around 81 languages. Thanks to SimSimi's great conversation engine, you can talk for hours.

What is the most real looking AI? ›

Sophia. Sophia is considered the most advanced humanoid robot. When Sophia debuted in 2016, she was one of a kind, and her interaction with people was unlike anything previously seen in a machine.

What is the AI everyone is talking to? ›

For the unfamiliar, ChatGPT is an artificial intelligence language model that understands and generates human language.

What AI can't do today? ›

AI cannot answer questions requiring inference, a nuanced understanding of language, or a broad understanding of multiple topics.

Which AI is better than ChatGPT? ›

30 Best ChatGPT alternatives for you to choose from
  • Chatsonic.
  • OpenAI playground.
  • Jasper Chat.
  • Bard AI.
  • LaMDA (Language Model for Dialog Applications)
  • Socratic.
  • Bing AI.
  • DialoGPT.

How to use ChatGPT to make money? ›

Email affiliate marketing is one of the easiest ways to make money using ChatGPT. The chatbot is good at writing emails in various ways and can persuade the user to click on a link to buy products or subscribe to a service. You can add your affiliate link and that will likely make money for you.

How to use ChatGPT for free? ›

1. Use ChatGPT 4 for Free on ForeFront AI
  1. Open (visit) and create an account.
  2. Next, choose the “GPT-4” model from the drop-down menu and select “HelpfulAssistant” as the Persona.
  3. Now, your ChatGPT 4 bot is ready to use. Type your ChatGPT prompt and wait for a response from the bot.

How do I log into ChatGPT? ›

How to log in to ChatGPT:
  1. On the web browser, go to the ChatGPT portion of the OpenAI website.
  2. Choose “Log in” on the page.
  3. Type your email address and password and select the “Log in” option on the page.

Can you train ChatGPT on your own data? ›

Yes, you can train ChatGPT on custom data through fine-tuning. Fine-tuning involves taking a pre-trained language model, such as GPT, and then training it on a specific dataset to improve its performance in a specific domain.

Does ChatGPT have an app? ›

Is ChatGPT on Android? No, there is no ChatGPT Android-specific service for smartphone users, and there is no ChatGPT Android app from OpenAI. The ChatGPT service is accessible via Android devices, just as it is on desktop or laptop computers – via the OpenAI ChatGPT page.

How much does ChatGPT cost? ›

ChatGPT Plus is the premium version of ChatGPT, priced at $20 per month, and provides exclusive access to GPT-4, the latest version of OpenAI's disruptive software.

Can you get caught using ChatGPT? ›

Yes. Use of ChatGPT can be detected by various methods, such as plagiarism detection tools, stylometric analysis tools, code quality analysis tools, and other AI detectors.

Does ChatGPT have unique answers? ›

No, ChatGPT does not give the exact same answer and wording to everyone who asks the same question. While it may generate similar responses for identical or similar queries, it can also produce different responses based on the specific context, phrasing, and quality of input provided by each user.

What AI app is everyone using on Facebook? ›

If you've logged on to any social media app this week, you've probably seen pictures of your friends, but re-imagined as fairy princesses, animé characters, or celestial beings. It is all because of Lensa, an app which uses artificial intelligence to render digital portraits based on photos users submit.

Is there self aware AI? ›

The final type of AI is self-aware AI. This will be when machines are not only aware of emotions and mental states of others, but also their own. When self-aware AI is achieved we would have AI that has human-level consciousness and equals human intelligence with the same needs, desires and emotions.

What are the 4 types of AI? ›

4 main types of artificial intelligence
  • Reactive machines. Reactive machines are AI systems that have no memory and are task specific, meaning that an input always delivers the same output. ...
  • Limited memory. The next type of AI in its evolution is limited memory. ...
  • Theory of mind. ...
  • Self-awareness.

Is there a super intelligent AI? ›

Artificial superintelligence (ASI) is a software-based system with intellectual powers beyond those of humans across a comprehensive range of categories and fields of endeavor. ASI doesn't exist yet and is a hypothetical state of AI.

What is GPT-3? ›

Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model released in 2020 that uses deep learning to produce human-like text. When given a prompt, it will generate text that continues the prompt.

What is the AI website that knows everything? ›

Meet ChatGPT: The Artificial Intelligence (AI) Chatbot That Knows Everything.

Is ChatGPT worth the hype? ›

ChatGPT has advanced beyond the traditional chatbot functionality, and we can now engage in intellectually stimulating conversations. Plus, its ability to generate human-like responses can, and certainly has, revolutionised several industries, including machine learning acceleration.

Can AI hold a conversation? ›

Conversational AI is a set of technologies that enable computers to understand and process natural language inputs so they can 'talk' with humans. Put simply, they allow us to interact with machines in the same way that we do with other humans.

What AI turns pictures into real people? ›

Generate Realistic Face Images From Text

Powered by artificial intelligence and deep machine learning, Fotor's AI face generator lets you create realistic human faces from scratch in seconds.

Which AI generates realistic faces? ›

FaceApp is a popular face generator app that uses advanced AI technology to generate realistic faces. The app offers a range of features, including face swapping, aging, and gender swapping.

Does Google have a real AI? ›

Google Cloud brings generative AI to developers, businesses, and governments. Google Cloud announces generative AI support in Vertex AI and Generative AI App Builder, helping businesses and governments build gen apps.

How many people are using ChatGPT? ›

How Many ChatGPT Users Are There? According to the latest available data, ChatGPT currently has over 100 million users. And the website currently generates 1 billion visitors per month. This user and traffic growth was achieved in a record-breaking two-month period (from December 2022 to February 2023).

Will AI ever outsmart humans? ›

The AI can outsmart humans, finding solutions that fulfill a brief but in ways that misalign with the creator's intent. On a simulator, that doesn't matter. But in the real world, the outcomes could be a lot more insidious.

What are experts saying about ChatGPT? ›

“Our research shows that large language models such as ChatGPT are likely to reinforce inequality, reinforce social fragmentation, remake labor and expertise, accelerate the thirst for data and accelerate environmental injustice, due to the homogeneity of the development landscape, nature of the datasets, and lack of ...

What is the craziest thing AI can do? ›

Artificial intelligence can even master creative processes, including making visual art, writing poetry, composing music, and taking photographs. Google's AI was even able to create its own AI “child”—that outperformed human-made counterparts.

What are two things AI can do that humans can't? ›

AI can filter email spam, categorize and classify documents based on tags or keywords, launch or defend against missile attacks, and assist in complex medical procedures. However, if people feel that AI is unpredictable and unreliable, collaboration with this technology can be undermined by an inherent distrust of it.

What is the real danger with AI? ›

The two main types of bias in AI are “data bias” and “societal bias.” Data bias is when the data used to develop and train an AI is incomplete, skewed, or invalid. This can be because the data is incorrect, excludes certain groups, or was collected in bad faith.

What is Google's version of ChatGPT? ›

Google is opening public access to the conversational computer program Bard, its answer to the viral chatbot ChatGPT, while stopping short of integrating the new tool into its flagship search engine.

How to make money with ChatGPT 2023? ›

20+ ways to earn money using ChatGPT in 2023
  1. Content Writing Services. Marketing teams are always looking for writers who can deliver high-quality content fast. ...
  2. Copywriting. ...
  3. Write and sell comic books. ...
  4. Create coloring /painting books. ...
  5. Become an assistant tutor. ...
  6. Start a food recipe blog. ...
  7. Book reviews. ...
  8. Video scripts.

What is the best way to use ChatGPT? ›

Using the ChatGPT chatbot itself is fairly simple, as all you have to do is type in your text and receive the information. The key here is to be creative and see how your ChatGPT responds to different prompts. If you don't get the intended result, try tweaking your prompt or giving ChatGPT further instructions.

How do I get the best from ChatGPT? ›

10 ChatGPT Tips for The Best Results
  1. Be specific on word count and put higher than you need. ...
  2. Don't be afraid to ask it to add more. ...
  3. Understand its limitations. ...
  4. It's better at creating outlines rather than full pieces of content. ...
  5. Make sure your request is clear and concise. ...
  6. You can ask it to reformulate its response.

How long is ChatGPT free for? ›

Yes, ChatGPT is unlimited in use and free to use just as long as you can access it.

Is ChatGPT forever free? ›

ChatGPT based on GPT-4, the popular artificial intelligence technology, can now be used without any restrictions or costs. Previously, the technology was available for a fee, with many users facing censoring, bans and other limitations.

How to use ChatGPT without a phone number? ›

Use Bing. Microsoft's search engine now has its own version of ChatGPT that you can use without verifying your phone number. You will need to have a Microsoft account, but unlike signing up through OpenAI, you can get a Microsoft account using a VoIP phone number (like the free ones available through Google Voice).

What is the website where AI talks? ›

Replika. With over 10 million users, Replika is one of the most popular and advanced AI companions. Unlike traditional chatbots, Replika can recognize images and continue the conversation using them.

What is the AI website that does homework? ›

Socratic, a revolutionary app powered by Google AI, transforms how students learn and complete homework assignments. With its advanced artificial intelligence technology, Socratic offers step-by-step solutions to problems in various subjects, including math, science, and history.

What website is the AI art generator? ›

NightCafe Creator is an AI Art Generator app with multiple methods of AI art generation. Using neural style transfer you can turn your photo into a masterpiece. Using text-to-image AI, you can create an artwork from nothing but words on a page. Enter a text prompt, and the generator will make stunning images.

What is the name of the chat AI? ›

ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI and released in November 2022. It is built on top of OpenAI's GPT-3.5 and GPT-4 foundational large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques.

Where can I chat with OpenAI? ›

If you already have an account, simply login and use the "Help" button to start a conversation. If you don't have an account or can't login, you can still reach us by selecting the chat bubble icon in the bottom right of

Which AI website can draw anything? ›

starryai is an AI art generator app. You simply enter a text prompt and our AI transforms your words into works of art. AI Art generation is usually a laborious process that requires technical expertise, we make that process simple and intuitive. starryai is available for free on iOS and Android.

What app has all homework answers? ›

  • Course Hero: Homework Helper. Education.
  • PhotoStudy - Live Study Help. Education.
  • Bartleby: Math Homework Helper. Education.
  • Chegg Study - Homework Help. Education.
  • MathPapa - Algebra Calculator. Education.
  • Mathway: Math Problem Solver. Education.

What's an app that gives you answers to homework? ›

Brainly is the World's Largest Social Learning community and homework App!

What is the fastest AI generator? ›

Deep Dream Generator is considered one of the fastest AI image generator tools with thousands of artistic styles available.

What app are people using for AI art generator? ›

Craiyon is a popular AI art generator app, formerly called Dall-e mini, which generates images in seconds based on your input prompts in the text box.

What is the most intelligent website in the world? ›

Lucid.AI is the world's largest and most complete general knowledge base and common-sense reasoning engine.

Which is best AI website? ›

The following are the best AI websites we found to streamline work and carry out business tasks as a professional or team:
  • 10Web. 10Web is an AI-powered WordPress platform that features an automated website builder, hosting, and PageSpeed booster. ...
  • Landbot. ...
  • ...
  • Pfpmaker. ...
  • Brandmark. ...
  • Krisp. ...
  • Glasp. ...
  • Rytr.


Article information

Author: Corie Satterfield

Last Updated: 12/22/2023
