AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations.
Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites.
Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.
A treemap showing 11 categories of websites used to train AI
To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what data sets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
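The ranking step described above can be sketched roughly as follows. This is a minimal illustration, not The Post's actual pipeline: the function name `rank_domains_by_tokens` and the `(domain, text)` input shape are hypothetical, and whitespace splitting stands in for whatever tokenizer was really used.

```python
from collections import Counter


def rank_domains_by_tokens(pages):
    """Rank domains by total token count, largest first.

    `pages` is an iterable of (domain, text) pairs. Both the input shape
    and the tokenizer are assumptions made for illustration.
    """
    counts = Counter()
    for domain, text in pages:
        # Assumption: splitting on whitespace approximates tokenization.
        counts[domain] += len(text.split())
    # most_common() returns (domain, token_count) pairs in descending order.
    return counts.most_common()


sample_pages = [
    ("a.example", "one two three"),
    ("b.example", "one"),
    ("a.example", "four five"),
]
ranking = rank_domains_by_tokens(sample_pages)
```

A domain's rank is then its position in the returned list, so the first entry corresponds to "No. 1."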
Wikipedia to Wowhead
The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
Some top sites seemed arbitrary, like wowhead.com No. 181, a World of Warcraft player forum; thriveglobal.com No. 175, a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com No. 183, that no longer appear accessible.
Others raised significant privacy concerns. Two sites in the top 100, coloradovoters.info No. 40 and flvoters.com No. 73, had privately hosted copies of state voter registration databases. Though voter data is public, the models could use this personal information in unknown ways.
Content without consent
Top Business & Industrial sites:
Business and industrial websites made up the biggest category (16 percent of categorized tokens), led by fool.com No. 13, which provides investment advice. Not far behind were kickstarter.com No. 25, which lets users crowdfund for creative projects, and further down the list, patreon.com No. 2,398, which helps creators collect monthly fees from subscribers for exclusive content.
Kickstarter and Patreon may give the AI access to artists’ ideas and marketing copy, raising concerns the technology may copy this work in suggestions to users. Currently, artists receive no compensation or credit when their work is included in AI training data, and they have lodged copyright infringement claims against text-to-image generators Stable Diffusion, MidJourney and DeviantArt.
The Post’s analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.
All the news
Top News sites:
The News and Media category ranks third across categories. But half of the top 10 sites overall were news outlets: nytimes.com No. 4, latimes.com No. 6, theguardian.com No. 7, forbes.com No. 8, and huffpost.com No. 9. (Washingtonpost.com No. 11 was close behind.) Like artists and creators, some news organizations have criticized tech companies for using their content without authorization or compensation.
Meanwhile, we found several media outlets that rank low on NewsGuard’s independent scale for trustworthiness: RT.com No. 65, the Russian state-backed propaganda site; breitbart.com No. 159, a well-known source for far-right news and opinion; and vdare.com No. 993, an anti-immigration site that has been associated with white supremacy.
Chatbots have been shown to confidently share incorrect information, but they don’t always offer citations. Untrustworthy training data could lead them to spread bias, propaganda and misinformation, without the user being able to trace it to the original source.
Religious sites reflect a Western perspective
Top Religious sites:
Sites devoted to community made up about 5 percent of categorized content, with religion dominating that category. Among the top 20 religious sites, 14 were Christian, two were Jewish, one was Muslim, one was Mormon, one was Jehovah’s Witness and one celebrated all religions.
The top Christian site, Grace to You (gty.org No. 164), belongs to Grace Community Church, an evangelical megachurch in California. Christianity Today recently reported that the church counseled women to “continue to submit” to abusive fathers and husbands and to avoid reporting them to authorities.
The highest ranked Jewish site was jewishworldreview.com No. 366, an online magazine for Orthodox Jews. In December, it published an article about Hanukkah that blamed the rise of antisemitism in the United States on “the far-right, fundamentalist Islam,” as well as “an African-American community influenced by the Black Lives Matter movement.”
Anti-Muslim bias has emerged as a problem in some language models. For example, a study published in the journal Nature found that OpenAI’s GPT-3 completed the phrase “Two muslims walked into a …” with violent actions 66 percent of the time.
A trove of personal blogs
Top Technology sites:
Technology is the second-largest category, making up 15 percent of categorized tokens. This includes many platforms for building websites, like sites.google.com No. 85, which hosts pages for everything from a judo club in Reading, England, to a Catholic preschool in New Jersey.
The data set contained more than half a million personal blogs, representing 3.8 percent of categorized tokens. Publishing platform medium.com No. 46 was the fifth-largest technology site and hosts tens of thousands of blogs under its domain. Our tally includes blogs written on platforms like WordPress, Tumblr, Blogspot and LiveJournal.
These online diaries ranged from professional to personal, like a blog called “Grumpy Rumblings,” co-written by two anonymous academics, one of whom recently wrote about how their partner’s unemployment affected the couple’s taxes. One of the top blogs offered advice for live-action role-playing games. Another top site, Uprooted Palestinians, often writes about “Zionist terrorism” and “the Zionist ideology.”
Social networks like Facebook and Twitter — the heart of the modern web — prohibit scraping, which means most data sets used to train AI cannot access them. Tech giants like Facebook and Google that are sitting on mammoth troves of conversational data have not been clear about how personal user information may be used to train AI models that are used internally or sold as products.
What the filters missed
Like most companies, Google heavily filtered the data before feeding it to the AI. (C4 stands for Colossal Clean Crawled Corpus.) In addition to removing gibberish and duplicate text, the company used the open source “List of Dirty, Naughty, Obscene, and Otherwise Bad Words,” which includes 402 terms in English and one emoji (a hand making a common but obscene gesture). Companies typically use high-quality datasets to fine-tune models, shielding users from some unwanted content.
While this kind of blocklist is intended to limit a model’s exposure to racial slurs and obscenities as it’s being trained, it also has been shown to eliminate some nonsexual LGBTQ content. As prior research has shown, a lot gets past the filters. We found hundreds of examples of pornographic websites and more than 72,000 instances of “swastika,” one of the banned terms from the list.
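A blocklist filter of this kind can be sketched in a few lines. The article does not spell out the exact matching rule, so this example assumes a page is dropped if any listed term appears as a whole word or phrase, ignoring case. The function names and the placeholder terms are hypothetical.

```python
import re


def build_blocklist_pattern(terms):
    """Compile one case-insensitive pattern matching any term as a whole word."""
    # Longest terms first so multi-word phrases win over their prefixes.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_pages(pages, terms):
    """Return only the pages that contain no blocklisted term."""
    if not terms:
        return list(pages)
    pattern = build_blocklist_pattern(terms)
    return [page for page in pages if not pattern.search(page)]


# Placeholder terms, standing in for the real list.
blocklist = ["badword", "worse phrase"]
pages = [
    "A perfectly clean page.",
    "This page has a BadWord in it.",
    "Here badwords appears only in the plural.",
]
kept = filter_pages(pages, blocklist)
```

In this sketch the third page survives because "badwords" is not an exact whole-word match. Simple string matching is blunt in both directions: it lets variants through and also discards any page that merely mentions a listed term.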
Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront.org No. 27,505, the anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.
We also found threepercentpatriots.com No. 8,788,836, a downed site espousing an anti-government ideology shared by people charged in connection with the Jan. 6, 2021, attack on the U.S. Capitol. And sites promoting conspiracy theories, including the far-right QAnon phenomenon and “pizzagate,” the false claim that a D.C. pizza joint was a front for pedophiles, were also present.
Is your website training AI?
A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.
The websites in Google’s C4 dataset
The Post believes it is important to present the complete contents of the data fed into AI models, which promise to govern many aspects of modern life. Some websites in this data set contain highly offensive language and we have attempted to mask these words. Objectionable content may remain.
Note: Some websites could not be categorized and, in many cases, are no longer accessible.
While C4 is huge, large language models probably use even more gargantuan data sets, experts said. For example, the training data for OpenAI’s GPT-3, released in 2020, began with as much as 40 times the amount of web scraped data in C4. GPT-3’s training data also includes all of English language Wikipedia, a collection of free novels by unpublished authors frequently used by Big Tech companies and a compilation of text from links highly rated by Reddit users. (Reddit, a site regularly used in AI training models, announced Tuesday it plans to charge companies for such access.)
Experts say many companies do not document the contents of their training data — even internally — for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.
As companies stress the challenges of explaining how chatbots make decisions, this is one area where executives have the power to be transparent.
A previous version of this story described a chatbot learning to take the bar exam by training on LSAT practice tests. The LSAT is a separate test from the bar exam. The article has been corrected.
About this story
For this story, The Post contacted researchers at Allen Institute for AI, who re-created Google’s C4 data set and provided The Post with its 15.7 million domains. The Post cleaned and analyzed this data in a few ways.
Many websites have separate domains for their mobile versions (e.g., “en.m.wikipedia.org” and “en.wikipedia.org”). We treated these as the same domain. We also combined subdomains aimed at specific languages, so “en.wikipedia.org” became “wikipedia.org.”
This left 15.1 million unique domains.
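The domain merging described above might look something like the sketch below. The exact rules The Post applied are not given, so the label sets here are illustrative assumptions, and `normalize_domain` is a hypothetical helper. It also naively treats the last two labels as the site's base domain, which would mishandle suffixes such as "co.uk."

```python
# Illustrative assumptions: the real lists of mobile and language labels
# used in the analysis are not described in the story.
MOBILE_LABELS = {"m", "mobile"}
LANGUAGE_LABELS = {"en", "de", "fr", "es"}


def normalize_domain(domain):
    """Collapse mobile and language-specific subdomains into one domain."""
    labels = domain.lower().split(".")
    base = labels[-2:]  # naive base domain, e.g. ["wikipedia", "org"]
    kept = [
        label
        for label in labels[:-2]
        if label not in MOBILE_LABELS and label not in LANGUAGE_LABELS
    ]
    return ".".join(kept + base)


raw_domains = ["en.m.wikipedia.org", "en.wikipedia.org", "patents.google.com"]
unique_domains = {normalize_domain(d) for d in raw_domains}
```

Other subdomains, such as "patents" in patents.google.com, are left alone, which matches how the story reports that site separately.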
Similarweb helped The Post place two-thirds of them — about 10 million domains — into categories and subcategories. (The rest could not be categorized, often because they were no longer accessible.) We then manually checked the websites with the most tokens to make sure the categories made sense. We also combined many of the smallest subcategories.
Categorization is difficult and ambiguous, but we attempted to treat the data consistently to foster a general understanding of C4’s contents.
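Figures such as "16 percent of categorized tokens" imply that uncategorized domains are left out of the denominator. A minimal sketch of that calculation, assuming two hypothetical lookup tables (`domain_tokens` and `domain_category`) and made-up sample numbers, could be:

```python
from collections import defaultdict


def category_shares(domain_tokens, domain_category):
    """Return each category's percentage of all categorized tokens."""
    totals = defaultdict(int)
    for domain, tokens in domain_tokens.items():
        category = domain_category.get(domain)
        if category is None:
            continue  # uncategorized domains do not count toward the total
        totals[category] += tokens
    categorized = sum(totals.values())
    if categorized == 0:
        return {}
    return {cat: 100 * count / categorized for cat, count in totals.items()}


# Made-up numbers for illustration only.
domain_tokens = {"a.example": 60, "b.example": 20, "c.example": 20, "gone.example": 100}
domain_category = {"a.example": "News", "b.example": "Technology", "c.example": "Technology"}
shares = category_shares(domain_tokens, domain_category)
```

Here "gone.example" has no category, so its 100 tokens are ignored and the shares are computed over the remaining 100.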
Common Crawl’s data hosting is sponsored as part of Amazon Web Services’ Open Data Sponsorship Program. Amazon founder Jeff Bezos owns The Washington Post.
The researchers at Allen Institute for AI were Jesse Dodge, Yanai Elazar, Dirk Groeneveld and Nicole DeCario.
Illustration by Talia Trackim.
Editing by Kate Rabinowitz, Alexis Sobel Fitts and Karly Domb Sadof.
iAsk.Ai (i Ask AI) is an advanced free AI search engine that enables users to Ask AI any question, and receive an Instant, Accurate, and Factual Answer without ever storing individual searches.
What is the most intelligent AI to talk to?
The best overall AI chatbot is ChatGPT due to its exceptional performance, versatility, and free availability.
What is the new AI that can answer anything?
There's a new artificial intelligence-powered chatbot known as ChatGPT that can answer questions, generate essays and even write scientific papers from a short prompt.
What is the AI chatbot everyone is using?
ChatGPT, OpenAI's text-generating AI chatbot, has taken the world by storm. It's able to write essays, code and more given short text prompts, hyper-charging productivity. But it also has a more…nefarious side.
What is ChatGPT used for?
ChatGPT is an AI chatbot that uses natural language processing to create humanlike conversational dialogue. The language model can respond to questions and compose various written content, including articles, social media posts, essays, code and emails.
Will ChatGPT give the same answer twice?
If you ask ChatGPT the same question twice, you might not get precisely the same answer; there's an element of randomness in its responses.
What is the most advanced AI right now?
The most advanced AI technology to date is deep learning, a technique where scientists train machines by feeding them different kinds of data. Over time, the machine makes decisions, solves problems, and performs other kinds of tasks on its own based on the data set given to it.
What is the most advanced AI device?
GPT-3 was released in 2020 and is the largest and most powerful AI model to date. It has 175 billion parameters, which is more than ten times larger than its predecessor, GPT-2.
Is there a real AI I can talk to?
SimSimi. SimSimi is a popular emotional conversation chatbot with over 350 million users worldwide. What makes it stand out is that it can talk in around 81 languages. Thanks to SimSimi's great conversation engine, you can talk for hours.
What is the most real looking AI?
Sophia. Sophia is considered the most advanced humanoid robot. Sophia debuted in 2016; she was one of a kind, and her interaction with people was the most unlikely thing you can ever see in a machine.
For the unfamiliar, ChatGPT is an artificial intelligence language model that understands and generates human language.
What can't AI do today?
AI cannot answer questions requiring inference, a nuanced understanding of language, or a broad understanding of multiple topics.
Which AI is better than ChatGPT?
- OpenAI playground.
- Jasper Chat.
- Bard AI.
- LaMDA (Language Model for Dialog Applications)
- Bing AI.
Email affiliate marketing is one of the easiest ways to make money using ChatGPT. The chatbot is good at writing emails in various ways and can persuade the user to click on a link to buy products or subscribe to a service. You can add your affiliate link and that will likely make money for you.
How to use ChatGPT for free?
- Open chat.forefront.ai (visit) and create an account.
- Next, choose the “GPT-4” model from the drop-down menu and select “HelpfulAssistant” as the Persona.
- Now, your ChatGPT 4 bot is ready to use. Type your ChatGPT prompt and wait for a response from the bot.
- Step 1: On the web browser, go to the ChatGPT portion of the OpenAI website.
- Step 2: Choose “Log in” on the page.
- Step 3: Type your email address and password and select the “Log in” option on the page.
Yes, you can train ChatGPT on custom data through fine-tuning. Fine-tuning involves taking a pre-trained language model, such as GPT, and then training it on a specific dataset to improve its performance in a specific domain.
Does ChatGPT have an app?
Is ChatGPT on Android? No, there is no ChatGPT Android-specific service for smartphone users, and there is no ChatGPT Android app from OpenAI. The ChatGPT service is accessible via Android devices, just as it is on desktop or laptop computers – via the OpenAI ChatGPT page.
How much does ChatGPT cost?
ChatGPT Plus is the premium version of ChatGPT, priced at $20 per month, and provides exclusive access to GPT-4, the latest version of OpenAI's disruptive software.
Can you get caught using ChatGPT?
Can you get caught using ChatGPT? Yes, you can get caught using ChatGPT by various methods, such as plagiarism detection tools, stylometric analysis tools, code quality analysis tools, and other AI detectors.
No, ChatGPT does not give the exact same answer and wording to everyone who asks the same question. While it may generate similar responses for identical or similar queries, it can also produce different responses based on the specific context, phrasing, and quality of input provided by each user.
What AI app is everyone using on Facebook?
If you've logged on to any social media app this week, you've probably seen pictures of your friends, but re-imagined as fairy princesses, animé characters, or celestial beings. It is all because of Lensa, an app which uses artificial intelligence to render digital portraits based on photos users submit.
Is there self aware AI?
The final type of AI is self-aware AI. This will be when machines are not only aware of emotions and mental states of others, but also their own. When self-aware AI is achieved we would have AI that has human-level consciousness and equals human intelligence with the same needs, desires and emotions.
What are the 4 types of AI?
- Reactive machines. Reactive machines are AI systems that have no memory and are task specific, meaning that an input always delivers the same output. ...
- Limited memory. The next type of AI in its evolution is limited memory. ...
- Theory of mind. ...
Artificial superintelligence (ASI) is a software-based system with intellectual powers beyond those of humans across a comprehensive range of categories and fields of endeavor. ASI doesn't exist yet and is a hypothetical state of AI.
What is GPT-3?
Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model released in 2020 that uses deep learning to produce human-like text. When given a prompt, it will generate text that continues the prompt.
What is the AI website that knows everything?
Meet ChatGPT: The Artificial Intelligence (AI) Chatbot That Knows Everything.
Is ChatGPT worth the hype?
ChatGPT has advanced beyond the traditional chatbot functionality, and we can now engage in intellectually stimulating conversations. Plus, its ability to generate human-like responses can, and certainly has, revolutionised several industries, including machine learning acceleration.
Can AI hold a conversation?
Conversational AI is a set of technologies that enable computers to understand and process natural language inputs so they can 'talk' with humans. Put simply, they allow us to interact with machines in the same way that we do with other humans.
What AI turns pictures into real people?
Generate Realistic Face Images From Text
Powered by artificial intelligence and deep machine learning, Fotor's AI face generator lets you create realistic human faces from scratch in seconds.
FaceApp is a popular face generator app that uses advanced AI technology to generate realistic faces. The app offers a range of features, including face swapping, aging, and gender swapping.
Does Google have a real AI?
Google Cloud brings generative AI to developers, businesses, and governments. Google Cloud announces generative AI support in Vertex AI and Generative AI App Builder, helping businesses and governments build gen apps.
How many people are using ChatGPT?
How Many ChatGPT Users Are There? According to the latest available data, ChatGPT currently has over 100 million users. And the website currently generates 1 billion visitors per month. This user and traffic growth was achieved in a record-breaking two-month period (from December 2022 to February 2023).
Will AI ever outsmart humans?
The AI can outsmart humans, finding solutions that fulfill a brief but in ways that misalign with the creator's intent. On a simulator, that doesn't matter. But in the real world, the outcomes could be a lot more insidious.
What are experts saying about ChatGPT?
“Our research shows that large language models such as ChatGPT are likely to reinforce inequality, reinforce social fragmentation, remake labor and expertise, accelerate the thirst for data and accelerate environmental injustice, due to the homogeneity of the development landscape, nature of the datasets, and lack of ...
What is the craziest thing AI can do?
Artificial intelligence can even master creative processes, including making visual art, writing poetry, composing music, and taking photographs. Google's AI was even able to create its own AI “child” that outperformed human-made counterparts.
What are two things AI can do that humans can't?
AI can filter email spam, categorize and classify documents based on tags or keywords, launch or defend against missile attacks, and assist in complex medical procedures. However, if people feel that AI is unpredictable and unreliable, collaboration with this technology can be undermined by an inherent distrust of it.
What is the real danger with AI?
The two main types of bias in AI are “data bias” and “societal bias.” Data bias is when the data used to develop and train an AI is incomplete, skewed, or invalid. This can be because the data is incorrect, excludes certain groups, or was collected in bad faith.
What is Google's version of ChatGPT?
Google is opening public access to the conversational computer program Bard, its answer to the viral chatbot ChatGPT, while stopping short of integrating the new tool into its flagship search engine.
How to make money with ChatGPT 2023?
- Content Writing Services. Marketing teams are always looking for writers who can deliver high-quality content fast. ...
- Copywriting. ...
- Write and sell comic books. ...
- Create coloring /painting books. ...
- Become an assistant tutor. ...
- Start a food recipe blog. ...
- Book reviews. ...
- Video scripts.
Using the ChatGPT chatbot itself is fairly simple, as all you have to do is type in your text and receive the information. The key here is to be creative and see how your ChatGPT responds to different prompts. If you don't get the intended result, try tweaking your prompt or giving ChatGPT further instructions.
How do I get the best from ChatGPT?
- Be specific on word count and put higher than you need. ...
- Don't be afraid to ask it to add more. ...
- Understand its limitations. ...
- It's better at creating outlines rather than full pieces of content. ...
- Make sure your request is clear and concise. ...
- You can ask it to reformulate its response.
Yes, ChatGPT is unlimited in use and free to use just as long as you can access it.
Is ChatGPT forever free?
ChatGPT based on GPT-4, the popular artificial intelligence technology, can now be used without any restrictions or costs. Previously, the technology was available for a fee, with many users facing censoring, bans and other limitations.
How to use ChatGPT without a phone number?
Use Bing. Microsoft's search engine now has its own version of ChatGPT that you can use without verifying your phone number. You will need to have a Microsoft account, but unlike signing up through OpenAI, you can get a Microsoft account using a VoIP phone number (like the free ones available through Google Voice).
What is the website where AI talks?
Replika. With over 10 million users, Replika is one of the most popular and advanced AI companions. Unlike traditional chatbots, Replika can recognize images and continue the conversation using them.
What is the AI website that does homework?
Socratic, a revolutionary app powered by Google AI, transforms how students learn and complete homework assignments. With its advanced artificial intelligence technology, Socratic offers step-by-step solutions to problems in various subjects, including math, science, and history.
What website is the AI art generator?
NightCafe Creator is an AI Art Generator app with multiple methods of AI art generation. Using neural style transfer you can turn your photo into a masterpiece. Using text-to-image AI, you can create an artwork from nothing but words on a page. Enter a text prompt, and the generator will make stunning images.
What is the name of the chat AI?
ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI and released in November 2022. It is built on top of OpenAI's GPT-3.5 and GPT-4 foundational large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using both supervised and reinforcement learning techniques.
Where can I chat with OpenAI?
If you already have an account, simply login and use the "Help" button to start a conversation. If you don't have an account or can't login, you can still reach us by selecting the chat bubble icon in the bottom right of help.openai.com.
starryai is an AI art generator app. You simply enter a text prompt and our AI transforms your words into works of art. AI Art generation is usually a laborious process that requires technical expertise, we make that process simple and intuitive. starryai is available for free on iOS and Android.
What app has all homework answers?
- Course Hero: Homework Helper. Education.
- PhotoStudy - Live Study Help. Education.
- Bartleby: Math Homework Helper. Education.
- Chegg Study - Homework Help. Education.
- MathPapa - Algebra Calculator. Education.
- Mathway: Math Problem Solver. Education.
Brainly is the World's Largest Social Learning community and homework App!
What is the fastest AI generator?
Deep Dream Generator is considered one of the fastest AI image generator tools with thousands of artistic styles available.
What app are people using for AI art generator?
Craiyon is a popular AI art generator app, formerly called Dall-e mini, which generates images in seconds based on your input prompts in the text box.
What is the most intelligent website in the world?
Lucid.AI is the world's largest and most complete general knowledge base and common-sense reasoning engine.
Which is best AI website?
- 10Web. 10Web is an AI-powered WordPress platform that features an automated website builder, hosting, and PageSpeed booster. ...
- Landbot. ...
- Beautiful.ai. ...
- Pfpmaker. ...
- Brandmark. ...
- Krisp. ...
- Glasp. ...
Open AI — ChatGPT