What is web scraping?

Web scraping is how a program reads a website. AI engines do it constantly. How well your site can be scraped is the floor on how often it gets cited.

Web scraping is the act of a program (not a human in a browser) fetching a web page, parsing the HTML, and extracting the information from it. Search engines have done this for thirty years. AI engines do it now too.

How AI engines actually scrape the web

Most operators think AI crawlers run a full browser, render JavaScript, click around, and "see" the page the way a human does. By default, that is not what happens. The real flow:

  1. Check robots.txt at the root of the domain to see what they are allowed to crawl.
  2. Check llms.txt, if it exists, for a plain-language summary of what the site is about. This is a newer, proposed convention that is increasingly adopted.
  3. Fetch sitemap.xml to discover the list of URLs.
  4. HTTP GET each URL to retrieve the raw HTML. This is a plain network call, not a browser. JavaScript does not execute.
  5. Parse the HTML with a lightweight parser (cheerio in Node, BeautifulSoup or lxml in Python). Extract JSON-LD structured data first because it is the cleanest ground truth.
  6. Extract the article body using readability heuristics, stripping nav, footer, ads, sidebars.
  7. Index the cleaned text in a search index the LLM can query later when generating answers.
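
Steps 3 and 5 above can be sketched with nothing but the Python standard library. This is a minimal illustration, not how any particular AI crawler is implemented: the function names (sitemap_urls, extract_jsonld) are made up for this example, and a production crawler would more likely use lxml or BeautifulSoup as mentioned in step 5.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Step 3: pull every <loc> URL out of a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


class JsonLdExtractor(HTMLParser):
    """Step 5: collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object found in the raw, unrendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Both functions take text that has already been downloaded, so the same code works whether the HTML came from a plain HTTP GET or from a file on disk. No JavaScript runs at any point, which is the whole idea of step 4.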

For single-page applications (SPAs), where the actual content is rendered only after JavaScript runs, some AI crawlers fall back to a headless browser (Playwright, Puppeteer). This is 20 to 50 times slower than static scraping, so most crawlers limit it heavily. If your marketing copy is JavaScript-only, you are at the back of the line.
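
One rough way to tell whether a page falls into the JavaScript-only bucket is to measure how much readable text the raw HTML contains before any script runs. The sketch below is an assumption about how such a check could look, not a documented crawler behavior: the skipped tag list, the 200-character threshold, and the function names are all hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-body text (an assumed, simplified list).
SKIP_TAGS = {"script", "style", "template", "nav", "footer", "aside"}


class VisibleText(HTMLParser):
    """Collect text that sits outside script, style, and page chrome."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static scraper would see without running JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_rendering(html, min_chars=200):
    """Guess that a page is JavaScript-only if its static text is very short."""
    return len(visible_text(html)) < min_chars
```

Running needs_rendering against your own page source (view-source, not the browser's rendered DOM) gives a quick read on which side of the line your content sits.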

What makes a site easy or hard to scrape

Easy to scrape:

Hard to scrape:

The Traccion site is built specifically to be on the easy side. See our llms.txt and check sitemap.xml as live examples.

Web scraping for non-AEO use cases

Operators use scraping for plenty of legitimate purposes outside AEO:

For any of these, the general rule is to respect robots.txt, throttle your requests, identify your bot in the user-agent string, and skip anything behind authentication. The legal floor on US public-content scraping is more permissive than people assume, but the ethical floor is to not be a jerk to the sites you depend on.
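
Two of those rules, throttling and identifying your bot, fit in a few lines of standard-library Python. Treat this as a sketch under stated assumptions: the ExampleBot user-agent string, the Throttle class, and polite_get are hypothetical names for illustration, and the right delay depends on the site you are fetching from.

```python
import time
import urllib.request

# Hypothetical bot identity; replace with your own name and a contact URL.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url):
    """Attach an honest user-agent so site owners can see who is fetching."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_get(url, throttle, timeout=10.0):
    """Wait out the throttle, then fetch the raw bytes of a public page."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

A typical call would be polite_get(url, Throttle(1.0)) with one shared Throttle per target domain, so requests to the same site are spaced at least a second apart.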

Read more

For the deep technical answer to "fastest way for an AI to scrape a website," with real code, see the full blog post.

Common questions

What is web scraping?
Web scraping is the process of a program reading a web page, extracting structured data from the HTML, and storing or acting on that data. Browsers do something similar when they render a page, but a scraper is automated and runs at scale.
How do AI engines like ChatGPT and Claude scrape the web?
They run automated crawlers (GPTBot, ClaudeBot, PerplexityBot) that fetch HTML over HTTP, parse the static markup, extract JSON-LD structured data, and store the cleaned text in a search index. They do not typically render JavaScript at crawl time. This is why static HTML and JSON-LD matter so much for AEO.
Is web scraping legal?
Scraping public content while respecting robots.txt is generally legal in the US, particularly following the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn. Bypassing authentication, ignoring robots.txt, or scraping personal data can cross legal lines. Crawling for AI training is its own newer legal area still being litigated.
What is the fastest way to scrape a website?
Issue a plain HTTP GET for the raw HTML, skip JavaScript execution, parse with cheerio or lxml, and run 8 to 16 concurrent workers seeded from sitemap.xml. Expect roughly 50 to 200 ms per page, depending on how fast the server responds. See our deep dive at /blog/fastest-way-for-ai-to-scrape-a-website.
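
The worker-pool part of that answer could look like the sketch below. It is an illustration rather than the code from the blog post: scrape_all is a hypothetical helper, and the fetch argument stands in for whatever function performs the actual HTTP GET.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, max_workers=8):
    """Fetch every URL with a fixed-size worker pool.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised, so one bad page does not stop the run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads suit this job because the workers spend nearly all their time waiting on the network, not computing.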
What is the difference between scraping and crawling?
Crawling is the act of discovering URLs (typically by following links and reading sitemap.xml). Scraping is the act of extracting data from a specific URL. A real crawler does both: discover URLs, then scrape each one.
How do I block AI crawlers if I want to?
Add explicit disallow rules in robots.txt for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, CCBot, and similar agents. We recommend the opposite for most local businesses: explicitly allow them, so your business gets cited in AI answers.
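
Whichever direction you choose, it is worth checking that your robots.txt says what you think it says. The sketch below uses Python's built-in urllib.robotparser against an example file that blocks two of the agents named above; the file contents and the is_allowed helper are illustrative assumptions, and swapping Disallow for Allow gives the opposite policy.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that blocks two AI crawlers and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: it only works for crawlers that choose to honor it.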