The fastest way for an AI to scrape a website (and the best way)
Two different questions with two different answers. The fastest scraper is plain HTTP and cheerio, 50 to 200 ms per page. The best scraper does five more things and gets clean structured data the LLM can use directly. Real code, real numbers.
There are two versions of this question and they have different answers.
"Fastest" means raw speed: how few milliseconds can it take to read a page and pull the text out. "Best" means the highest-quality result the LLM can actually use, which is usually a bit slower but much more useful.
Most operators ask the first question. The honest answer is that the fastest way is good enough for 95% of real work, and the "best" way matters only for adversarial sites, single-page applications, and clean structured extraction.
This is the actual answer, with code.
The fastest scraper, in detail
Plain HTTP. No browser. No JavaScript execution. Parse the raw HTML with a lightweight library.
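import { request } from 'undici';
import * as cheerio from 'cheerio';

async function scrapeFast(url: string) {
  const { statusCode, body } = await request(url, {
    headers: { 'user-agent': 'YourBot/1.0 (https://yourbot.com)' },
    bodyTimeout: 5000,
  });
  const html = await body.text();
  // Treat non-2xx responses as failures so error pages are not parsed as content.
  if (statusCode < 200 || statusCode >= 300) {
    throw new Error(`Unexpected status ${statusCode} for ${url}`);
  }
  const $ = cheerio.load(html);
  // Prefer <main>, then <article>, then <body>. A combined selector with
  // .first() returns the first match in document order, which is always <body>.
  const root = ['main', 'article', 'body']
    .map((selector) => $(selector).first())
    .find((el) => el.length > 0);
  return {
    title: $('h1').first().text().trim() || $('title').text().trim(),
    description: $('meta[name="description"]').attr('content') ?? '',
    bodyText: (root?.text() ?? '').replace(/\s+/g, ' ').trim(),
    jsonLd: $('script[type="application/ld+json"]')
      .map((_, el) => $(el).text())
      .get()
      .map((s) => safeJsonParse(s)),
  };
}

// Returns null instead of throwing when a JSON-LD block is malformed.
function safeJsonParse(s: string) {
  try { return JSON.parse(s); } catch { return null; }
}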
Per-page time on a normal blog post: 50 to 200 ms. Most of that is the network round trip. Cheerio's parsing is single-digit milliseconds.
For crawling a whole site, the move is to read sitemap.xml first and process URLs in parallel with a bounded worker pool:
import pLimit from 'p-limit';

// fetchSitemap is assumed to return the sitemap's entries as { loc } objects.
async function crawlSite(base: string) {
  const sitemap = await fetchSitemap(`${base}/sitemap.xml`);
  const urls = sitemap.map((entry) => entry.loc);
  // 8 to 16 concurrent workers is the sweet spot before sites rate-limit.
  const limit = pLimit(12);
  const pages = await Promise.all(
    urls.map((url) => limit(() => scrapeFast(url).catch(() => null))),
  );
  return pages.filter(Boolean);
}
A 200-page site finishes in 8 to 20 seconds with this approach.
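The crawl code calls a `fetchSitemap` helper that is not shown. Here is a minimal sketch of its parsing half, under a hypothetical name `parseSitemap`. It assumes a well-formed sitemap where a regex is enough; a real XML parser is safer for anything unusual. Given a sitemap index file, it would return the child sitemap URLs rather than page URLs.

```typescript
// Hypothetical helper: extracts <loc> values from a sitemap.xml string.
// Assumption: the sitemap is well-formed enough for a regex; use an XML
// parser if you need CDATA handling or strict validation.
interface SitemapEntry {
  loc: string;
}

function decodeXmlEntities(value: string): string {
  return value
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&'); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

function parseSitemap(xml: string): SitemapEntry[] {
  const entries: SitemapEntry[] = [];
  const seen = new Set<string>();
  // Matches plain <loc> only, so namespaced tags like <image:loc> are skipped.
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const raw = match[1];
    if (raw === undefined) continue;
    const loc = decodeXmlEntities(raw);
    // Drop duplicate URLs so each page is fetched once.
    if (!seen.has(loc)) {
      seen.add(loc);
      entries.push({ loc });
    }
  }
  return entries;
}
```

A `fetchSitemap` built on this would download the XML with the same HTTP client as `scrapeFast` and pass the response text to `parseSitemap`.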
Why this works (and what it skips)
Three things you are deliberately skipping for speed:
JavaScript execution. A headless browser (Playwright, Puppeteer) renders the page like a real user, but it costs 1 to 3 seconds of overhead per page and uses 200 to 500 MB of memory. For most sites, the actual content is in the static HTML anyway. Skipping JS is a 20 to 50x speedup.
Image and asset loading. A browser pulls the CSS, the fonts, the images, the analytics scripts. You don't need any of that to read the text. HTTP-only is two orders of magnitude lighter.
Interaction. A browser can click, scroll, hover. A scraper just grabs the markup. Almost no marketing or blog content needs interaction to surface.
For a static site (Next.js, Astro, Hugo, plain HTML, properly server-rendered WordPress), this works. For an SPA where the content is rendered only after JavaScript runs, the static HTML is empty and you have to fall back to Playwright.
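One way to decide when the fallback is needed is to measure how much visible text the static HTML contains. The sketch below is a heuristic, not a definitive test. The function names and the 200-character threshold are assumptions to tune against the sites you crawl.

```typescript
// Hypothetical heuristic: does the static HTML look like an empty SPA shell?
// Counts characters of text left after removing scripts, styles, comments,
// and tags. The <title> text is still counted, which is usually negligible.
function visibleTextLength(html: string): number {
  const text = html
    .replace(/<(script|style|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return text.length;
}

// Assumption: fewer than 200 visible characters means the content is
// rendered client-side. Adjust minTextLength for your own corpus.
function looksLikeSpaShell(html: string, minTextLength = 200): boolean {
  return visibleTextLength(html) < minTextLength;
}
```

A crawler could run the plain-HTTP fetch first and call the browser only when `looksLikeSpaShell` returns true. That keeps the slow path limited to pages that need it.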
When to escalate to a headless browser
Three cases, in order of how often they come up:
SPAs with client-side data fetching. Twitter, LinkedIn, modern React/Vue apps where the static HTML is <div id="root"></div>. The text shows up only after JS runs. Use Playwright.
Sites with Cloudflare or other anti-bot protection. A bare HTTP request gets challenged. Playwright with a real user-agent and reasonable delays often passes.
Pages that require interaction. Click a tab, scroll to load more, expand a section. Rare for marketing content but common for catalogs.
import { chromium } from 'playwright';

async function scrapeWithBrowser(url: string) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });
    const html = await page.content();
    // parseHtml is assumed to apply the same cheerio extraction as scrapeFast.
    return parseHtml(html);
  } finally {
    // Always close the browser, even if navigation times out or throws.
    await browser.close();
  }
}
Per-page time: 1 to 3 seconds. 20 to 50 times slower than plain HTTP. Use sparingly.
The "best" scraper, end to end
This is what we actually run inside Traccion for citation tracking and content analysis. Adds five steps to the fast scraper.
Step 1: Read llms.txt first
If the site has an llms.txt at the root, fetch it first. It is a plain-markdown summary of what the site is about, who the business is, what they charge. Free context that saves you parsing the whole site.
async function fetchLlmsTxt(base: string): Promise<string | null> {
  try {
    const { statusCode, body } = await request(`${base}/llms.txt`);
    if (statusCode !== 200) return null;
    return await body.text();
  } catch {
    return null;
  }
}
About 8% of sites had an llms.txt in early 2026 and the number is climbing. Always check.
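Once the file is fetched, a little parsing turns it into fields you can pass to the LLM separately. The sketch below assumes the common llms.txt layout: an H1 title, a blockquote summary, and markdown link lists. `parseLlmsTxt` and its return shape are hypothetical names. Files that ignore that layout will simply yield null fields.

```typescript
// Hypothetical parser for the common llms.txt layout. Assumption: the file
// has an H1 title, an optional blockquote summary, and markdown links.
interface LlmsTxtLink {
  text: string;
  url: string;
}

interface LlmsTxtSummary {
  title: string | null;
  summary: string | null;
  links: LlmsTxtLink[];
}

function parseLlmsTxt(markdown: string): LlmsTxtSummary {
  let title: string | null = null;
  const summaryLines: string[] = [];
  const links: LlmsTxtLink[] = [];

  for (const rawLine of markdown.split(/\r?\n/)) {
    const line = rawLine.trim();
    // The first "# " heading is the title; "## " headings do not match.
    if (title === null && /^#\s+/.test(line)) {
      title = line.replace(/^#\s+/, '');
      continue;
    }
    // Blockquote lines are joined into a single summary string.
    if (line.startsWith('>')) {
      summaryLines.push(line.replace(/^>\s?/, ''));
      continue;
    }
    // Collect every [text](url) link on the line.
    const linkPattern = /\[([^\]]+)\]\(([^)\s]+)\)/g;
    let match: RegExpExecArray | null;
    while ((match = linkPattern.exec(line)) !== null) {
      const text = match[1];
      const url = match[2];
      if (text !== undefined && url !== undefined) links.push({ text, url });
    }
  }

  const summary = summaryLines.join(' ').trim();
  return { title, summary: summary === '' ? null : summary, links };
}
```

The `links` array can also seed the crawl queue, since a site usually lists its most important pages there.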
Step 2: Read sitemap.xml and respect robots.txt
The polite scraper checks robots.txt and skips disallowed paths. Most sites do not explicitly disallow general crawling, but some do, and ignoring it is both rude and often the trigger for an IP ban.
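import robotsParser from 'robots-parser';

// fetchText is assumed to return the response body as a string.
async function checkRobots(base: string, url: string): Promise<boolean> {
  const robotsTxt = await fetchText(`${base}/robots.txt`);
  const robots = robotsParser(`${base}/robots.txt`, robotsTxt);
  // isAllowed returns undefined for URLs outside this robots.txt's scope;
  // treat that as "not allowed".
  return robots.isAllowed(url, 'YourBot/1.0') ?? false;
}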
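To show what an allow/disallow check involves, here is a simplified, dependency-free sketch. It is an illustration only, not how robots-parser is implemented. It assumes plain prefix matching with no `*` or `$` wildcards, lets the longest matching rule win, and lets Allow win ties. `parseRobotsGroups` and `isPathAllowed` are hypothetical names; in production the library is the better choice.

```typescript
// Simplified robots.txt check for illustration. Assumptions: prefix matching
// only (no wildcards), longest matching rule wins, Allow wins ties.
interface RobotsRule {
  allow: boolean;
  path: string;
}

interface RobotsGroup {
  agents: string[];
  rules: RobotsRule[];
}

function parseRobotsGroups(robotsTxt: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === 'allow' || field === 'disallow') {
      lastWasAgent = false;
      // An empty Disallow means nothing is disallowed, so it adds no rule.
      if (current !== null && value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return groups;
}

function isPathAllowed(robotsTxt: string, path: string, userAgent: string): boolean {
  const groups = parseRobotsGroups(robotsTxt);
  const token = (userAgent.split('/')[0] ?? '').toLowerCase(); // "YourBot/1.0" -> "yourbot"
  // Groups naming this bot take precedence over the "*" group.
  const specific = groups.filter((g) =>
    g.agents.some((a) => a !== '*' && a !== '' && token.indexOf(a) !== -1),
  );
  const selected = specific.length > 0
    ? specific
    : groups.filter((g) => g.agents.indexOf('*') !== -1);
  let best: RobotsRule | null = null;
  for (const group of selected) {
    for (const rule of group.rules) {
      if (path.indexOf(rule.path) !== 0) continue; // not a prefix match
      const longer = best === null || rule.path.length > best.path.length;
      const tieForAllow = best !== null && rule.path.length === best.path.length && rule.allow;
      if (longer || tieForAllow) best = rule;
    }
  }
  // No matching rule means the path is allowed.
  return best === null ? true : best.allow;
}
```

Note that this sketch takes a URL path such as `/blog/post`, while the library call takes the full URL.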
Step 3: Extract JSON-LD before parsing prose
Structured data is the ground truth. JSON-LD on a page tells you what kind of thing the page is, who the business is, what they sell, how much it costs. Extract this first and you often do not need to parse the prose at all.
const jsonLdBlocks = $('script[type="application/ld+json"]')
  .map((_, el) => safeJsonParse($(el).text()))
  .get()
  .filter(Boolean);
const orgSchema = jsonLdBlocks.find((b) => b['@type'] === 'Organization');
const serviceSchema = jsonLdBlocks.find((b) => b['@type'] === 'Service');
A well-structured page hands you the business name, the phone, the address, the prices, the languages, the hours, in one or two JSON blocks. Saves enormous parsing time and avoids prose-extraction errors.
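A direct `@type` comparison misses three shapes that show up in real JSON-LD: a block whose top level is an array, nodes nested under `@graph`, and an `@type` that is itself an array. The sketch below normalizes all three before searching. `flattenJsonLd`, `hasType`, and `findByType` are hypothetical helper names, and the sketch does not expand contexts or resolve `@id` references.

```typescript
// Hypothetical helpers for searching parsed JSON-LD blocks by @type.
type JsonLdNode = Record<string, unknown>;

function isNode(value: unknown): value is JsonLdNode {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Flattens parsed blocks into one list of nodes, unwrapping top-level
// arrays and "@graph" containers. Non-object values (e.g. null) are dropped.
function flattenJsonLd(blocks: unknown[]): JsonLdNode[] {
  const nodes: JsonLdNode[] = [];
  const visit = (value: unknown): void => {
    if (Array.isArray(value)) {
      value.forEach((item) => visit(item));
    } else if (isNode(value)) {
      nodes.push(value);
      const graph = value['@graph'];
      if (Array.isArray(graph)) graph.forEach((item) => visit(item));
    }
  };
  blocks.forEach((block) => visit(block));
  return nodes;
}

// "@type" may be a single string or an array of strings.
function hasType(node: JsonLdNode, type: string): boolean {
  const declared = node['@type'];
  return Array.isArray(declared) ? declared.indexOf(type) !== -1 : declared === type;
}

function findByType(nodes: JsonLdNode[], type: string): JsonLdNode | undefined {
  return nodes.find((node) => hasType(node, type));
}
```

With this in place, `findByType(flattenJsonLd(jsonLdBlocks), 'Organization')` finds the organization node in any of those shapes.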
Step 4: Clean the HTML with readability
Strip the nav, the footer, the ads, the sidebars. What remains is the actual article body. Mozilla's @mozilla/readability library does this well and is what Firefox Reader Mode uses internally.
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

function extractArticle(html: string, url: string) {
  const dom = new JSDOM(html, { url });
  const reader = new Readability(dom.window.document);
  const article = reader.parse();
  return article
    ? { title: article.title, content: article.content, textContent: article.textContent }
    : null;
}
Step 5: Convert HTML to markdown before sending to the LLM
LLMs read markdown more accurately than HTML, and markdown tokenizes to roughly a quarter of the tokens of the same HTML. Both matter when you are feeding scraped content into an LLM for analysis.
import TurndownService from 'turndown';

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
});
const markdown = turndown.turndown(articleHtml);
For a 5000-word article, this changes 30,000 tokens of HTML into 7,500 tokens of markdown. Same information, four times cheaper.
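The arithmetic behind that comparison is easy to check on your own pages. The sketch below is a rough aid under stated assumptions. `estimateTokens` uses a rule of thumb of about 4 characters per token, which varies by model and content, so use the model's real tokenizer when you need exact counts. Both function names are hypothetical.

```typescript
// Rough token estimate. Assumption: about 4 characters per token, a common
// rule of thumb for English text; real tokenizers differ by model.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

interface TokenSavings {
  ratio: number;        // how many times smaller the markdown is
  percentSaved: number; // share of tokens removed, 0 to 100
}

// Compares two token counts, e.g. the HTML and markdown versions of a page.
function tokenSavings(htmlTokens: number, markdownTokens: number): TokenSavings {
  if (htmlTokens <= 0 || markdownTokens <= 0) {
    throw new Error('Token counts must be positive');
  }
  return {
    ratio: htmlTokens / markdownTokens,
    percentSaved: (1 - markdownTokens / htmlTokens) * 100,
  };
}
```

With the figures above, `tokenSavings(30000, 7500)` gives a ratio of 4 and 75 percent saved.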
Putting it all together
// checkRobots, fetchHtml, extractJsonLd, and htmlToMarkdown are assumed to
// wrap the snippets from the steps above.
async function bestScrape(url: string) {
  const u = new URL(url);
  const base = `${u.protocol}//${u.host}`;
  const [robotsAllowed, llmsTxt, html] = await Promise.all([
    checkRobots(base, url),
    fetchLlmsTxt(base),
    fetchHtml(url),
  ]);
  if (!robotsAllowed || !html) return null;
  const $ = cheerio.load(html);
  const jsonLd = extractJsonLd($);
  const article = extractArticle(html, url);
  const markdown = article ? htmlToMarkdown(article.content) : null;
  return {
    url,
    jsonLd,
    title: article?.title ?? $('title').text(),
    markdown,
    siteContext: llmsTxt,
  };
}
This runs in roughly 100 to 400 ms per page for static sites. For SPAs, swap fetchHtml for the Playwright version and expect 1 to 3 seconds.
What real AI engines do
Most production LLM crawlers (GPTBot, ClaudeBot, PerplexityBot) use the fast path. They do not render JavaScript at crawl time. This is the entire reason llms.txt, robots.txt, and JSON-LD matter so much for getting cited.
If your site is JavaScript-only, these crawlers mostly see an empty shell. Where a crawler does fall back to rendering, the rendering is heavily rate-limited. You get crawled less often, less deeply, and your content shows up in AI answers less reliably.
The fix is to make your site easy to scrape: server-side rendering, JSON-LD on every page, a clean sitemap, an llms.txt. We do all of this at Traccion, which is why our own Visibility Score is improving as the crawlers index us.
A note on ethics
The legal floor on scraping public web content is more permissive than people assume. The ethical floor is to not be a jerk.
The actual rules:
- Identify your bot in the user-agent string. Include a contact URL.
- Respect robots.txt. If a site says do not crawl, do not crawl.
- Throttle. 8 to 16 concurrent workers is the sweet spot. Above that you start hurting smaller sites.
- Cache aggressively. Re-fetching the same page every hour is wasteful.
- Skip authentication walls. Public content only.
Build a scraper that you would not mind hitting your own site. That is the standard.
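The caching rule in the list above can be as simple as an in-memory map with an expiry time. The sketch below is one way to do it, assuming a single-process crawler. The `TtlCache` class name and the injectable clock are choices made for this example. A long-running crawler would more likely use a disk or Redis cache and honor HTTP cache headers.

```typescript
// Hypothetical in-memory cache with a time-to-live per entry.
// The clock is injectable so expiry can be tested without waiting.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // evict stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  // Stores a value that expires ttlMs milliseconds from now.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In a crawler, look up the URL with `get` before fetching and call `set` after a successful fetch. With a TTL of a day, a page is re-fetched at most once per day per process.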
If you want a scraper built into your own product (price monitoring, lead generation, content aggregation), we build these as Custom Software starting at $3,450.