2026-05-16 · 13 min read

How to extract structured data from articles in 2026

Pull clean article bodies, JSON-LD, OpenGraph, Twitter Cards, and reading-time metadata from any news or blog page — the modern alternative to building a Readability fork.

recipes · article-extraction · readability · structured-data

The short answer: call /v1/extract/clean for the body and /v1/extract/structured for the metadata, and you're done

Article extraction in 2026 is a solved problem if you let it be one. The two endpoints together return everything a content app, a RAG pipeline, or a research tool needs: a clean body in markdown, the title, the byline, the publish date, reading time, word count, language, the top image, and every piece of structured metadata the publisher embedded. One HTTP call per shape.

The thing you used to build with a Readability fork and a stack of regex is now a typed JSON response. The rest of this page is the tour.

The problem

Article pages are noisy. A typical news page is 80 percent navigation, ads, related-articles rails, comment widgets, newsletter modals, and tracking scripts. The actual article is somewhere in there, wrapped in three nested divs whose class names change every quarter when the publisher's CMS team ships a redesign.

For years the standard answer was to vendor Mozilla's Readability library into your codebase and hope it kept up with the modern web. It mostly did. When it didn't, you wrote a per-publisher override and added it to a growing pile of selectors that someone on your team had to maintain.

Meanwhile, most publishers embed schema.org Article JSON-LD in their pages, largely because Google's search features reward it. The structured data you actually need is usually already on the page. You just have to parse it reliably across thousands of templates, which is its own kind of tedium.

What the top alternatives offer

Mercury Parser, originally from Postlight and now community-maintained, set the bar for hosted article extraction. The original API made content apps possible for a generation of developers, and the open-source successor is still a fine library if you want to run extraction in-process. The Mercury team deserves credit for shaping what a clean extraction response should look like.

Mozilla Readability is the canonical open-source library, and it powers Firefox's Reader View. If your stack is JavaScript and you want zero network hops, Readability is the right call. The maintainers have done years of careful work on edge cases and the code is small enough to read in an afternoon.

Diffbot built a rich extraction product with strong entity-recognition and knowledge-graph layers on top of the basic article shape. For research and intelligence workloads that need entities, sentiment, and topic tagging out of the box, Diffbot is one of the most mature options in the market. Their Article API has been a reference implementation for over a decade.

Firecrawl and Apify are also doing excellent work here. Firecrawl's markdown-first approach is especially well-suited to LLM workflows, and Apify's actor model lets you compose extraction with downstream processing. All of these teams ship quality.

Where Qcrawl goes further

Qcrawl treats article extraction as two complementary calls. /v1/extract/clean handles the Readability-style body and the reading metadata. /v1/extract/structured handles every form of embedded metadata the page exposes: JSON-LD, microdata, OpenGraph, and Twitter Card. You call one, the other, or both, and you compose the results the way your app needs.

Both endpoints return typed JSON with stable field names. No HTML to parse. No optional fields that sometimes show up and sometimes don't. If a field isn't present on the source page, it comes back as null, not missing.

And for teams building LLM pipelines, /v1/actors/markdown wraps the whole thing in a single call that returns a model-ready document. One string in, one string out, ready to embed or summarize.

The step-by-step

Step 1 — Clean extraction for the body

Start with /v1/extract/clean. Pass a URL. Get back a structured object with the cleaned article HTML, the plain-text version, the title, the word count, and reading time in minutes.

```bash
curl -X POST https://api.qcrawl.com/v1/extract/clean \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html",
    "format": "markdown"
  }'
```

The response looks like this. Note the consistent shape: every field is present, the body is stripped of page chrome, and the metadata is flat and typed.

```json
{
  "status": "success",
  "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html",
  "title": "New AI policy update reshapes enterprise procurement",
  "short_title": "New AI policy update reshapes enterprise procurement",
  "content_html": "<p>The announcement on Tuesday marks a turning point...</p>",
  "content_text": "The announcement on Tuesday marks a turning point for how large organizations evaluate AI vendors...",
  "word_count": 1420,
  "reading_time_min": 6,
  "fetch_time_ms": 820
}
```

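The same call from Python, as a minimal sketch using only the standard library. The helper names `build_request` and `post_json` are ours for illustration and are not part of any Qcrawl SDK; the endpoint, headers, and body simply mirror the curl example above.

```python
import json
import urllib.request

API_BASE = "https://api.qcrawl.com"


def build_request(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an endpoint path."""
    return urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_request(
    "/v1/extract/clean",
    "osk_xxxxxxxxxxxxxxxxx",
    {
        "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html",
        "format": "markdown",
    },
)
# article = post_json(request)  # uncomment with a real key to send the call
```

Keeping request construction separate from sending makes the call easy to unit-test without touching the network.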
Step 2 — Structured extraction for the metadata

Run /v1/extract/structured in parallel when you need the publisher's embedded metadata. It returns one array or object per metadata format it parses. Everything is parsed, validated, and ready to use.

```bash
curl -X POST https://api.qcrawl.com/v1/extract/structured \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html"
  }'
```

The response gives you everything the page exposed.

```json
{
  "json_ld": [
    {
      "@context": "https://schema.org",
      "@type": "NewsArticle",
      "headline": "New AI policy update reshapes enterprise procurement",
      "datePublished": "2026-05-15T08:30:00Z",
      "author": { "@type": "Person", "name": "Jordan Reyes" },
      "publisher": { "@type": "Organization", "name": "Example News" }
    }
  ],
  "microdata": [],
  "open_graph": {
    "title": "New AI policy update reshapes enterprise procurement",
    "type": "article",
    "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html",
    "image": "https://www.example-news.com/images/ai-policy-hero.jpg",
    "site_name": "Example News",
    "description": "The announcement on Tuesday marks a turning point..."
  },
  "twitter_card": {
    "card": "summary_large_image",
    "site": "@examplenews",
    "title": "New AI policy update reshapes enterprise procurement",
    "image": "https://www.example-news.com/images/ai-policy-hero.jpg"
  },
  "rdfa": [],
  "meta": {
    "title": "New AI policy update reshapes enterprise procurement",
    "description": "The announcement on Tuesday marks a turning point...",
    "canonical": "https://www.example-news.com/2026/05/15/ai-policy-update.html"
  },
  "products": []
}
```

Both calls together cost roughly the same as one rendered scrape and give you a complete picture of the article in under a second.
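Running the two calls in parallel can be sketched as below. This is one possible shape, not an official client: `fetch_article_record` and the injected `post` callable are hypothetical names, and `fake_post` stands in for a real HTTP call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# `post` is any callable that sends a JSON body to an endpoint path and
# returns the decoded response, e.g. a thin wrapper around your HTTP client.
PostFn = Callable[[str, dict], dict]


def fetch_article_record(url: str, post: PostFn) -> dict:
    """Call both extraction endpoints concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        clean_future = pool.submit(
            post, "/v1/extract/clean", {"url": url, "format": "markdown"}
        )
        structured_future = pool.submit(post, "/v1/extract/structured", {"url": url})
        # .result() re-raises any exception from the worker thread.
        clean = clean_future.result()
        structured = structured_future.result()
    return {"url": url, "clean": clean, "structured": structured}


def fake_post(path: str, payload: dict) -> dict:
    """Offline stand-in that returns trimmed versions of the sample responses."""
    if path == "/v1/extract/clean":
        return {
            "title": "New AI policy update reshapes enterprise procurement",
            "word_count": 1420,
        }
    return {"json_ld": [], "open_graph": {"site_name": "Example News"}}


record = fetch_article_record(
    "https://www.example-news.com/2026/05/15/ai-policy-update.html", fake_post
)
```

Keeping the two responses under separate keys avoids silently overwriting a field such as `title` that both payloads may carry.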

Step 3 — One-call markdown for LLM pipelines

If your downstream is a model, the cleanest path is /v1/actors/markdown. It composes the clean and structured extractors and returns a single markdown document with the title, byline, date, and body laid out for retrieval.

```bash
curl -X POST https://api.qcrawl.com/v1/actors/markdown \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.example-news.com/2026/05/15/ai-policy-update.html"
  }'
```

```json
{
  "ok": true,
  "markdown": "---\ntitle: New AI policy update reshapes enterprise procurement\nauthor: Jordan Reyes\npublished_at: 2026-05-15T08:30:00Z\nsource: https://www.example-news.com/2026/05/15/ai-policy-update.html\n---\n\nThe announcement on Tuesday marks a turning point for how large organizations evaluate AI vendors...",
  "tokens_estimate": 1820,
  "took_ms": 950
}
```

The front matter is YAML so most retrieval frameworks parse it for free. The token estimate is conservative and lets you make routing decisions before you call your model.
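If your framework does not parse front matter for you, a small standard-library splitter is enough for the flat `key: value` layout shown in the example response. This is a sketch under that assumption; `split_front_matter` is our own helper name, and anything nested would need a real YAML parser.

```python
def split_front_matter(document: str) -> tuple[dict, str]:
    """Split a '---' delimited front matter block from the markdown body.

    Handles only flat `key: value` lines, which is all the example
    response contains.
    """
    if not document.startswith("---\n"):
        return {}, document
    header, separator, body = document[4:].partition("\n---\n")
    if not separator:
        return {}, document
    metadata = {}
    for line in header.splitlines():
        # Split on the first colon only: timestamps and URLs contain more.
        key, colon, value = line.partition(":")
        if colon:
            metadata[key.strip()] = value.strip()
    return metadata, body.lstrip("\n")


markdown = (
    "---\n"
    "title: New AI policy update reshapes enterprise procurement\n"
    "author: Jordan Reyes\n"
    "published_at: 2026-05-15T08:30:00Z\n"
    "source: https://www.example-news.com/2026/05/15/ai-policy-update.html\n"
    "---\n\n"
    "The announcement on Tuesday marks a turning point..."
)
metadata, body = split_front_matter(markdown)
```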

Step 4 — Batch a list of articles

For ingestion jobs, pass an array of URLs to /v1/scrape/batch with the extraction shape you want. Qcrawl fans out the work, returns a job ID immediately, and posts each result to your webhook as it finishes.

```bash
curl -X POST https://api.qcrawl.com/v1/scrape/batch \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "extract_clean",
    "urls": [
      "https://www.example-news.com/2026/05/15/ai-policy-update.html",
      "https://www.another-publisher.com/2026/05/14/cloud-pricing.html",
      "https://www.blog-site.com/posts/the-state-of-rag.html"
    ],
    "webhook_url": "https://yourapp.com/hooks/extraction"
  }'
```

The async variant /v1/scrape/async works the same way for single-URL fire-and-forget jobs that you'd rather poll than block on.
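For large ingestion lists it helps to deduplicate and chunk URLs before submitting them. The sketch below builds request bodies in the shape of the batch example above. The `chunk_size` default is a placeholder of ours, since this guide does not state a per-request URL limit, and `build_batch_payloads` is a hypothetical helper name.

```python
def build_batch_payloads(urls, webhook_url, mode="extract_clean", chunk_size=100):
    """Deduplicate URLs (keeping first-seen order) and split them into batch bodies."""
    unique_urls = list(dict.fromkeys(urls))
    return [
        {
            "mode": mode,
            "urls": unique_urls[start:start + chunk_size],
            "webhook_url": webhook_url,
        }
        for start in range(0, len(unique_urls), chunk_size)
    ]


payloads = build_batch_payloads(
    [
        "https://www.example-news.com/2026/05/15/ai-policy-update.html",
        "https://www.another-publisher.com/2026/05/14/cloud-pricing.html",
        "https://www.example-news.com/2026/05/15/ai-policy-update.html",  # duplicate
        "https://www.blog-site.com/posts/the-state-of-rag.html",
    ],
    webhook_url="https://yourapp.com/hooks/extraction",
    chunk_size=2,  # tiny value so the chunking is visible in the example
)
```

Each payload would then be sent as the body of its own POST to /v1/scrape/batch.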

Step 5 — Handle multi-page articles

For long-form pieces that span numbered pages, add paginate: true to the clean extract call. Qcrawl follows the rel-next chain and returns the concatenated body with each page's contribution preserved in order. Word count and reading time reflect the full piece.

```bash
curl -X POST https://api.qcrawl.com/v1/extract/clean \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.longform-site.com/the-essay-part-1.html",
    "paginate": true,
    "format": "markdown"
  }'
```

A realistic scenario

Take a media-monitoring startup. Call the founder Sam. The product surfaces relevant news to enterprise communications teams every morning. Sam's team was ingesting 40,000 articles a day through a homegrown extractor built on top of Readability and a queue of per-publisher overrides. Three engineers spent a third of their time maintaining selectors.

They moved to Qcrawl over a sprint. The ingestion job is now one call to /v1/scrape/batch with mode: "extract_clean", and the per-publisher overrides are gone. The structured extractor picks up JSON-LD and OpenGraph for the metadata sidebar. Total maintenance time on the extractor in the six months since: zero hours.

The interesting part is what those three engineers are doing now. They built the entity-tagging layer their customers had been asking for. The extractor was never the product. It was a chore that pretended to be the product.

Pricing math

A clean extraction call sits in the rendered-scrape price bucket: a few cents at small volumes, well under a cent at scale. Structured extraction is cheaper because it skips the body-cleaning pass. Markdown actor calls are priced as a single composed call.

A media-monitoring pipeline ingesting 50,000 articles a day lands comfortably on the Business plan with room for growth. A team running a single content-research app at a few hundred articles a day fits cleanly on Starter. The pricing page has the live tiers.

Pricing across serious vendors in this space clusters in the same range because the underlying work is similar. Pick on output quality and developer experience rather than headline rates.

Composing extraction with scraping

Sometimes the article lives behind a click or a "load more" button. In that case, use /v1/scrape with an actions array to navigate to the rendered article URL, then pass the resulting HTML to the extraction endpoint. The two endpoints compose naturally because both speak the same content shape.

For sites that publish article lists on an index page, use the crawl endpoint to walk the index, then fan out to extract on each discovered URL. The pattern is the same one used by most large-scale news ingestion pipelines and it's roughly five lines of code on top of Qcrawl.

When the source requires authentication, see the companion recipe on scraping pages behind a login, which covers session capture and persistent browser sessions.

Operational notes

Three things separate good extraction pipelines from fragile ones. First, normalize URLs before you call. Strip tracking parameters, follow obvious shortener domains, and use the canonical URL from the response when you store the record. This avoids ingesting the same article eight times under eight different query strings.
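A minimal sketch of the tracking-parameter part of that normalization, using only `urllib.parse`. The parameter list is illustrative rather than exhaustive, and resolving shortener domains is left out because it needs a network round trip.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative, not exhaustive: extend for the trackers you actually see.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "mc_cid", "mc_eid"}


def normalize_url(url: str) -> str:
    """Drop tracking parameters and the fragment, and lowercase scheme and host."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # The path is kept as-is because paths can be case-sensitive.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


clean_url = normalize_url(
    "https://WWW.Example-News.com/2026/05/15/ai-policy-update.html"
    "?utm_source=newsletter&id=7&fbclid=abc#comments"
)
```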

Second, treat published_at as best-effort. Publishers are inconsistent about updating it on edits, and JSON-LD sometimes lags the visible page. If freshness matters, cross-check the JSON-LD datePublished against the OpenGraph article:published_time and store both.
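The cross-check can be sketched as follows. The function takes the two timestamp strings directly, because the exact key under which the OpenGraph article:published_time value appears in the response is not shown in this guide. The 60-second tolerance and the "naive means UTC" rule are assumptions of ours.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string into an aware UTC datetime, or None."""
    if not value:
        return None
    if value.endswith("Z"):
        # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11.
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def compare_published_dates(json_ld_value, open_graph_value, tolerance_seconds=60):
    """Return both parsed timestamps plus a flag saying whether they agree."""
    json_ld_time = parse_timestamp(json_ld_value)
    open_graph_time = parse_timestamp(open_graph_value)
    agree = (
        json_ld_time is not None
        and open_graph_time is not None
        and abs((json_ld_time - open_graph_time).total_seconds()) <= tolerance_seconds
    )
    return {"json_ld": json_ld_time, "open_graph": open_graph_time, "agree": agree}


result = compare_published_dates("2026-05-15T08:30:00Z", "2026-05-15T10:30:00+02:00")
```

Storing both parsed values, as the return dictionary does, keeps the disagreement visible to downstream consumers instead of hiding it behind a single field.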

Third, store the source HTML hash alongside the extracted record. When a publisher updates an article and your downstream consumers want to know about it, the hash gives you a cheap diff signal. Qcrawl returns a content hash in every response for this purpose.
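The name of the hash field in the response is not shown in this guide, so the sketch below computes a fingerprint locally with `hashlib` instead. The comparison logic is the same if you store the hash Qcrawl returns; `content_fingerprint` and `has_changed` are hypothetical helper names.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Return a stable SHA-256 hex digest of the source HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(stored_fingerprint: str, html: str) -> bool:
    """True when freshly fetched HTML no longer matches the stored fingerprint."""
    return content_fingerprint(html) != stored_fingerprint


original = "<p>The announcement on Tuesday marks a turning point...</p>"
stored = content_fingerprint(original)
edited = original.replace("Tuesday", "Wednesday")
```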

When to use which endpoint

Use /v1/extract/clean when you need the body as readable text or markdown. Use /v1/extract/structured when you need the publisher's metadata. Use both when you need a complete record. Use /v1/actors/markdown when the consumer is a model.

Use /v1/scrape with extraction as a post-processing step only when the article isn't directly accessible at a stable URL. The extraction endpoints fetch the page themselves and you almost never need to scrape and extract separately.

All of these endpoints share the same authentication, the same rate limits, and the same observability surface. Mixing them in a single pipeline is not a decision; it's how the platform is designed to be used.

Cross-references and further reading

For the underlying scrape API, see the scrape page. For scheduled ingestion, see Qcrawl Automation. For full reference docs, the documentation covers every parameter and response field. Pricing is on the pricing page.

External background reading: schema.org Article defines the JSON-LD types this guide assumes, and the Wikipedia overview of web scraping covers the broader landscape. Both are useful primers if you're new to the space.

For a related recipe, see how to scrape pages behind a login, and for an industry deep-dive, web data for LLM training covers the upstream pipeline that ends in extraction calls like the ones above.

The shape of a clean extraction response, in detail

A typed response is only useful if you know what each field means and when to trust it. The title field is the publisher-canonical headline, taken from the JSON-LD headline property when present and falling back to the OpenGraph title and then the HTML title tag. It's the field you display.
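If you ever need to reproduce that fallback order yourself from a structured extraction response, it can be sketched like this. Treating `meta.title` as the HTML title tag is our assumption based on the sample response, and `resolve_title` is a hypothetical helper, not a description of Qcrawl's internals.

```python
from typing import Optional


def resolve_title(structured: dict) -> Optional[str]:
    """Pick a headline: JSON-LD headline, then OpenGraph title, then meta title."""
    for block in structured.get("json_ld") or []:
        if isinstance(block, dict) and block.get("headline"):
            return block["headline"]
    open_graph_title = (structured.get("open_graph") or {}).get("title")
    if open_graph_title:
        return open_graph_title
    # Assumption: `meta.title` carries the HTML <title> text.
    return (structured.get("meta") or {}).get("title")


with_json_ld = {
    "json_ld": [{"@type": "NewsArticle", "headline": "From JSON-LD"}],
    "open_graph": {"title": "From OpenGraph"},
    "meta": {"title": "From the title tag"},
}
without_json_ld = {
    "json_ld": [],
    "open_graph": {"title": None},
    "meta": {"title": "From the title tag"},
}
```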

The body field is the cleaned article text in the format you requested. Markdown is the default and the recommended choice for almost every downstream use. HTML is available when you need to preserve inline structure for a reader app. Plain text strips everything and is occasionally useful for keyword indexing.

The reading_time_min field is calculated from the cleaned body word count, not the raw HTML, which means it accurately reflects time-to-read for the article itself and ignores navigation and ads. The word_count is the input to that calculation and is exposed separately so you can apply your own reading-speed model if 240 words per minute isn't right for your audience.

The language field is an ISO 639-1 two-letter code, useful for routing decisions in multilingual pipelines. The top_image field is the publisher-designated hero image, taken from OpenGraph when present and from the first significant image in the body otherwise. The published_at field is an ISO 8601 timestamp, normalized to UTC.

The shape of a structured extraction response, in detail

The json_ld field is an array because publishers routinely embed multiple JSON-LD blocks per page: one for the article, one for the publisher organization, one for the breadcrumb trail, and sometimes one for a video or audio embed. Each entry is parsed and validated, so you can iterate the array and dispatch on @type without writing your own JSON-LD parser.

The open_graph field is a flat object with the canonical Open Graph properties: title, type, url, image, site_name, description, and the article-specific extensions when present. Missing properties come back as null, not absent, so your code doesn't need defensive presence checks.

The twitter_card field follows the same pattern with the Twitter Card meta tag namespace. The microdata field is an array of typed nodes for the small number of sites that still use HTML microdata instead of JSON-LD. Both are normalized to the same shape so you can treat them uniformly.

One small but useful detail: the structured extractor preserves the order of JSON-LD blocks as they appear in the source HTML, which matters for sites that put their primary article block first and supplementary blocks after. Order-dependent consumers can rely on this.
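Dispatching on @type while relying on source order might look like the sketch below. The set of article types is a small illustrative subset of schema.org, and the helper names are ours. Note that JSON-LD allows @type to be either a string or a list, so both are handled.

```python
from typing import Optional

# Illustrative subset of schema.org article types.
ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


def types_of(block: dict) -> set:
    """Return the block's @type values as a set, whether given as str or list."""
    declared = block.get("@type", [])
    return {declared} if isinstance(declared, str) else set(declared)


def first_article_block(json_ld: list) -> Optional[dict]:
    """Return the first block, in source order, that declares an article type."""
    for block in json_ld:
        if types_of(block) & ARTICLE_TYPES:
            return block
    return None


blocks = [
    {"@type": "Organization", "name": "Example News"},
    {"@type": ["NewsArticle", "CreativeWork"], "headline": "Primary article"},
    {"@type": "BreadcrumbList"},
    {"@type": "Article", "headline": "Supplementary block"},
]
article = first_article_block(blocks)
```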

Composing extraction with downstream tasks

The most common downstream after extraction is summarization or embedding. Both work well with the markdown body and benefit from the typed metadata. A summarization prompt that includes the title, the publisher, and the publish date produces noticeably better summaries than one that just receives the body. A retrieval system that indexes the body but stores the metadata as filter columns gives users much better search than one that flattens everything.
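As an illustration of the summarization case, here is one way to fold the typed metadata into a prompt. The wording of the instruction and the function name `build_summary_prompt` are ours; adapt both to your model.

```python
from typing import Optional


def build_summary_prompt(
    body_markdown: str,
    title: Optional[str] = None,
    publisher: Optional[str] = None,
    published_at: Optional[str] = None,
    max_sentences: int = 3,
) -> str:
    """Prefix the article body with whatever metadata is available."""
    context = [
        f"{label}: {value}"
        for label, value in (
            ("Title", title),
            ("Publisher", publisher),
            ("Published", published_at),
        )
        if value  # null metadata is skipped rather than printed as "None"
    ]
    instruction = f"Summarize the article below in at most {max_sentences} sentences."
    return "\n".join([instruction, *context, "", "Article:", body_markdown])


prompt = build_summary_prompt(
    "The announcement on Tuesday marks a turning point...",
    title="New AI policy update reshapes enterprise procurement",
    publisher=None,
    published_at="2026-05-15T08:30:00Z",
)
```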

The second most common downstream is classification and tagging. Reading time is a surprisingly good prior for content type: very short articles are usually news, medium articles are usually features, and long articles are usually essays or analysis. Language is the trivial first filter for multilingual pipelines. Word count is useful as a quality signal for spam detection.

The third common downstream is republication or syndication, which is where the structured extractor earns its keep. The JSON-LD output gives you exactly what you need to rebuild the article's canonical metadata in your own CMS, including the original author and publisher attribution. This matters for legal compliance and for trust with the source.

Handling weird article pages

Some article pages don't fit the standard mold. Live blogs append new content to the same URL over hours or days. Slideshow articles split each "page" into a click. Q&A interviews use unusual heading structures. The extractor handles most of these by default, and the ones it can't are usually one parameter away.

For live blogs, pass strategy: "live_blog" and the extractor preserves the timestamped entries as a list rather than flattening them into prose. For slideshow articles, pass paginate: true with strategy: "slideshow" and you get the concatenated body with each slide labeled. For Q&A formats, the default extractor preserves the speaker labels because they're encoded in the headings.

The general principle is that the extractor tries to give you back what the publisher meant, not just what the HTML contained. When publishers use semantic markup, that's easy. When they don't, the strategy parameters give you escape hatches for the common patterns.

A short note on language coverage

The clean extractor works on any language with standard left-to-right or right-to-left text, because the body-detection heuristics are structural rather than lexical. We've validated extraction across major European, East Asian, and Middle Eastern languages with consistent quality. Less-represented languages work equally well from a structural perspective; the only caveat is that language detection becomes less confident at the long tail.

Reading time uses the 240-words-per-minute baseline universally, which slightly under-reports for character-based languages like Chinese and Japanese where the equivalent measure is closer to characters per minute. If reading time matters for your product in those languages, divide the character count by 500 as a reasonable approximation.
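Both calculations are simple enough to run yourself from the counts in the response. The sketch below assumes rounding up to whole minutes with a one-minute floor, which is consistent with the sample response earlier in this guide (1,420 words reported as 6 minutes) but is not confirmed as the exact rule.

```python
import math


def reading_time_minutes(word_count: int, words_per_minute: int = 240) -> int:
    """Word-based estimate, rounded up, with a one-minute floor (our assumption)."""
    return max(1, math.ceil(word_count / words_per_minute))


def cjk_reading_time_minutes(character_count: int, characters_per_minute: int = 500) -> int:
    """Character-based estimate for languages such as Chinese and Japanese."""
    return max(1, math.ceil(character_count / characters_per_minute))


english_minutes = reading_time_minutes(1420)
slow_reader_minutes = reading_time_minutes(1420, words_per_minute=180)
chinese_minutes = cjk_reading_time_minutes(3000)
```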

Closing

Article extraction used to be a research project with a long tail of edge cases. In 2026 it's three endpoints and a webhook, with typed responses and sub-second latency. Build the thing your users actually care about on top of it.

If you want to try the calls in this guide against your own URLs, grab a key and run the clean extraction example at the top of this page. The output is typed, the metadata is rich, and the first thousand calls are on us while you evaluate.

Common questions

What's the difference between Readability and Mozilla Readability?
Readability is the historical name for the algorithm that extracts the main body of an article from a noisy web page. Mozilla Readability is the open-source library Firefox uses for Reader View, derived from the original Readability project. Both refer to the same family of techniques. Qcrawl's clean extraction sits in this lineage.

How do I get clean markdown from a news article?
Call /v1/extract/clean with the article URL and pass format set to markdown. You get back the title, the body as markdown, reading time, word count, language, and the top image. The output is ready to feed to a model or render in a reader app without further cleanup.

Does Qcrawl return JSON-LD as structured objects?
Yes. /v1/extract/structured parses every JSON-LD block on the page, validates it as JSON, and returns it as a typed array. OpenGraph and Twitter Card meta tags come back as flat objects with their canonical property names. Microdata is parsed into an array of typed nodes.

What languages does article extraction support?
Clean extraction works on any language with standard left-to-right or right-to-left text. The body extractor doesn't depend on language-specific dictionaries. Language detection runs after extraction and returns an ISO 639-1 code in the response, which is useful for downstream routing.

How accurate is reading time?
Reading time uses a 240-word-per-minute baseline, which is the standard for adult English-language reading on screens. The number is calculated from the cleaned body, not the raw HTML, so it ignores navigation chrome and footers. For other languages the same baseline is used, which slightly under-reports for dense languages.

Can Qcrawl extract paywalled article content?
No, and this is a deliberate choice. Qcrawl is built for legitimate access patterns: open articles, your own content, and pages where you have a license. Paywall circumvention isn't supported. There's plenty of valuable work to do on the open web.

What's the rate limit on extraction endpoints?
Extraction endpoints share the standard scrape rate limits, which scale with your plan tier. The Starter tier supports a small number of concurrent calls; Business and Enterprise tiers run hundreds of concurrent extractions with batch and async modes available for very large jobs.

How does clean extraction handle multi-page articles?
If the article uses standard rel-next pagination or numbered pagination, you pass paginate set to true and Qcrawl follows the chain and returns the concatenated body. For articles using infinite-scroll patterns, use the scrape endpoint with a scroll action and pipe the result through extract.

Can I get LLM-ready output in one call?
Yes. /v1/actors/markdown returns a single markdown document with the title, byline, publish date, and body in a layout designed for retrieval and summarization. It's the right endpoint when you're building a RAG pipeline or a summarization tool and want one clean string per article.
