How to scrape Zillow listings at scale in 2026
The honest guide to extracting Zestimate, price, beds, baths, and lot details from Zillow — what works, what fails, and how proptech teams ship in production.
What scraping Zillow at scale really means in 2026
Scraping Zillow at scale in 2026 means extracting structured property data — address, price, Zestimate, beds, baths, square feet, year built — from public listing pages on a schedule, through a pipeline that survives Zillow's aggressive bot defenses. The practical path for production teams is a managed actor API that returns clean JSON per address, not a homegrown headless-browser farm.
That's the snippet the search engines and the LLMs will quote. The rest of this page is for the proptech founders, real estate analytics leads, and engineering heads who have to actually ship this. If you want a deeper dive on the defense side of the problem, our blog post "Scraping Zillow in 2026: what works, what fails, what to do about it" is the long-form companion.
The problem you are actually solving
You don't want to scrape Zillow. You want the data. Specifically, you want a clean, deduplicated, time-stamped property record per address — for off-market deal-flow scoring, for investor underwriting, for proptech analytics, for a CMA tool, or for an AI agent that answers buyer questions.
The dataset is irreplaceable. Zillow tracks over 110 million properties in the United States. There is no sanctioned API path to that data for new developers. Which means every serious proptech team is scraping, licensing third-party datasets, or partnering with MLS systems for the segments where that's feasible.
The technical question of how to get past Zillow's bot defenses is downstream of your real question, which is how to ship a product against this dataset without losing six months to infrastructure work. This recipe answers the second question. The first one is handled for you.
What the leading alternatives offer
Zillow scraping is a mature category with credible vendors. Your evaluation shortlist probably includes some combination of the following.
Bright Data
Bright Data offers both raw proxy infrastructure and pre-built Zillow datasets. Their residential proxy network is one of the largest available, and their compliance posture clears enterprise procurement gates that smaller vendors don't. For teams that want collection capability and dataset products under one MSA, Bright Data is the obvious enterprise shortlist entry.
Apify
Apify hosts a public actor marketplace with several maintained Zillow scrapers, plus the platform infrastructure for you to write and host your own. If you have an engineer who enjoys writing scrapers in TypeScript and you want a managed runtime for that work, Apify is a flexible and well-documented choice. Their community is large and their support is responsive.
Oxylabs
Oxylabs brings serious proxy infrastructure plus a dedicated real estate scraper API for Zillow and Redfin. Their data quality is solid and their compliance and security certifications are mature. Enterprise legal teams take them seriously, and rightly so. For European teams in particular, their GDPR posture is among the strongest in the category.
Where Qcrawl goes further
The Zillow actor in Qcrawl is built specifically for proptech teams that want to ship a product, not maintain a pipeline. Three concrete outcomes set it apart.
First, direct payload extraction. Zillow embeds the full property data as JSON in a script tag on each property page. Our actor parses that payload directly rather than scraping the rendered DOM. The fields stay stable across Zillow's UI changes — when Zillow ships a redesign, your pipeline does not break.
Second, transparent failure handling. When Zillow's bot defense challenges a request, our actor returns a structured error explaining what happened, rather than silently writing a challenge page into your database. That lets your pipeline make an informed retry decision and keeps your data clean. We absorb the retry on our side when we can route the request through a path with a real chance of success.
Third, predictable per-request pricing. No proxy surcharge, no concurrency tier, no minimum monthly commitment. Pricing scales linearly with what you actually pull. For the proptech team building a CMA tool or a deal-flow scorer, that translates into a budget you can defend to your CFO.
Where Bright Data is the heavyweight, Qcrawl is the developer-velocity option. Where Apify gives you a marketplace, Qcrawl gives you a single actor that handles Zillow correctly out of the box. Where Oxylabs wins on enterprise certifications, Qcrawl wins on time-to-first-record — under five minutes from signup to a real address.
The recipe, step by step
Five steps from zero to a production Zillow pipeline.
Step 1. Get an API key
Sign up at qcrawl.com/pricing, copy your API key, and export it. Keys are prefixed with osk_.
export DATASONAR_KEY="osk_xxxxxxxxxxxx"
Step 2. Pull a single property
Confirm the pipeline with one address before scaling. The Zillow actor accepts a full property URL — the kind that ends in a zpid.
curl -X POST https://api.qcrawl.com/v1/actors/zillow \
  -H "Authorization: Bearer $DATASONAR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.zillow.com/homedetails/123-Main-St-Austin-TX-78701/12345678_zpid/"
  }'

{
  "price": 685000,
  "zestimate": 692400,
  "address": "123 Main St, Austin, TX 78701",
  "city": "Austin",
  "state": "TX",
  "zipcode": "78701",
  "bedrooms": 3,
  "bathrooms": 2,
  "living_area_sqft": 1840,
  "year_built": 1962,
  "raw_payload_extracted": true
}

Core property fields are returned cleanly. The raw_payload_extracted flag signals that the actor parsed the embedded Next.js data payload directly. That is the reliable extraction path, more stable across UI redesigns than DOM scraping. Additional fields like lot size and listing status are available on request for Business and Enterprise customers.
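To make the payload-versus-DOM distinction concrete, here is a minimal sketch of the general technique: read the JSON out of a script tag instead of walking rendered markup. It is not the actor's implementation. The script id `__NEXT_DATA__` is the usual Next.js convention and is an assumption here, as is the sample page in the usage note.

```python
import json
from html.parser import HTMLParser


class ScriptJsonParser(HTMLParser):
    """Collects the text of the first <script> tag that carries a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._capturing = False
        self._chunks = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self.found and dict(attrs).get("id") == self.script_id:
            self._capturing = True

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect all of them.
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            self.found = True

    def payload(self):
        return json.loads("".join(self._chunks)) if self.found else None


def extract_embedded_json(html, script_id="__NEXT_DATA__"):
    """Return the parsed JSON payload, or None when the tag is absent.

    The default script id is an assumption (the Next.js convention).
    """
    parser = ScriptJsonParser(script_id)
    parser.feed(html)
    parser.close()
    return parser.payload()
```

A page without the tag, such as a bot-challenge page, yields None rather than a half-parsed record, which is the property that keeps challenge pages out of your database.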
Step 3. Scale up with batch
For a real pipeline — hundreds or thousands of properties — use the batch endpoint. Up to 100 URLs per call, run in parallel on our side.
curl -X POST https://api.qcrawl.com/v1/scrape/batch \
  -H "Authorization: Bearer $DATASONAR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.zillow.com/homedetails/123-Main-St-Austin-TX-78701/12345678_zpid/",
      "https://www.zillow.com/homedetails/456-Oak-Ave-Austin-TX-78702/23456789_zpid/",
      "https://www.zillow.com/homedetails/789-Pine-Rd-Austin-TX-78703/34567890_zpid/"
    ],
    "format": "json",
    "concurrency": 10
  }'

The scrape/batch endpoint with format: "json" fetches each URL in parallel and returns lean per-URL records (url, title, eval, time_ms, worker). For the structured Zillow fields, fan out per-URL calls to /v1/actors/zillow from your worker pool; the actor parses the embedded Next.js payload server-side. A 50-to-100 concurrent worker pool handles a typical region refresh in minutes.
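The fan-out described above can be sketched with a thread pool from the Python standard library. The request mirrors the Step 2 curl command. The function names, the 60-second timeout, and the choice to collect failures per URL are assumptions for illustration, not part of the Qcrawl API.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ACTOR_ENDPOINT = "https://api.qcrawl.com/v1/actors/zillow"  # from the Step 2 curl


def build_actor_request(property_url, api_key):
    """Build the same POST that the Step 2 curl command sends."""
    body = json.dumps({"url": property_url}).encode("utf-8")
    return urllib.request.Request(
        ACTOR_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch_property(property_url, api_key, timeout=60.0):
    """Send one request and decode the JSON body (the timeout is an assumed value)."""
    request = build_actor_request(property_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def fan_out(urls, fetch, max_workers=50):
    """Run fetch(url) across a thread pool and return {url: record or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the failure next to its URL instead of aborting the run.
                results[url] = exc
    return results
```

In use, you would bind the key once, for example `fan_out(urls, lambda u: fetch_property(u, api_key))`, and requeue any URL whose value is an exception.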
Step 4. Async with webhook delivery for catalog-scale jobs
Once you cross a few thousand properties in a single job, switch to async. Submit one URL per job, receive results via webhook when each completes.
curl -X POST https://api.qcrawl.com/v1/scrape/async \
  -H "Authorization: Bearer $DATASONAR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.zillow.com/homedetails/123-Main-St-Austin-TX-78701/12345678_zpid/",
    "webhook_url": "https://your-app.example.com/hooks/zillow"
  }'

The endpoint returns a job ID immediately. You can poll GET /v1/jobs/{id} if you prefer pull, but the webhook path is the production pattern. Most teams use the polling endpoint only for debugging.
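On the receiving side, it helps to make the webhook handler idempotent so a repeated delivery cannot double-write a record. The sketch below shows only that core logic. The webhook payload shape is not documented on this page, so the `job_id` field name is hypothetical, and the class would sit behind whatever web framework you already run.

```python
import json


class WebhookInbox:
    """Stores each delivery once, keyed by job id, so repeated deliveries are harmless."""

    def __init__(self, id_field="job_id"):  # hypothetical field name
        self.id_field = id_field
        self.deliveries = {}

    def receive(self, raw_body):
        """Process one delivery and return the HTTP status code to answer with."""
        try:
            payload = json.loads(raw_body)
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            return 400
        if not isinstance(payload, dict) or self.id_field not in payload:
            return 400
        # setdefault keeps the first copy; duplicates are acknowledged but ignored.
        self.deliveries.setdefault(payload[self.id_field], payload)
        return 200
```

Answering 200 to a duplicate is deliberate in this sketch: the sender has nothing left to retry, and your table still holds exactly one row per job.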
Step 5. Land it in your warehouse
The last step decides whether your pipeline pays for itself. Land each record with an extracted_at timestamp and the source URL. Keep the raw response in a JSON column alongside the flattened fields so you can re-extract later without re-scraping.
Add a uniqueness constraint on (zpid, extracted_at) and you have a clean time-series for price-change monitoring. Most proptech teams build their first internal tool on top of this table inside a week.
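As one way to picture that table, here is a minimal sketch using SQLite from the Python standard library. The table name, column names, and helper functions are assumptions for illustration; the zpid is read from the property URL, which ends in a zpid as noted in Step 2.

```python
import json
import re
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS zillow_snapshots (
    zpid         TEXT NOT NULL,
    extracted_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
    source_url   TEXT NOT NULL,
    price        INTEGER,
    zestimate    INTEGER,
    raw_json     TEXT NOT NULL,   -- full response, kept for later re-extraction
    UNIQUE (zpid, extracted_at)
);
"""

ZPID_RE = re.compile(r"/(\d+)_zpid")


def zpid_from_url(url):
    """Pull the numeric zpid out of a property URL."""
    match = ZPID_RE.search(url)
    if match is None:
        raise ValueError(f"no zpid in URL: {url}")
    return match.group(1)


def land_record(conn, url, record, extracted_at):
    """Insert one snapshot; returns False if (zpid, extracted_at) already exists."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO zillow_snapshots VALUES (?, ?, ?, ?, ?, ?)",
        (
            zpid_from_url(url),
            extracted_at,
            url,
            record.get("price"),
            record.get("zestimate"),
            json.dumps(record),
        ),
    )
    return cursor.rowcount == 1
```

The same shape carries over to a warehouse: flattened columns for the fields you query, one JSON column for everything else, and the uniqueness constraint doing the deduplication.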
The fields, and what each one tells you
Eight fields drive the majority of proptech decisions. It's worth being explicit about what each one means and where the analytic value lives.
Price is the asking price. It reflects the seller's expectations and the agent's market read. Tracked over time, it tells you about local market momentum: drops indicate softening demand, increases indicate confidence.
Zestimate is Zillow's algorithmic valuation. It is not a price, but it is a useful baseline. The delta between price and Zestimate is often more interesting than either number alone — a listing priced 15 percent above Zestimate in a softening market is a different signal than one priced 5 percent below in a hot market.
Bedrooms and bathrooms are the comparable-sale axis. Almost every CMA model normalizes by bed and bath count. Track both, even when one feels redundant.
Living area is the denominator for price per square foot, the single most useful normalized metric in residential real estate analytics.
Lot size matters for single-family and for any analysis involving redevelopment potential. Tear-down investors care about lot size more than living area.
Year built is the proxy for capital expenditure risk. Pre-1980 homes have different rehab profiles than post-2000 homes. Investor models weight this heavily.
Listing status is the state machine field. FOR_SALE, SOLD, OFF_MARKET, and PENDING each mean different things downstream. Always extract it, always store it, always filter on it.
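Two of the derived metrics mentioned above, price per square foot and the gap between price and Zestimate, take only a few lines to compute. This sketch uses the field names from the Step 2 sample response; the function names are illustrative.

```python
def price_per_sqft(record):
    """Asking price divided by living area, or None when either value is missing."""
    price = record.get("price")
    sqft = record.get("living_area_sqft")
    if not price or not sqft:
        return None
    return price / sqft


def price_vs_zestimate_pct(record):
    """Percent by which price sits above (+) or below (-) the Zestimate."""
    price = record.get("price")
    zestimate = record.get("zestimate")
    if not price or not zestimate:
        return None
    return (price - zestimate) / zestimate * 100.0
```

For the Step 2 sample record (price 685000, Zestimate 692400, 1840 square feet), this gives roughly 372 dollars per square foot and a price about 1.1 percent below the Zestimate.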
A realistic scenario
An off-market deal-flow startup we work with builds an investor-facing scoring tool that flags undervalued single-family homes in three metro markets. Their previous pipeline was a headless browser farm running on rotating residential proxies, maintained by one full-time engineer who spent roughly half his week firefighting.
The team tracks roughly 38,000 active listings refreshed nightly, plus a long tail of off-market addresses pulled on demand when investors request them. Total monthly volume runs around 1.4 million property pulls.
After the switch, the engineer got his week back. The pipeline runs as a nightly async job against the active listing set, with on-demand pulls routed through the synchronous endpoint when investors trigger a lookup in the app. Monthly Qcrawl spend came in meaningfully below the loaded cost of the previous setup once proxy spend, infrastructure, and engineering time were counted. The investor-facing dashboard now shows price changes within hours of Zillow updating them rather than the next day.
The pricing math
Let's run the numbers honestly. A serious in-house Zillow pipeline at 100,000 properties a month carries three significant cost lines: residential proxy spend, browser infrastructure, and engineering attention. Each line is provider- and team-specific, but loaded together the monthly total is rarely small.
That loaded total surprises most teams when they tally everything honestly, and it excludes the calendar months lost to building the pipeline. The same volume on Qcrawl runs at the per-request rates on the pricing page. Most Zillow pipelines below 100k requests a month land cheaper on a managed API than on an equivalent in-house build. Above a million requests a month, the calculus is worth a procurement-grade conversation. See qcrawl.com/pricing for volume rates.
What can go wrong
Even with a managed API, a few failure modes are worth planning for.
Off-market and recently-sold properties sometimes return partial data. The price field may be the last sale price rather than a current asking price; the Zestimate is always current. Tag your records with listing_status and handle each state (FOR_SALE, PENDING, SOLD, and OFF_MARKET) distinctly in downstream logic.
Multi-unit and condo listings sometimes return the building-level record rather than the unit-level record. If your use case requires unit-level fidelity, paste the unit-specific URL rather than the building landing page. The actor honors whichever URL you submit.
Zillow occasionally returns a soft challenge even for well-behaved residential traffic. Our actor absorbs the retry on our side within the timeout window. If a request still fails, the response includes a structured error code your pipeline can act on — typically a transient block that resolves on retry. Treat it like any other transient API failure.
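Treating a failed request like any other transient API failure usually means a bounded retry with exponential backoff on your side. The sketch below shows that pattern. The error-code format is not specified on this page, so `TransientActorError` is a hypothetical exception your client would raise when it decides a structured error is retryable; the attempt count and delays are assumed values.

```python
import time


class TransientActorError(Exception):
    """Hypothetical: raised by your client for a structured error it deems retryable."""


def call_with_retry(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a transient error, wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientActorError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter is a small design choice that lets you test the backoff schedule without waiting in real time.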
For broader context on the legal posture of public-web data collection, the long-running US case law summarized at en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn is the foundational reading. The general direction is favorable for public-page observation, but commercial deployment warrants a conversation with counsel.
Pairing the Zillow actor with the rest of your pipeline
Zillow data alone is powerful. Zillow data combined with other public signals is significantly more so. Most proptech teams pair the Zillow actor with a local business feed from the Google Maps actor for neighborhood context, with the generic scrape endpoint for county tax-assessor pages, and occasionally with competitor pricing monitoring for teams that also operate a brokerage or rental management business.
The pipeline gets more useful the more sources you fold in. Zillow alone tells you a listing. Zillow plus county records plus neighborhood data tells you whether the listing is worth underwriting.
The two-stage pipeline pattern
Most production Zillow pipelines split the work into two stages with different cadences and different cost profiles.
Stage one is discovery. You start with a geographic input — a ZIP code, a city, a county — and produce a list of property URLs to extract. Discovery typically runs on a slower cadence, often weekly, because the universe of listings in a given geography doesn't change dramatically day to day.
Stage two is extraction. You take the URL list from stage one and run each property through the Zillow actor. Extraction runs on whatever cadence your use case demands — nightly for active monitoring, on-demand for investor lookups, one-time for an initial backfill.
Separating the two stages keeps your costs predictable. Discovery is the cheaper operation per call but the higher volume one. Extraction is more expensive per call but lower volume. Treating them as one big pipeline obscures those economics and usually leads to over-polling on discovery.
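One simple way to keep the two cadences apart is to schedule each stage on its own interval. The sketch below assumes weekly discovery and nightly extraction, as in the examples above; the stage names and the function are illustrative, and a real deployment would more likely lean on cron or a workflow scheduler.

```python
from datetime import datetime, timedelta, timezone

CADENCES = {
    "discovery": timedelta(days=7),   # weekly, as suggested above
    "extraction": timedelta(days=1),  # nightly monitoring; adjust per use case
}


def stages_due(last_run, now):
    """Return the stage names whose cadence has elapsed since their last run."""
    due = []
    for stage, cadence in CADENCES.items():
        previous = last_run.get(stage)
        if previous is None or now - previous >= cadence:
            due.append(stage)
    return due
```

Called once per scheduler tick with the stored last-run timestamps, this keeps discovery from silently inheriting the nightly cadence.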
What proptech teams actually do with the data
Three use cases account for roughly 80 percent of the Zillow extraction we see across customers.
Deal-flow scoring. Investors and off-market specialists score every active listing in their target metros against an underwriting model — cap rate, rehab potential, neighborhood signals, days on market. The scoring runs nightly against fresh data and surfaces a ranked list each morning. The data layer is the Zillow actor plus a county tax-assessor scrape plus a neighborhood demographic feed. The decision layer is whatever model the firm has trained.
Comparative market analysis. Agents and small brokerages need to produce a CMA for a seller in under an hour. The traditional path is hours of manual MLS work. With a Zillow data feed plus a few proprietary signals, that becomes a one-click report. The brokerages building these tools are the most pragmatic customers of the Zillow actor — they don't care how the data arrives, they care that it arrives reliably and correctly attributed.
Consumer-facing search tools. Newer proptech entrants build search experiences competing directly with Zillow and Redfin's own UIs, typically focused on a niche — investor-only properties, rent-to-own, specific architectural styles, sustainable homes. The data backing these tools comes from a combination of MLS partnerships where available and Zillow extraction where not.
How to think about data freshness
Freshness is a function of the use case, not the technology. For an investor lookup, the data needs to be fresh at the moment of the lookup — the synchronous endpoint with a sub-second response is the right path. For an overnight scoring model, fresh-as-of-midnight is fine — the async endpoint with webhook delivery handles this with no overhead.
For price-change monitoring on a specific watchlist, hourly polling is the typical cadence. Zillow itself doesn't update prices in real time — listing agents update them on whatever schedule suits them. An hourly poll catches changes within a useful window without generating volume the use case doesn't warrant.
The mistake we see most often is over-polling. Teams set up sub-hourly refresh on the full 110-million-property universe and end up with a massive bill and no meaningful data quality improvement. Pick the cadence the decision actually needs.
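For the watchlist monitoring described above, each poll only needs to be compared with the previous one. This is a minimal sketch of that comparison, assuming you hold two snapshots as dictionaries mapping zpid to price; the function name and event shape are illustrative.

```python
def detect_price_changes(previous, current):
    """Compare two {zpid: price} snapshots and return a list of change events."""
    changes = []
    for zpid, new_price in current.items():
        old_price = previous.get(zpid)
        # Skip properties that are new, unpriced, or unchanged.
        if not old_price or new_price is None or old_price == new_price:
            continue
        changes.append({
            "zpid": zpid,
            "old_price": old_price,
            "new_price": new_price,
            "pct_change": (new_price - old_price) / old_price * 100.0,
        })
    return changes
```

If most polls return an empty list, that is a hint the cadence is tighter than the decision needs.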
What to do next
Pick five addresses you already know well. Sign up, paste the curl from Step 2, and confirm the actor returns clean data for each. Then expand to your full set, wire the async webhook in week two, and have your first proptech tool in front of users inside a month.
If your use case has a wrinkle — multi-family-only coverage, an unusual geography, a need for fields beyond the default set — send us a note. The proptech use case is one of our most common conversations and we've probably seen the version of your problem you're worried about. Read the docs, explore the actor catalog, and ship the tool your investors are waiting for.