Is it legal to scrape pages behind a login?

It depends on the site, your relationship to it, and your jurisdiction. The safest defaults are scraping your own accounts, sites you operate, or systems where you hold explicit written permission. Anything else is a counsel conversation. Document your basis before you ship.

How do sessions work in authenticated scraping?

A session is the state a site remembers about you between requests, usually a cookie or a token. For one-shot work, log in and extract in the same call. For multi-page workflows, hold the session in a remote browser so each subsequent call reuses the same cookies and storage.

Can I scrape my own SaaS dashboard for backups?

Yes, and this is one of the most common authenticated-scraping use cases. Teams pull invoices, analytics exports, and configuration snapshots from vendors that don't offer a clean API. You own the account, so you control the access. Check your vendor's terms first.

Does Qcrawl respect robots.txt?

Qcrawl reads robots.txt for unauthenticated public crawls and surfaces the directives to you. For authenticated flows, robots.txt is rarely the controlling document because logged-in pages are not part of the public crawl surface. The controlling document is the site's terms of service and your agreement with them.

What's the difference between per-request login and persistent session?

Per-request login runs the form fill, submit, and extraction inside a single API call. Persistent session keeps a browser warm across many calls so cookies, local storage, and CSRF tokens survive. Pick per-request for stateless jobs and persistent for multi-page workflows.

How do I handle two-factor authentication?

For accounts you own, use an app password or a long-lived session token captured once and reused. For TOTP, generate the code inside your pipeline and pass it as a typed action. SMS-based 2FA is not a good fit for automation and usually signals you should rethink the approach.
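
Generating the TOTP code inside the pipeline can be sketched with the Python standard library alone. This is a minimal illustration of the RFC 6238 algorithm (HMAC-SHA1, 30-second step), not Qcrawl's own implementation; the helper names and the `input[name=otp]` selector are hypothetical, and the action shape simply mirrors the `type` primitive of the actions array.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) from a base32 secret."""
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)  # tolerate missing padding
    key = base64.b32decode(padded)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as defined in RFC 4226
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10**digits).zfill(digits)


def otp_action(secret_b32, selector="input[name=otp]"):
    """Build a typed action carrying the current code (selector is hypothetical)."""
    return {"type": "type", "selector": selector, "text": totp(secret_b32)}
```

Generate the code as late as possible before the request goes out, since each code is only valid for a short window.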

What happens when the login form changes?

Qcrawl's smart scrape mode uses semantic selectors that survive most redesigns. When a site ships a structural change, your job returns a clear error pointing at the missing element so you can update the recipe in one place. We monitor common targets and ship updates for our managed actors.

Can I use Qcrawl to bypass paywalls?

No. Qcrawl is built for legitimate access patterns: your own accounts, partner data, internal tools, and pages where you have authorization. Paywall circumvention isn't supported and isn't a use case we're interested in. There's plenty of honest work in this space.

How do I store credentials safely?

Use your own secret manager and pass credentials per request, not in the URL. Qcrawl accepts credentials in the request body over TLS and does not log them. For long-running sessions, store the session cookie in your vault and inject it into the remote browser at startup.

2026-05-16 • 14 min read

How to scrape pages behind a login in 2026

A practical guide to authenticated scraping in 2026 — form-based logins, session-persistent flows, and the legal and operational guardrails every team needs.

recipes • authenticated-scraping • sessions • browser-automation

The short answer: scrape behind a login by automating the form once, then carrying the session forward

Authenticated scraping in 2026 is a two-shape problem. Either you log in and extract in the same call, or you hold a warm browser session and run many calls against it. Qcrawl covers both shapes with the actions array on /v1/scrape and a remote browser you can drive over the wire.

The work is mostly about choosing the right shape for the job, not fighting the login form. Once you pick correctly, the rest is plumbing.

The problem

Most useful data lives behind a credential. Your billing portal, your ad platform, your CRM, your fulfillment partner, your bank, your analytics vendor. None of them want to write a clean export API for the specific report you need every Monday morning.

So engineering teams end up writing brittle scripts that log in with Puppeteer, click around, copy a CSV out of a modal, and email it to themselves. Those scripts break every time the vendor moves a button. They run on someone's laptop. They fail silently the week the project lead is on vacation.

The modern answer is to treat authenticated extraction the same way you treat any other API call. You describe the steps. The platform runs them in a hardened browser. You get structured data back. When the form changes, you change one line.

A legal note up front, then we'll get to the code

Authenticated scraping is a sensitive area legally. Before deploying anything that scrapes a site you don't own or operate behind a credentialed flow, talk to your counsel. The reasonable default looks like this: only scrape your own accounts, sites where you have explicit permission, or partner systems with a contract that covers automated access.

Respect the site's terms of service. Document your basis. Keep an audit trail of what you accessed and why. None of this is legal advice, and your situation is your own. A five-minute call with your lawyer before you commit a credential to a pipeline is cheaper than the alternative.

With that on the record, the engineering part is fun.

What the top alternatives offer

Browserless has been the developer's favorite hosted Chrome for years, and the team has done excellent work on the WebSocket-driven CDP model. If you already write Puppeteer or Playwright code and want to keep that shape, Browserless gives you a clean remote endpoint that feels almost local. Their documentation on session reuse and queue management is some of the best in the category.

Apify built a marketplace of login actors that handle specific sites end-to-end. Their LinkedIn, Instagram, and ecommerce login actors are well-maintained and the actor-as-a-package model is genuinely clever. If your job maps to an existing actor, Apify saves you the work of writing one. Their proxy integration and dataset storage are mature.

ScrapingBee takes the simple-by-default approach with a single endpoint that accepts a JavaScript scenario for clicks and waits. It's a great fit for small teams who want one HTTP call and a clean JSON response, and their pricing is friendly to startups. The team is responsive and the docs are tight.

ZenRows and Bright Data Web Unlocker are also doing strong work on the protected-content side, and Bright Data's enterprise tooling around authenticated flows is comprehensive. We respect what all of these teams ship.

Where Qcrawl goes further

Qcrawl is built for engineering teams that want one platform across both shapes of authenticated scraping. The actions array on /v1/scrape handles per-request logins with a typed schema. The remote browser at /browser/ handles persistent sessions with a CDP-over-WebSocket endpoint. Same auth, same billing, same audit log.

The platform returns clean structured output by default. You ask for format: "markdown" and you get reading-ready text. You ask for format: "links" and you get a deduplicated array. The same call that logs in is the call that gives you the data, and the data is shaped for downstream use without a post-processing layer.

And because the engine is the same one powering Qcrawl's automation workflows, you can graduate from a one-off recipe to a scheduled job without rewriting anything. The recipe is the workflow.

The step-by-step

Step 1 — Decide which shape you need

Two questions. Does the job finish in one logical sequence (log in, grab data, done)? Or does it span many requests over minutes or hours, with cookies that have to survive between them? The first is per-request. The second is persistent session.

Most reporting and backup jobs are per-request. Anything that involves pagination through hundreds of pages, file uploads, or multi-step form wizards is a persistent session.

Step 2 — Per-request login with the actions array

The actions array on /v1/scrape accepts three primitives: click, type, and wait. That's enough to drive any normal login form. Here's a complete call against a hypothetical vendor portal.

```bash
curl -X POST https://api.qcrawl.com/v1/scrape \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://portal.example-vendor.com/login",
    "actions": [
      { "type": "type", "selector": "input[name=email]", "text": "{{VENDOR_EMAIL}}" },
      { "type": "type", "selector": "input[name=password]", "text": "{{VENDOR_PASSWORD}}" },
      { "type": "click", "selector": "button[type=submit]" },
      { "type": "wait", "ms": 2500 },
      { "type": "click", "selector": "a[href=\"/reports/monthly\"]" },
      { "type": "wait", "ms": 1500 }
    ],
    "format": "markdown",
    "wait_for": "table.report-rows"
  }'
```

The response is a clean JSON envelope with the rendered markdown of the report page, the final URL, the response status, and timing data. Credentials never leave the request body and Qcrawl does not log them.

```json
{
  "ok": true,
  "url": "https://portal.example-vendor.com/reports/monthly",
  "format": "markdown",
  "content": "# Monthly Report — April 2026\n\n| Account | Spend | Conv |\n|---|---|---|\n| Acme Co | $4,210 | 38 |\n...",
  "status": 200,
  "took_ms": 4180
}
```
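
The same call can be assembled from code. The sketch below is a minimal Python version using only the standard library; it assumes the endpoint, field names, and response envelope shown in the examples above, and the function names (`build_login_payload`, `scrape`, `extract_content`) are hypothetical helpers rather than part of any Qcrawl SDK.

```python
import json
import urllib.request

API_URL = "https://api.qcrawl.com/v1/scrape"  # endpoint taken from the curl example


def build_login_payload(login_url, email, password, report_selector):
    """Assemble a per-request login body: type, type, click, wait."""
    return {
        "url": login_url,
        "actions": [
            {"type": "type", "selector": "input[name=email]", "text": email},
            {"type": "type", "selector": "input[name=password]", "text": password},
            {"type": "click", "selector": "button[type=submit]"},
            {"type": "wait", "ms": 2500},
        ],
        "format": "markdown",
        "wait_for": report_selector,
    }


def scrape(payload, api_key, timeout=60):
    """POST the payload and return the decoded JSON envelope."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_content(envelope):
    """Return the markdown body, or raise if the envelope reports a failure."""
    if not envelope.get("ok") or envelope.get("status") != 200:
        raise RuntimeError(f"scrape failed: status={envelope.get('status')}")
    return envelope["content"]
```

Keeping payload construction separate from the network call means the recipe can be unit-tested without touching the vendor portal.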

Step 3 — Persistent session with the remote browser

When you need cookies to survive across many calls, open a session against the remote browser. You get a WebSocket endpoint that speaks CDP, the same protocol Puppeteer and Playwright already speak. Your existing automation code points at it and runs.

```bash
curl -X POST https://api.qcrawl.com/v1/browser/sessions \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "ttl_seconds": 1800,
    "label": "vendor-portal-export"
  }'
```

The response gives you a connection URL and a session identifier. Hand the URL to Puppeteer's connect() or Playwright's connectOverCDP(), run your full multi-page workflow, and close the session when you're done. Cookies, local storage, and IndexedDB persist for the lifetime of the session.

```json
{
  "session_id": "ses_01HXR3...",
  "ws_endpoint": "wss://browser.qcrawl.com/cdp/ses_01HXR3...",
  "expires_at": "2026-05-16T18:42:00Z"
}
```
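
Driving the session from Python might look like the sketch below. It assumes the `ws_endpoint` and `expires_at` fields from the response above and uses Playwright's `connect_over_cdp`, which requires the third-party `playwright` package. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def seconds_remaining(expires_at, now=None):
    """Seconds until the session's expires_at timestamp (ISO 8601 with a 'Z' suffix)."""
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    return (expiry - current).total_seconds()


def fetch_pages(ws_endpoint, urls):
    """Visit each URL in one browser context so cookies carry across pages."""
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    pages = {}
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(ws_endpoint)
        # Reuse the default context if the remote browser exposes one.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        for url in urls:
            page.goto(url)
            pages[url] = page.content()
        browser.close()  # disconnects this client from the remote browser
    return pages
```

Checking `seconds_remaining` before each long step lets a workflow stop cleanly instead of failing when the session TTL runs out.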

Step 4 — Reuse a captured session token instead of replaying the login

If logging in costs a 2FA prompt or triggers a security email, capture the session cookie once by hand and inject it into subsequent calls. Pass it in the cookies array. The login form never runs.

```bash
curl -X POST https://api.qcrawl.com/v1/scrape \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://portal.example-vendor.com/reports/monthly",
    "cookies": [
      { "name": "session", "value": "{{CAPTURED_SESSION_VALUE}}", "domain": ".example-vendor.com" }
    ],
    "format": "markdown",
    "wait_for": "table.report-rows"
  }'
```

This is the operational sweet spot for SaaS dashboard backups. You log in by hand once a month, paste the cookie into your secret manager, and the pipeline runs unattended until the cookie expires.
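
A minimal sketch of the unattended half of that loop: the pipeline reads the captured cookie from its environment at run time and builds the request body. The environment variable name `VENDOR_SESSION_COOKIE` and both function names are assumptions for illustration; in practice the value would come from whatever secret manager you already use.

```python
import os


def build_cookie_payload(report_url, cookie_value, domain=".example-vendor.com"):
    """Build a scrape body that injects a captured session cookie instead of logging in."""
    return {
        "url": report_url,
        "cookies": [{"name": "session", "value": cookie_value, "domain": domain}],
        "format": "markdown",
        "wait_for": "table.report-rows",
    }


def payload_from_env(report_url, env_var="VENDOR_SESSION_COOKIE"):
    """Read the cookie at run time so it never lands in source control."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; capture a fresh session cookie")
    return build_cookie_payload(report_url, value)
```

Failing loudly when the variable is missing turns an expired or unrotated cookie into an alert rather than a silent empty report.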

Step 5 — Run many logins in parallel with batch

For pipelines that hit dozens of vendor portals on a schedule, use /v1/scrape/batch. Pass an array of jobs and Qcrawl fans them out, returns a job ID, and posts results to your webhook.

```bash
curl -X POST https://api.qcrawl.com/v1/scrape/batch \
  -H "Authorization: Bearer osk_xxxxxxxxxxxxxxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "jobs": [
      { "url": "https://portal-a.example.com/login", "actions": [...] },
      { "url": "https://portal-b.example.com/login", "actions": [...] },
      { "url": "https://portal-c.example.com/login", "actions": [...] }
    ],
    "webhook_url": "https://yourapp.com/hooks/qcrawl"
  }'
```
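
When the list of portals lives in configuration, the batch body can be generated rather than written by hand. The sketch below assumes only the `jobs` and `webhook_url` fields shown above; `build_batch` is a hypothetical helper, and the shape of the webhook callback is not covered here because it is not specified in this guide.

```python
def build_batch(recipes, webhook_url):
    """Turn a mapping of login URL -> action list into a batch request body."""
    jobs = [
        {"url": login_url, "actions": list(actions)}  # copy so callers can reuse lists
        for login_url, actions in recipes.items()
    ]
    if not jobs:
        raise ValueError("a batch needs at least one job")
    return {"jobs": jobs, "webhook_url": webhook_url}
```

For example, `build_batch({"https://portal-a.example.com/login": actions_a}, "https://yourapp.com/hooks/qcrawl")` yields a one-job batch ready to serialize as JSON.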

Step 6 — Wire it into a scheduled workflow

Once the recipe works, move it to Qcrawl's automation page where you can schedule it, attach alerts, and version the action steps. Same JSON, now running on cron, with the output landing in your warehouse or your inbox.

A realistic scenario

Take a mid-sized ecommerce operator. Call her Maya. She runs growth at a Series B brand that sells across nine ad platforms, three marketplaces, and two fulfillment partners. Every Monday morning she pulls a unified spend-and-conv report. Half the sources have decent APIs. The other half make her log into a dashboard and click "Export CSV."

Maya's old setup was a folder of Puppeteer scripts that ran on a teammate's laptop. They broke roughly once a month. She moved the whole thing to Qcrawl over a weekend. Each portal got a recipe: a small JSON file with an actions array and a captured session cookie. The recipes run on schedule, write their output to S3, and post to Slack when one fails. Total platform spend is a fraction of the previous loaded engineering cost.

The interesting thing isn't the cost saving. It's that Maya's team stopped thinking about scraping at all. The pipeline is a configuration artifact, not a codebase. When a vendor redesigns their dashboard, one engineer updates one selector in one file.

Pricing math

Per-request logins run in the same price bucket as a normal rendered scrape. Persistent sessions are billed by the minute the browser is warm. Current rates for both modes live on the pricing page.

A team pulling 30 reports a day across 10 vendors lands in the low double-digit dollars per month. A team running thousands of authenticated extractions per day is talking to us about volume pricing.

The pricing across serious vendors in this space is similar because the underlying cost structure is similar. Pick on developer experience and reliability rather than headline rates.

Operational hygiene for authenticated scraping

Three habits separate teams that run this well from teams that don't. First, treat credentials like production secrets. Vault them, rotate them, and pass them to the scraper at request time, not in source control. Second, monitor for selector drift. When a job starts returning empty results or hitting timeouts on a step that used to work, alert immediately and pause the schedule until a human looks at it.

Third, log what you accessed, when, and why. This is the audit trail your future self and your future counsel will thank you for. Qcrawl provides per-request logs out of the box, and you should mirror them into your own observability stack.
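
The first habit, passing credentials at request time, can be sketched as a small resolver that fills `{{NAME}}` placeholders (the style used in the examples in this guide) just before a recipe is sent. This is an illustrative client-side helper, not a documented Qcrawl feature; the default lookup reads environment variables, and any secret-manager client with a `get`-style callable could be swapped in.

```python
import os
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def resolve_secrets(node, lookup=os.environ.get):
    """Return a copy of a recipe with {{NAME}} placeholders replaced via `lookup`."""
    if isinstance(node, str):
        def substitute(match):
            value = lookup(match.group(1))
            if value is None:
                raise KeyError(f"missing secret: {match.group(1)}")
            return value
        return PLACEHOLDER.sub(substitute, node)
    if isinstance(node, list):
        return [resolve_secrets(item, lookup) for item in node]
    if isinstance(node, dict):
        return {key: resolve_secrets(value, lookup) for key, value in node.items()}
    return node  # numbers, booleans, None pass through unchanged
```

The recipe committed to source control keeps its placeholders; only the in-memory copy built at execution time ever holds the real values.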

None of this is heroic engineering. It's the same hygiene you already apply to your production database access. Authenticated scraping deserves the same respect.

When to use a managed actor instead

For common targets, the recipe is already written. Qcrawl's actor catalog includes maintained extractors for several portals where login flow and selector maintenance are handled for you. If your job maps to one of them, use it. You get a stable JSON contract and we eat the maintenance.

If your target is bespoke or internal, write the recipe yourself with the actions array. The cost of the recipe is a few hours of engineering. The cost of maintaining it once it's running is almost nothing because the API surface is small.

Either way, the data lands in the same shape, and you can move from one model to the other without rewriting your consumers.

Cross-references and further reading

For multi-page workflows, see the remote browser documentation. For scheduling, see Qcrawl Automation. For the underlying scrape API, see the scrape page and the full docs. The pricing tiers and overage policy are on the pricing page.

External primer reading: the Wikipedia overview of web scraping covers the history and the major legal cases, and the HTTP cookie specification (RFC 6265) is the canonical reference for how sessions actually work over the wire. Both are worth a read if you're building a serious authenticated pipeline.

For a related recipe, see extracting structured data from articles, which uses the same scrape primitives against unauthenticated pages.

Failure modes and how to design around them

Three failure modes show up over and over again in authenticated pipelines, and each one has a predictable answer. Selector drift is the first. A vendor updates their dashboard and the button that used to live at button.export now lives at button[data-testid="export-csv"]. Your job starts returning empty bodies. The fix is to write recipes with multiple fallback selectors and to alert on empty extractions, not just on HTTP errors.
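
One way to sketch the selector-drift fix on the client side is to try each candidate selector in turn and treat an empty body as a failure. The approach below is an assumption, not a built-in Qcrawl feature: it varies the `wait_for` field per attempt, and `run` stands in for whatever function sends a payload and returns the response envelope.

```python
def is_usable(envelope, min_chars=20):
    """True only when the call succeeded and returned a non-trivial body."""
    content = (envelope.get("content") or "").strip()
    return bool(envelope.get("ok")) and envelope.get("status") == 200 and len(content) >= min_chars


def scrape_with_fallbacks(base_payload, selectors, run):
    """Try each selector in order and return the first usable result."""
    for selector in selectors:
        envelope = run({**base_payload, "wait_for": selector})
        if is_usable(envelope):
            return selector, envelope
    # Every candidate came back empty: surface it so the schedule can be paused.
    raise LookupError("all selectors returned empty extractions")
```

Logging which selector finally matched gives early warning that the primary one has drifted, before the fallbacks run out too.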

Session expiration is the second. A cookie that worked yesterday returns a redirect to /login today. The fix is to detect the redirect and either re-run the login action or rotate to a fresh captured cookie. Qcrawl's response includes the final URL after redirects, so a single string-match on the response handles this cleanly.
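
The redirect check can be as small as the sketch below, which assumes the final URL arrives in the `url` field of the response envelope as in the earlier example. The `/login` path is the one named above; adjust it per vendor.

```python
from urllib.parse import urlparse


def session_expired(envelope, login_path="/login"):
    """True when the final URL after redirects is the login page."""
    final_path = urlparse(envelope.get("url", "")).path.rstrip("/")
    return final_path == login_path or final_path.startswith(login_path + "/")
```

Comparing the parsed path rather than the raw string avoids false positives from query strings such as `?next=/login`.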

Vendor-side rate limiting is the third. The portal you're scraping decides you're moving too quickly and starts returning 429s or showing interstitial pages. The fix is to space your calls, run them off-peak relative to the vendor's customer base, and respect any Retry-After headers the vendor sends. The platform does the right thing by default and surfaces the signal so you can tune your schedule.
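
If you schedule retries yourself, honoring Retry-After means handling both forms the HTTP specification allows: a number of seconds or an HTTP date. The sketch below assumes you have the header value as a string; how that value reaches you from the platform is not specified here, and the default and cap are arbitrary illustrative choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(retry_after, now=None, default=60.0, cap=3600.0):
    """Convert a Retry-After header value into a bounded delay in seconds."""
    if not retry_after:
        return default
    value = retry_after.strip()
    if value.isdigit():
        seconds = float(value)  # delta-seconds form
    else:
        try:
            target = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return default
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        seconds = (target - current).total_seconds()
    return min(max(seconds, 0.0), cap)
```

The cap keeps a malformed or hostile header from stalling a pipeline for days.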

None of these are dramatic. They're the same operational hygiene you'd apply to any third-party API integration, with the only difference being that the contract is implicit instead of documented. Treat the authenticated site as a partner who hasn't published a spec, not as an adversary.

Designing recipes for change

The best authenticated-scraping pipelines we see treat each recipe as a small, versioned artifact. The recipe lives in source control. It carries a description, an owner, an alert recipient, and a list of test cases. When it changes, the change is reviewed.

This sounds heavyweight and it isn't. A recipe is maybe twenty lines of JSON. Putting it in source control adds five minutes the first time and pays off the third time it breaks at midnight and the on-call engineer needs to know what the recipe was supposed to do.
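
A review check for such a recipe file might look like the sketch below. The metadata field names (`description`, `owner`, `alert_recipient`, `test_cases`) are hypothetical spellings of the items listed above, not a documented Qcrawl schema; the action types are the three primitives named in Step 2.

```python
REQUIRED_FIELDS = ("description", "owner", "alert_recipient", "test_cases")
KNOWN_ACTIONS = {"click", "type", "wait"}


def validate_recipe(recipe):
    """Return a list of problems; an empty list means the recipe passes review checks."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not recipe.get(name)]
    for index, action in enumerate(recipe.get("actions") or []):
        if action.get("type") not in KNOWN_ACTIONS:
            problems.append(f"action {index}: unknown type {action.get('type')!r}")
    return problems
```

Running this in CI means a recipe without an owner or an alert recipient never reaches the schedule.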

Qcrawl's automation page renders recipes from your repository directly, so the file you commit is the recipe that runs. There's no separate UI state to keep in sync with what's in git, and onboarding a new engineer to the pipeline is a single repo clone.

How Qcrawl's approach compares operationally

Compared with running your own Puppeteer fleet, the platform absorbs the boring parts: keeping browsers warm, handling crashes, rotating egress, capturing screenshots on failure, and surfacing structured logs. The interesting parts, which are the recipe logic and the downstream pipeline, stay in your codebase where they belong.

Compared with vendor-specific actor marketplaces, the platform stays generic. You can write a recipe for a portal nobody else has heard of, and you can write it as fast as you can describe the click path. When the portal is common enough that a managed actor exists, you switch to the actor without rewriting your consumers.

Compared with no-code scraping tools, the platform stays code-first. The recipe is JSON, the call is HTTP, the response is JSON, and every piece of it can be templated, parameterized, and tested. Engineering teams who've outgrown the point-and-click era find this shape familiar and comfortable.

What an authenticated workflow looks like end to end

A complete pipeline has four pieces. Credential management at the top, recipes in the middle, scheduling and observability around them, and a downstream sink at the bottom. Qcrawl handles the middle two and integrates cleanly with whatever you choose for the other two.

For credential management, most teams use the secret manager they already have: AWS Secrets Manager, Vault, Doppler, or 1Password Connect. Credentials are fetched at recipe-execution time and never persisted in the platform. For sinks, most teams write to their warehouse or to S3, with a transformation layer like dbt or a custom worker turning the markdown body into structured rows.
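
The "custom worker" end of that pipeline can be very small. The sketch below turns the first pipe-delimited table in a markdown body into a list of row dictionaries; it is a simplified parser that ignores escaped pipes and alignment details, intended only to illustrate the transformation step.

```python
def markdown_table_rows(markdown):
    """Parse the first markdown table in the text into a list of dict rows."""
    header, rows = None, []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if header is not None:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|---|
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows
```

Values stay as strings here, so currency and integer parsing remain a decision for the downstream model rather than the scraper.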

The result is a pipeline that looks indistinguishable from any other ETL job in your stack. Scraping isn't a special discipline anymore. It's an input, like any other vendor integration, with its own quirks and its own ops story.

A note on observability

Every authenticated call Qcrawl runs produces a structured log entry with the recipe ID, the source URL, the final URL, the status code, the timing breakdown, and a content hash. These logs are available in your dashboard and via the API, and they ship cleanly into a SIEM if your security team needs a copy.

The reason this matters is that authenticated scraping has compliance implications that unauthenticated scraping doesn't. Knowing exactly which accounts you accessed, when, and what they returned, is part of the audit trail that lets you answer questions from your legal and security teams quickly. Build the observability in early. Retrofitting it later is more painful than it sounds.
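
If you keep your own copy of the trail, a per-call record can be derived from the response envelope. The sketch below is an assumption about what such a mirror record could contain, built from the envelope fields shown earlier (`url`, `status`, `took_ms`, `content`); it is not the format of Qcrawl's own log entries, and `reason` is a hypothetical field for recording why the access happened.

```python
import hashlib
from datetime import datetime, timezone


def audit_record(recipe_id, source_url, envelope, reason, now=None):
    """Build a structured audit entry for one authenticated call."""
    content = envelope.get("content") or ""
    return {
        "recipe_id": recipe_id,
        "source_url": source_url,
        "final_url": envelope.get("url"),
        "status": envelope.get("status"),
        "took_ms": envelope.get("took_ms"),
        # Hash instead of storing the body, so the log holds no report data.
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "reason": reason,
        "accessed_at": (now or datetime.now(timezone.utc)).isoformat(),
    }
```

Emitting one such record per call as a JSON line is usually enough for a SIEM or log pipeline to ingest without extra tooling.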

Closing

Authenticated scraping used to be a craft problem. In 2026 it's a configuration problem, and the right platform makes it look small. Pick the shape that fits your job, write the recipe once, schedule it, and move on with the work your team is actually paid to do.

If you want to try Qcrawl against a portal you already have credentials for, grab a key and run the per-request example from Step 2 against your own account. The first thousand calls are on us, and the docs walk through the rest of the surface in detail.