Crawl Website Node

Crawl a website starting from a seed URL, following links up to a specified depth, and extracting content from each discovered page.

The Crawl Website node performs a breadth-first crawl of a website. Starting from the seed URL, it discovers and visits linked pages, returning their markdown content, metadata, and outgoing links. It supports domain scoping, path filtering, and both HTTP and browser rendering modes.
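The breadth-first traversal described above can be sketched as follows. This is an illustrative sketch, not the node's actual implementation: the fetch step is stubbed with an in-memory link graph, and the function names are invented for the example.

```python
from collections import deque

def crawl_bfs(seed, fetch_links, max_depth=2, limit=100):
    """Breadth-first crawl sketch: visit pages level by level,
    stopping after max_depth link-hops or once `limit` pages are visited."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])  # (url, depth)
    while queue and len(order) < limit:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # don't follow links past the depth budget
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Tiny in-memory site standing in for real HTTP fetches
site = {
    "/": ["/about", "/blog"],
    "/about": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}
pages = crawl_bfs("/", lambda u: site.get(u, []), max_depth=1)
```

With `max_depth=1`, the sketch visits the seed page plus the pages it links to, but does not follow links found on those depth-1 pages, which mirrors the depth semantics described below.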

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | ✅ | "" | Seed URL to start crawling from |
| maxDepth | number | ❌ | 2 | Maximum link-hops from the seed URL |
| limit | number | ❌ | 100 | Maximum number of pages to crawl |
| allowSubdomains | boolean | ❌ | false | Whether to follow links to subdomains |
| allowExternalLinks | boolean | ❌ | false | Whether to follow links to external domains |
| includePaths | string[] | ❌ |  | Regex patterns — only crawl URLs whose path matches |
| excludePaths | string[] | ❌ |  | Regex patterns — skip URLs whose path matches |
| mode | string | ❌ | browser | Scraping mode: http (fast) or browser (dynamic pages) |
| instructions | string | ❌ | "" | Additional instructions for crawling (template mode) |
| runMode | string | ❌ | item | Run mode: item or input |

url

Type: string | Required: ✅

The seed URL where crawling begins. All discovered links are relative to this URL's domain by default.

maxDepth

Type: number | Required: ❌ | Default: 2

Controls how many link-hops away from the seed URL the crawler will go. Depth 0 means only the seed page. Depth 1 means the seed page plus pages it links to.

limit

Type: number | Required: ❌ | Default: 100

Maximum number of pages to crawl. The crawler stops once this limit is reached, even if there are more pages to discover.

allowSubdomains

Type: boolean | Required: ❌ | Default: false

When enabled, the crawler will follow links to subdomains of the seed URL's domain (e.g., blog.example.com when crawling example.com).
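A subdomain check of the kind this option implies can be sketched as below. The helper is hypothetical, shown only to illustrate the scoping rule, and assumes the seed domain is a bare registrable domain like `example.com`.

```python
from urllib.parse import urlparse

def is_same_or_subdomain(url, seed_domain):
    """True if url's host is the seed domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == seed_domain or host.endswith("." + seed_domain)

is_same_or_subdomain("https://blog.example.com/post", "example.com")  # True
is_same_or_subdomain("https://notexample.com/", "example.com")        # False
```

Note the leading dot in the suffix check: it prevents `notexample.com` from being treated as a subdomain of `example.com`.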

allowExternalLinks

Type: boolean | Required: ❌ | Default: false

When enabled, the crawler will follow links to entirely different domains.

includePaths / excludePaths

Type: string[] | Required: ❌

Regex patterns for filtering which URLs to crawl. includePaths acts as a whitelist — only URLs matching at least one pattern are crawled. excludePaths acts as a blacklist — URLs matching any pattern are skipped.
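The include-then-exclude filtering logic can be sketched like this. The helper name and exact matching behavior (regex searched against the URL path) are illustrative assumptions, not the node's internal code.

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include=None, exclude=None):
    """Apply include patterns (whitelist) then exclude patterns (blacklist)
    as regexes searched against the URL's path component."""
    path = urlparse(url).path
    if include and not any(re.search(p, path) for p in include):
        return False  # not matched by any include pattern
    if exclude and any(re.search(p, path) for p in exclude):
        return False  # matched an exclude pattern
    return True

path_allowed("https://example.com/blog/post-1", include=[r"/blog/"])   # True
path_allowed("https://example.com/admin/users", exclude=[r"/admin/"])  # False
```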

mode

Type: string | Required: ❌ | Default: browser

Controls how pages are fetched:

  • http — Fast HTTP requests. Best for static pages.
  • browser — Full browser rendering. Best for JavaScript-rendered pages.

Examples

1. Crawl a Documentation Site

Crawl a docs site up to 3 levels deep:

Parameters:

{
  "runMode": "input",
  "url": "https://docs.example.com/",
  "maxDepth": 3,
  "limit": 50,
  "instructions": "Crawl the documentation site."
}

2. Crawl Only Blog Posts

Use path filtering to limit crawling to blog pages:

Parameters:

{
  "runMode": "input",
  "url": "https://example.com/",
  "maxDepth": 2,
  "includePaths": ["/blog/"],
  "instructions": "Crawl blog posts only."
}

3. Crawl Including Subdomains

Parameters:

{
  "runMode": "input",
  "url": "https://example.com/",
  "allowSubdomains": true,
  "limit": 200,
  "instructions": "Crawl the site including all subdomains."
}

Output Structure

The node returns a JSON object with all crawled pages:

{
  "success": true,
  "totalCrawled": 15,
  "pages": [
    {
      "url": "https://example.com/",
      "depth": 0,
      "status": 200,
      "contentType": "text/html",
      "markdown": "# Example\n\nPage content here...",
      "metadata": {
        "title": "Example",
        "description": "An example page",
        "language": "en"
      },
      "links": ["https://example.com/about", "https://example.com/blog"]
    }
  ]
}

Important Notes

  • Credits: Each crawled page consumes credits. Use limit and maxDepth to control costs.
  • Mode selection: Use http for speed on static sites. Use browser for JavaScript-rendered content.
  • Path filtering: Combine includePaths and excludePaths to precisely control which pages are crawled. Patterns are regex matched against the URL path.
  • Domain scoping: By default, only same-domain links are followed. Enable allowSubdomains or allowExternalLinks to widen the scope.
  • Deduplication: URLs that differ only by query parameters are treated as the same page to avoid redundant crawling.
  • Error handling: If a single page fails to load, the crawl continues. Failed pages appear in the output with status: 0 and markdown: null.
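Two of the behaviors above, query-parameter deduplication and failed-page reporting, can be mirrored when post-processing the node's output. A sketch, where the normalization step is an assumption about how the crawler canonicalizes URLs, shown for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop the query string and fragment so URL variants map to one page."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc, s.path, "", ""))

def successful_pages(output):
    """Keep only pages that actually loaded (failed pages report status 0)."""
    return [p for p in output["pages"] if p["status"] != 0]

output = {
    "success": True,
    "totalCrawled": 2,
    "pages": [
        {"url": "https://example.com/", "status": 200, "markdown": "# Example"},
        {"url": "https://example.com/broken", "status": 0, "markdown": None},
    ],
}
ok = successful_pages(output)
canon = normalize("https://example.com/blog?page=2#top")  # "https://example.com/blog"
```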