Crawl Website Node

Crawl a website starting from a seed URL, following links up to a specified depth, and extracting content from each discovered page.

The Crawl Website node performs a breadth-first crawl of a website. Starting from the seed URL, it discovers and visits linked pages, returning their markdown content, metadata, and outgoing links. It supports domain scoping, path filtering, and both HTTP and browser rendering modes.
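The breadth-first traversal described above can be sketched as follows. This is an illustrative sketch, not the node's actual implementation: the fetch step is stubbed with an in-memory link graph, and the function names are invented for the example.

```python
from collections import deque

def crawl_bfs(seed, fetch_links, max_depth=2, limit=100):
    """Breadth-first crawl sketch: visit pages level by level,
    stopping after max_depth link-hops or once `limit` pages are visited."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])  # (url, depth)
    while queue and len(order) < limit:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # don't follow links past the depth budget
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Tiny in-memory site standing in for real HTTP fetches
site = {
    "/": ["/about", "/blog"],
    "/about": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}
pages = crawl_bfs("/", lambda u: site.get(u, []), max_depth=1)
```

With `max_depth=1`, the sketch visits the seed page plus the pages it links to, but does not follow links found on those depth-1 pages, which mirrors the depth semantics described below.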

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | ✅ | "" | Seed URL to start crawling from |
| maxDepth | number | ❌ | 2 | Maximum link-hops from the seed URL |
| limit | number | ❌ | 100 | Maximum number of pages to crawl |
| allowSubdomains | boolean | ❌ | false | Whether to follow links to subdomains |
| allowExternalLinks | boolean | ❌ | false | Whether to follow links to external domains |
| includePaths | string[] | ❌ |  | Regex patterns — only crawl URLs whose path matches |
| excludePaths | string[] | ❌ |  | Regex patterns — skip URLs whose path matches |
| mode | string | ❌ | browser | Scraping mode: http (fast) or browser (dynamic pages) |
| instructions | string | ❌ | "" | Additional instructions for crawling (template mode) |
| runMode | string | ❌ | item | Run mode: item or input |

url

Type: string | Required: ✅

The seed URL where crawling begins. All discovered links are relative to this URL's domain by default.

maxDepth

Type: number | Required: ❌ | Default: 2

Controls how many link-hops away from the seed URL the crawler will go. Depth 0 means only the seed page. Depth 1 means the seed page plus pages it links to.

limit

Type: number | Required: ❌ | Default: 100

Maximum number of pages to crawl. The crawler stops once this limit is reached, even if there are more pages to discover.

allowSubdomains

Type: boolean | Required: ❌ | Default: false

When enabled, the crawler will follow links to subdomains of the seed URL's domain (e.g., blog.example.com when crawling example.com).
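A subdomain check of the kind this option implies can be sketched as below. The helper is hypothetical, shown only to illustrate the scoping rule, and assumes the seed domain is a bare registrable domain like `example.com`.

```python
from urllib.parse import urlparse

def is_same_or_subdomain(url, seed_domain):
    """True if url's host is the seed domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == seed_domain or host.endswith("." + seed_domain)

is_same_or_subdomain("https://blog.example.com/post", "example.com")  # True
is_same_or_subdomain("https://notexample.com/", "example.com")        # False
```

Note the leading dot in the suffix check: it prevents `notexample.com` from being treated as a subdomain of `example.com`.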

allowExternalLinks

Type: boolean | Required: ❌ | Default: false

When enabled, the crawler will follow links to entirely different domains.

includePaths / excludePaths

Type: string[] | Required: ❌

Regex patterns for filtering which URLs to crawl. includePaths acts as a whitelist — only URLs matching at least one pattern are crawled. excludePaths acts as a blacklist — URLs matching any pattern are skipped.
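The include-then-exclude filtering logic can be sketched like this. The helper name and exact matching behavior (regex searched against the URL path) are illustrative assumptions, not the node's internal code.

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include=None, exclude=None):
    """Apply include patterns (whitelist) then exclude patterns (blacklist)
    as regexes searched against the URL's path component."""
    path = urlparse(url).path
    if include and not any(re.search(p, path) for p in include):
        return False  # not matched by any include pattern
    if exclude and any(re.search(p, path) for p in exclude):
        return False  # matched an exclude pattern
    return True

path_allowed("https://example.com/blog/post-1", include=[r"/blog/"])   # True
path_allowed("https://example.com/admin/users", exclude=[r"/admin/"])  # False
```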

mode

Type: string | Required: ❌ | Default: browser

Controls how pages are fetched:

  • http — Fast HTTP requests. Best for static pages.
  • browser — Full browser rendering. Best for JavaScript-rendered pages.

Examples

1. Crawl a Documentation Site

Crawl a docs site up to 3 levels deep:

Parameters:

{
  "runMode": "input",
  "url": "https://docs.example.com/",
  "maxDepth": 3,
  "limit": 50,
  "instructions": "Crawl the documentation site."
}

2. Crawl Only Blog Posts

Use path filtering to limit crawling to blog pages:

Parameters:

{
  "runMode": "input",
  "url": "https://example.com/",
  "maxDepth": 2,
  "includePaths": ["/blog/"],
  "instructions": "Crawl blog posts only."
}

3. Crawl Including Subdomains

Parameters:

{
  "runMode": "input",
  "url": "https://example.com/",
  "allowSubdomains": true,
  "limit": 200,
  "instructions": "Crawl the site including all subdomains."
}

Output Structure

The node returns a JSON object with all crawled pages:

{
  "success": true,
  "totalCrawled": 15,
  "pages": [
    {
      "url": "https://example.com/",
      "depth": 0,
      "status": 200,
      "contentType": "text/html",
      "markdown": "# Example\n\nPage content here...",
      "metadata": {
        "title": "Example",
        "description": "An example page",
        "language": "en"
      },
      "links": ["https://example.com/about", "https://example.com/blog"]
    }
  ]
}

Important Notes

  • Credits: Each crawled page consumes credits. Use limit and maxDepth to control costs.
  • Mode selection: Use http for speed on static sites. Use browser for JavaScript-rendered content.
  • Path filtering: Combine includePaths and excludePaths to precisely control which pages are crawled. Patterns are regex matched against the URL path.
  • Domain scoping: By default, only same-domain links are followed. Enable allowSubdomains or allowExternalLinks to widen the scope.
  • Deduplication: URLs that differ only by query parameters are treated as the same page to avoid redundant crawling.
  • Error handling: If a single page fails to load, the crawl continues. Failed pages appear in the output with status: 0 and markdown: null.
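Two of the behaviors above, query-parameter deduplication and failed-page reporting, can be mirrored when post-processing the node's output. A sketch, where the normalization step is an assumption about how the crawler canonicalizes URLs, shown for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop the query string and fragment so URL variants map to one page."""
    s = urlsplit(url)
    return urlunsplit((s.scheme, s.netloc, s.path, "", ""))

def successful_pages(output):
    """Keep only pages that actually loaded (failed pages report status 0)."""
    return [p for p in output["pages"] if p["status"] != 0]

output = {
    "success": True,
    "totalCrawled": 2,
    "pages": [
        {"url": "https://example.com/", "status": 200, "markdown": "# Example"},
        {"url": "https://example.com/broken", "status": 0, "markdown": None},
    ],
}
ok = successful_pages(output)
canon = normalize("https://example.com/blog?page=2#top")  # "https://example.com/blog"
```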