# Crawl Website Node
Crawl a website starting from a seed URL, following links up to a specified depth, and extract content from each discovered page.
The Crawl Website node performs a breadth-first crawl of a website. Starting from the seed URL, it discovers and visits linked pages, returning their markdown content, metadata, and outgoing links. It supports domain scoping, path filtering, and both HTTP and browser rendering modes.
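The breadth-first strategy can be sketched in Python. This is a simplified model rather than the node's implementation: `get_links` stands in for the real fetch-and-extract step, and the toy link graph replaces live pages:

```python
from collections import deque

def bfs_crawl(seed, get_links, max_depth=2, limit=100):
    """Breadth-first crawl: visit pages level by level from the seed.

    get_links(url) -> list of URLs linked from that page.
    Returns (url, depth) pairs in visit order.
    """
    visited = {seed}
    queue = deque([(seed, 0)])
    pages = []
    while queue and len(pages) < limit:
        url, depth = queue.popleft()
        pages.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond max_depth
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages

# Toy link graph standing in for real pages
graph = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/about": [],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
    "https://example.com/blog/post-1": [],
}
pages = bfs_crawl("https://example.com/", lambda u: graph.get(u, []), max_depth=2)
```

Because the crawl is breadth-first, all depth-1 pages are visited before any depth-2 page, so a low `limit` trims the deepest pages first.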
## Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | ✅ | "" | Seed URL to start crawling from |
| maxDepth | number | ❌ | 2 | Maximum link-hops from the seed URL |
| limit | number | ❌ | 100 | Maximum number of pages to crawl |
| allowSubdomains | boolean | ❌ | false | Whether to follow links to subdomains |
| allowExternalLinks | boolean | ❌ | false | Whether to follow links to external domains |
| includePaths | string[] | ❌ | — | Regex patterns — only crawl URLs whose path matches |
| excludePaths | string[] | ❌ | — | Regex patterns — skip URLs whose path matches |
| mode | string | ❌ | browser | Scraping mode: http (fast) or browser (dynamic pages) |
| instructions | string | ❌ | "" | Additional instructions for crawling (template mode) |
| runMode | string | ✅ | item | Run mode: item or input |
### url

Type: string | Required: ✅

The seed URL where crawling begins. By default, the crawler follows only links within this URL's domain.
### maxDepth
Type: number | Required: ❌ | Default: 2
Controls how many link-hops away from the seed URL the crawler will go. Depth 0 means only the seed page. Depth 1 means the seed page plus pages it links to.
### limit
Type: number | Required: ❌ | Default: 100
Maximum number of pages to crawl. The crawler stops once this limit is reached, even if there are more pages to discover.
### allowSubdomains
Type: boolean | Required: ❌ | Default: false
When enabled, the crawler will follow links to subdomains of the seed URL's domain (e.g., blog.example.com when crawling example.com).
### allowExternalLinks
Type: boolean | Required: ❌ | Default: false
When enabled, the crawler will follow links to entirely different domains.
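The two scoping flags can be modeled with a small helper. This is a sketch of the documented behavior, not the node's code; the exact hostname comparison the node performs is an assumption:

```python
from urllib.parse import urlparse

def in_scope(link, seed, allow_subdomains=False, allow_external=False):
    """Apply the domain-scoping rules to a discovered link."""
    seed_host = urlparse(seed).hostname
    link_host = urlparse(link).hostname
    if link_host == seed_host:
        return True              # same domain: always followed
    if link_host and seed_host and link_host.endswith("." + seed_host):
        return allow_subdomains  # e.g. blog.example.com under example.com
    return allow_external        # anything else is an external domain
```

With both flags off (the defaults), only same-domain links survive; `allowSubdomains` widens the scope to `*.example.com`, and `allowExternalLinks` removes the domain restriction entirely.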
### includePaths / excludePaths
Type: string[] | Required: ❌
Regex patterns for filtering which URLs to crawl. `includePaths` acts as a whitelist — only URLs matching at least one pattern are crawled. `excludePaths` acts as a blacklist — URLs matching any pattern are skipped. When both are set, a URL must match an include pattern and no exclude pattern.
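The filtering rules can be expressed as a predicate on the URL path. A sketch of the documented semantics (the node's exact regex flavor is an assumption; Python's `re` is used here):

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include_paths=None, exclude_paths=None):
    """includePaths/excludePaths semantics: regex-match against the URL path."""
    path = urlparse(url).path
    if include_paths and not any(re.search(p, path) for p in include_paths):
        return False  # whitelist is set, but no pattern matched
    if exclude_paths and any(re.search(p, path) for p in exclude_paths):
        return False  # a blacklist pattern matched
    return True
```

For example, `include_paths=["/blog/"]` admits `/blog/post-1` but rejects `/about`, and adding `exclude_paths=["draft"]` then drops `/blog/draft`.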
### mode

Type: string | Required: ❌ | Default: browser

Controls how pages are fetched:

- `http` — Fast HTTP requests. Best for static pages.
- `browser` — Full browser rendering. Best for JavaScript-rendered pages.
## Examples
### 1. Crawl a Documentation Site
Crawl a docs site up to 3 levels deep:
Parameters:

```json
{
  "runMode": "input",
  "url": "https://docs.example.com/",
  "maxDepth": 3,
  "limit": 50,
  "instructions": "Crawl the documentation site."
}
```
### 2. Crawl Only Blog Posts
Use path filtering to limit crawling to blog pages:
Parameters:

```json
{
  "runMode": "input",
  "url": "https://example.com/",
  "maxDepth": 2,
  "includePaths": ["/blog/"],
  "instructions": "Crawl blog posts only."
}
```
### 3. Crawl Including Subdomains

Parameters:

```json
{
  "runMode": "input",
  "url": "https://example.com/",
  "allowSubdomains": true,
  "limit": 200,
  "instructions": "Crawl the site including all subdomains."
}
```
## Output Structure
The node returns a JSON object with all crawled pages:
```json
{
  "success": true,
  "totalCrawled": 15,
  "pages": [
    {
      "url": "https://example.com/",
      "depth": 0,
      "status": 200,
      "contentType": "text/html",
      "markdown": "# Example\n\nPage content here...",
      "metadata": {
        "title": "Example",
        "description": "An example page",
        "language": "en"
      },
      "links": ["https://example.com/about", "https://example.com/blog"]
    }
  ]
}
```
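Downstream steps can iterate over `pages` directly. A sketch in Python that separates successful and failed pages, using a trimmed sample in the shape documented above (the sample data itself is illustrative):

```python
# A trimmed, illustrative sample of the node's output
result = {
    "success": True,
    "totalCrawled": 2,
    "pages": [
        {"url": "https://example.com/", "depth": 0, "status": 200,
         "markdown": "# Example\n\nPage content here...",
         "metadata": {"title": "Example"}, "links": []},
        {"url": "https://example.com/broken", "depth": 1, "status": 0,
         "markdown": None, "metadata": {}, "links": []},
    ],
}

ok = [p for p in result["pages"] if p["status"] == 200]
failed = [p["url"] for p in result["pages"] if p["status"] == 0]
corpus = "\n\n".join(p["markdown"] for p in ok)  # concatenated markdown of good pages
```

Filtering on `status` first matters because failed pages carry `markdown: null`, which would break any step that assumes string content.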
## Important Notes
- **Credits:** Each crawled page consumes credits. Use `limit` and `maxDepth` to control costs.
- **Mode selection:** Use `http` for speed on static sites. Use `browser` for JavaScript-rendered content.
- **Path filtering:** Combine `includePaths` and `excludePaths` to precisely control which pages are crawled. Patterns are regex-matched against the URL path.
- **Domain scoping:** By default, only same-domain links are followed. Enable `allowSubdomains` or `allowExternalLinks` to widen the scope.
- **Deduplication:** URLs that differ only by query parameters are treated as the same page to avoid redundant crawling.
- **Error handling:** If a single page fails to load, the crawl continues. Failed pages appear in the output with `status: 0` and `markdown: null`.
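One way to model the deduplication rule (a sketch; the node's exact URL normalization may differ, and the fragment handling here is an assumption):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    """Drop the query string and fragment so URL variants map to one page."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

variants = [
    "https://example.com/page?ref=a",
    "https://example.com/page?utm_source=b",
    "https://example.com/page#section",
]
unique = {canonical(u) for u in variants}  # all three collapse to one URL
```

Keeping a set of canonical URLs is what lets the crawler skip `?utm_source=...` duplicates without re-fetching them.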