Web Page Crawler
Administrators can create a Knowledge Base by crawling content directly from web pages. This method is useful for automatically extracting and updating information from external sources without manual input. The Web Page Crawler configuration allows you to define specific crawl parameters to ensure relevance, accuracy, and crawl efficiency.
Select the Data Source
Before configuring the knowledge base details, you must choose the data source type. This determines how information will be fetched and indexed.
From the Admin Portal, navigate to AI Studio > External KBs > Create a Knowledge Base.
Under Data Source, select Web Page Crawler from the dropdown menu.
This selection enables options specific to web crawling, including crawl scope, filters, and frequency.
Knowledge Base Details
This section captures the metadata for your Knowledge Base to help identify and describe its purpose and scope.
Knowledge Base Name: Enter a descriptive name (max 50 characters).
Description: Optionally, add a summary of the knowledge base's purpose (max 200 characters).
Source URL: Provide one or more URLs from which to crawl data. Click + Add Source URL to include multiple entries.
Ensure URLs are publicly accessible and suitable for automated crawling.
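Before saving, it can help to confirm that each seed URL is reachable and that the site's robots.txt permits automated crawling. Below is a minimal pre-flight check in Python using only the standard library; the user agent string is a hypothetical placeholder for whatever you configure under Crawl Scope, and this is an illustration rather than part of the product.

```python
from urllib import robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Hypothetical user agent string; substitute the one you configure below.
USER_AGENT = "MangoAppsCrawler/1.0"

def check_seed_url(url: str) -> bool:
    """Return True if the URL is reachable and robots.txt allows crawling it."""
    parts = urlparse(url)

    # Ask the site's robots.txt whether our user agent may fetch this path.
    robots = robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return False

    # Confirm the page itself is publicly reachable.
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return response.status == 200

print(check_seed_url("https://mango.mangoapps.com/company/"))
```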
Crawl Scope
The crawl scope defines the boundaries and depth of the website crawling process. Proper configuration helps avoid fetching excessive or irrelevant data.
Website Domain Range: Choose one of the following three modes (all three are illustrated in the sketch after this settings list):
Default (recommended): Limits crawling to web pages that belong to the same host and share the same initial URL path. For example, with a seed URL of 'https://mango.mangoapps.com/company/', only that path and pages extending from it are crawled, e.g. 'https://mango.mangoapps.com/company/reports'. Sibling URLs, such as 'https://mango.mangoapps.com/ec2', are not crawled.
Host Only: Limits crawling to web pages that belong to the same host. For example, with a seed URL of 'https://mango.mangoapps.com/company/', any page under 'https://mango.mangoapps.com' is also crawled, e.g. 'https://mango.mangoapps.com/ec2/'.
Subdomains: Includes any web page that shares the seed URL's primary domain. For example, with a seed URL of 'https://mango.mangoapps.com/company/', any page whose host ends in 'mangoapps.com' is crawled, e.g. 'https://www.mangoapps.com'.
Crawl Rate Limit: Set the maximum crawl rate (e.g., 60 URLs per host per minute) to manage performance and reduce load on the crawled servers; a sample limiter sketch follows the caution note below.
User Agent: Enter a custom user agent string if needed. This identifies the crawler to the web server (max 40 characters).
URL Regex Filter: Click Configure Regex Filters to include or exclude URLs based on regular expressions. Useful for targeting or filtering out specific paths or patterns.
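Conceptually, the domain range mode and the regex filters combine into a single keep-or-skip decision for each discovered URL. The following Python sketch illustrates that logic; the mode names, the naive primary-domain extraction, and the example patterns are all illustrative assumptions, not the product's internals.

```python
import re
from urllib.parse import urlparse

def in_scope(seed: str, candidate: str, mode: str = "default") -> bool:
    """Decide whether a discovered URL falls within the configured domain range."""
    s, c = urlparse(seed), urlparse(candidate)
    if mode == "default":      # same host AND same initial URL path
        return c.netloc == s.netloc and c.path.startswith(s.path)
    if mode == "host_only":    # same host, any path
        return c.netloc == s.netloc
    if mode == "subdomains":   # any host under the seed's primary domain
        # Naive: take the last two labels, e.g. 'mangoapps.com'.
        primary = ".".join(s.netloc.split(".")[-2:])
        return c.netloc == primary or c.netloc.endswith("." + primary)
    raise ValueError(f"unknown mode: {mode}")

def passes_filters(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply URL regex filters: must match an include (if any) and no exclude."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)

seed = "https://mango.mangoapps.com/company/"
url = "https://mango.mangoapps.com/company/reports"
print(in_scope(seed, url, "default"))                                # True
print(in_scope(seed, "https://mango.mangoapps.com/ec2", "default"))  # False
print(passes_filters(url, include=[r"/company/"], exclude=[r"\.pdf$"]))  # True
```

A URL is crawled only when both checks pass, which is why tight regex filters are the main safeguard against runaway crawls.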
Avoid crawling large public sites, like Wikipedia, without appropriate filters, as it can take a very long time and may overload your system.
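The rate limit above can be pictured as a sliding one-minute window per host. Here is a minimal illustrative limiter in Python, assuming the example rate of 60 URLs per host per minute; this sketches the concept, not the crawler's actual implementation.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PerHostRateLimiter:
    """Allow at most max_per_minute fetches per host, sleeping when over budget."""

    def __init__(self, max_per_minute: int = 60):
        self.max_per_minute = max_per_minute
        self.history: dict[str, deque] = defaultdict(deque)  # host -> fetch timestamps

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        stamps = self.history[host]
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while stamps and now - stamps[0] >= 60:
            stamps.popleft()
        if len(stamps) >= self.max_per_minute:
            # Sleep until the oldest fetch leaves the window.
            time.sleep(60 - (now - stamps[0]))
        stamps.append(time.monotonic())

limiter = PerHostRateLimiter(max_per_minute=60)
for url in ["https://mango.mangoapps.com/company/",
            "https://mango.mangoapps.com/company/reports"]:
    limiter.wait(url)   # blocks if this host's budget for the minute is spent
    # fetch(url) would happen here
```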
Crawl Schedule
Crawl scheduling allows you to set how often the system checks the target web pages for updates, keeping your knowledge base current.
Crawl Schedule Toggle: Enable or disable scheduled crawling.
Frequency: Choose how often the system crawls the specified URLs. Default options include Once Every 8 Hours, Once a Day, Once a Week, or you can configure the frequency manually.
Select a frequency that aligns with how often the source content changes.
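To make the frequency options concrete, the sketch below maps each option to a re-crawl interval and computes when the next crawl is due. The option names are hypothetical, and actual scheduling happens server-side.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from frequency options to re-crawl intervals.
FREQUENCIES = {
    "every_8_hours": timedelta(hours=8),
    "once_a_day": timedelta(days=1),
    "once_a_week": timedelta(weeks=1),
}

def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    """Compute when the next crawl is due for a given frequency option."""
    return last_crawl + FREQUENCIES[frequency]

print(next_crawl(datetime(2024, 1, 1, 9, 0), "once_a_day"))  # 2024-01-02 09:00:00
```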
Save and Activate
Once all configurations are complete, click Save to activate the crawler. This applies your settings and starts the crawl process based on the defined parameters.