# Web Page Crawler

### Overview

Administrators can create a Knowledge Base by crawling content directly from web pages. This method is useful for automatically extracting and updating information from external sources without manual input. The **Web Page Crawler** configuration allows you to define specific crawl parameters to ensure relevance, accuracy, and performance efficiency.

***

### Data Source Selection

Before configuring the knowledge base details, you must choose the data source type. This determines how information will be fetched and indexed.

From the **Admin Portal**, navigate to **AI Studio > External KBs > Create a Knowledge Base**.

<figure><img src="/files/PFsZUaDvdHOglrNWBIDq" alt="" width="563"><figcaption></figcaption></figure>

Under **Data Source**, select **Web Page Crawler** from the dropdown menu.

This selection enables options specific to web crawling, including crawl scope, filters, and frequency.

***

### Configure Knowledge Base

This section captures the metadata for your Knowledge Base to help identify and describe its purpose and scope.

**Knowledge Base Name**: Enter a descriptive name (max 50 characters).

**Description**: Optionally, add a summary of the knowledge base's purpose (max 200 characters).

**Source URL**: Provide one or more URLs from which to crawl data. Click **+ Add Source URL** to include multiple entries.

Ensure URLs are publicly accessible and suitable for automated crawling.
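
A quick way to sanity-check that a URL is suitable for automated crawling is to test it against the site's robots.txt rules. Below is a minimal sketch using Python's standard `urllib.robotparser`; this is not MangoApps' own validation logic, and the rules and URLs shown are purely illustrative:

```python
from urllib import robotparser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a candidate URL against the rules in a robots.txt body."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules: everything is crawlable except /private/.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "MyCrawler", "https://example.com/company/"))   # True
print(allowed_by_robots(rules, "MyCrawler", "https://example.com/private/x"))  # False
```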

***

### Crawl Scope

The crawl scope defines the boundaries and depth of the website crawling process. Proper configuration helps avoid excessive or irrelevant data fetching.

**Website Domain Range**:

* **Default** (recommended): Limits crawling to web pages on the same host whose path begins with the seed URL's path. For example, with a seed URL of '<https://mango.mangoapps.com/company/>', only that path and pages beneath it are crawled, e.g. '<https://mango.mangoapps.com/company/reports>'. Sibling URLs, like '<https://mango.mangoapps.com/ec2>', are not crawled.
* **Host Only**: Limits crawling to web pages on the same host. For example, with a seed URL of '<https://mango.mangoapps.com/company/>', any page under '<https://mango.mangoapps.com>' is also crawled, e.g. '<https://mango.mangoapps.com/ec2/>'.
* **Subdomains**: Includes any web page that shares the seed URL's primary domain. For example, with a seed URL of '<https://mango.mangoapps.com/company/>', any page on 'mangoapps.com' is crawled, e.g. '<https://www.mangoapps.com>'.
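
The three domain-range options above amount to a scope check on each discovered URL. The sketch below is a hedged illustration, not the product's actual implementation; in particular, the subdomain branch is naive about multi-part public suffixes like `.co.uk`:

```python
from urllib.parse import urlparse

def in_crawl_scope(seed: str, candidate: str, mode: str = "default") -> bool:
    """Decide whether a candidate URL falls inside the crawl scope.

    Modes mirror the Website Domain Range options:
      - "default":   same host AND the candidate path extends the seed path
      - "host":      same host, any path
      - "subdomain": same primary domain (e.g. anything under mangoapps.com)
    """
    s, c = urlparse(seed), urlparse(candidate)
    if mode == "default":
        return c.hostname == s.hostname and c.path.startswith(s.path)
    if mode == "host":
        return c.hostname == s.hostname
    if mode == "subdomain":
        # Naive primary-domain check: keep the last two hostname labels.
        primary = ".".join(s.hostname.split(".")[-2:])
        return c.hostname == primary or c.hostname.endswith("." + primary)
    raise ValueError(f"unknown mode: {mode}")

seed = "https://mango.mangoapps.com/company/"
print(in_crawl_scope(seed, "https://mango.mangoapps.com/company/reports"))       # True
print(in_crawl_scope(seed, "https://mango.mangoapps.com/ec2", "default"))        # False
print(in_crawl_scope(seed, "https://www.mangoapps.com", "subdomain"))            # True
```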

**Scope**: Set the crawl rate limit (e.g., 60 URLs per host per minute) to manage performance and reduce server load.

**User Agent**: Enter a custom user agent string if needed. This identifies the crawler to the web server (max 40 characters).

**URL Regex Filter**: Click **Configure Regex Filters** to include or exclude URLs based on regular expressions. Useful for targeting or filtering out specific paths or patterns.

{% hint style="warning" %}
Avoid crawling large public sites, like Wikipedia, without appropriate filters, as it can take a very long time and may overload your system.
{% endhint %}
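
Include/exclude regex filtering of the kind configured above can be illustrated as follows; the patterns shown are hypothetical examples, not product defaults:

```python
import re

def url_passes_filters(url: str, include=None, exclude=None) -> bool:
    """Apply include/exclude regex filters to a URL.

    A URL passes if it matches at least one include pattern (or no include
    patterns are configured) and matches no exclude pattern.
    """
    if include and not any(re.search(p, url) for p in include):
        return False
    return not (exclude and any(re.search(p, url) for p in exclude))

# Example: crawl only /docs/ pages and skip printable views.
include = [r"/docs/"]
exclude = [r"\?print=1"]
print(url_passes_filters("https://example.com/docs/a", include, exclude))  # True
print(url_passes_filters("https://example.com/blog/a", include, exclude))  # False
```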

***

### Crawl Schedule

Crawl scheduling allows you to set how often the system checks for updates on the target web pages, ensuring your knowledge base stays up to date.

#### Settings

* **Crawl Schedule Toggle**: Enable or disable scheduled crawling.
* **Frequency**: Choose how often the system crawls the specified URLs. Default options include Once Every 8 Hours, Once a Day, Once a Week, or you can configure the frequency manually.

Select a frequency that aligns with how often the source content changes.
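
The frequency options above map naturally to fixed intervals. A minimal sketch of computing the next crawl time; the option keys are hypothetical labels, not the product's internal identifiers:

```python
from datetime import datetime, timedelta

# Hypothetical mapping of the frequency options to intervals.
FREQUENCIES = {
    "every_8_hours": timedelta(hours=8),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}

def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    """Compute when the next scheduled crawl should run."""
    return last_crawl + FREQUENCIES[frequency]

print(next_crawl(datetime(2024, 1, 1, 0, 0), "every_8_hours"))  # 2024-01-01 08:00:00
```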

***

Once all configurations are complete, click **Save** to apply the settings and start the crawl process based on the defined parameters.

