Web Page Crawler

Overview

Administrators can create a Knowledge Base by crawling content directly from web pages. This is useful for automatically extracting and updating information from external sources without manual input. The Web Page Crawler configuration lets you define specific crawl parameters to ensure relevance, accuracy, and efficient performance.


Data Source Selection

Before configuring the knowledge base details, you must choose the data source type. This determines how information will be fetched and indexed.

From the Admin Portal, navigate to AI Studio > External KBs > Create a Knowledge Base.

Under Data Source, select Web Page Crawler from the dropdown menu.

This selection enables options specific to web crawling, including crawl scope, filters, and frequency.


Configure Knowledge Base

This section captures the metadata for your Knowledge Base to help identify and describe its purpose and scope.

Knowledge Base Name: Enter a descriptive name (max 50 characters).

Description: Optionally, add a summary of the knowledge base's purpose (max 200 characters).

Source URL: Provide one or more URLs from which to crawl data. Click + Add Source URL to include multiple entries.

Ensure URLs are publicly accessible and suitable for automated crawling.
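
Before adding a source, it can help to confirm that the URL is reachable and that the site's robots.txt permits crawling. The sketch below is a minimal pre-flight check using only the Python standard library; the user agent string is a hypothetical placeholder, not the crawler's actual identifier.

```python
# Minimal pre-flight check: is the URL publicly reachable, and does
# robots.txt allow it? Standard library only; illustrative, not the
# product's actual validation logic.
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "MangoAppsKB-Crawler/1.0"  # hypothetical placeholder

def check_source_url(url: str) -> bool:
    parsed = urlparse(url)
    # Check robots.txt permission for our user agent.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            print(f"Blocked by robots.txt: {url}")
            return False
    except OSError:
        pass  # robots.txt unreachable; skip the check
    # Confirm the page itself responds successfully.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError as err:
        print(f"Not reachable: {url} ({err})")
        return False

print(check_source_url("https://mango.mangoapps.com/company/"))
```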


Crawl Scope

The crawl scope defines the boundaries and depth of the website crawling process. Proper configuration helps avoid excessive or irrelevant data fetching.

Website Domain Range (a simplified sketch of these options follows this list):

  • Default (recommended): Limits crawling to pages on the same host that share the seed URL's initial path. For example, with a seed URL of 'https://mango.mangoapps.com/company/', only that path and pages beneath it are crawled, e.g. 'https://mango.mangoapps.com/company/reports'. Sibling URLs, like 'https://mango.mangoapps.com/ec2', are not crawled.

  • Host Only: Limits crawling to pages on the same host, regardless of path. For example, with a seed URL of 'https://mango.mangoapps.com/company/', any page under 'https://mango.mangoapps.com' is also crawled, e.g. 'https://mango.mangoapps.com/ec2/'.

  • Subdomains: Crawls any page that shares the seed URL's primary domain. For example, with a seed URL of 'https://mango.mangoapps.com/company/', any page whose host ends in 'mangoapps.com' is crawled, e.g. 'https://www.mangoapps.com'.
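
The sketch below models how each option classifies a candidate URL against the seed URL. It is an illustration of the rules described above, not the product's actual implementation; the naive primary-domain extraction, for instance, would mishandle suffixes like 'co.uk'.

```python
# Simplified model of the three Website Domain Range options.
# Illustrative only; the crawler's real matching rules may differ.
from urllib.parse import urlparse

def in_scope(seed: str, candidate: str, scope: str = "default") -> bool:
    s, c = urlparse(seed), urlparse(candidate)
    if scope == "default":
        # Same host, and the candidate path extends the seed's path.
        return c.netloc == s.netloc and c.path.startswith(s.path)
    if scope == "host_only":
        # Same host, any path.
        return c.netloc == s.netloc
    if scope == "subdomains":
        # Same primary domain (naive: last two labels of the host name).
        primary = ".".join(s.netloc.split(".")[-2:])
        return c.netloc == primary or c.netloc.endswith("." + primary)
    raise ValueError(f"unknown scope: {scope}")

seed = "https://mango.mangoapps.com/company/"
print(in_scope(seed, "https://mango.mangoapps.com/company/reports"))    # True
print(in_scope(seed, "https://mango.mangoapps.com/ec2/"))               # False
print(in_scope(seed, "https://mango.mangoapps.com/ec2/", "host_only"))  # True
print(in_scope(seed, "https://www.mangoapps.com", "subdomains"))        # True
```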

Scope: Set the crawl rate limit (e.g., 60 URLs per host per minute) to manage performance and reduce server load.

User Agent: If needed, enter a custom user agent string (max 40 characters); it identifies the crawler to the web server.

URL Regex Filter: Click Configure Regex Filters to include or exclude URLs based on regular expressions. This is useful for targeting or excluding specific paths or patterns, as shown below.
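
For example, a hypothetical filter set might keep only one documentation section while excluding login and search pages. The patterns below are illustrative; confirm the exact regex dialect the Configure Regex Filters dialog accepts.

```python
# Hypothetical include/exclude regex filters and how they combine.
import re

INCLUDE = [r"^https://mango\.mangoapps\.com/company/.*"]  # keep this section only
EXCLUDE = [r".*/login.*", r".*\?search=.*"]               # drop login and search pages

def url_passes(url: str) -> bool:
    # A URL is crawled if it matches an include pattern and no exclude pattern.
    included = any(re.match(p, url) for p in INCLUDE)
    excluded = any(re.match(p, url) for p in EXCLUDE)
    return included and not excluded

print(url_passes("https://mango.mangoapps.com/company/reports"))  # True
print(url_passes("https://mango.mangoapps.com/company/login"))    # False
print(url_passes("https://mango.mangoapps.com/ec2/"))             # False
```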

Avoid crawling large public sites, such as Wikipedia, without appropriate filters; doing so can take a very long time and may overload your system.


Crawl Schedule

Crawl scheduling allows you to set how often the system checks for updates on the target web pages, ensuring your knowledge base stays up-to-date.

Settings

  • Crawl Schedule Toggle: Enable or disable scheduled crawling.

  • Frequency: Choose how often the system crawls the specified URLs. Preset options include Once Every 8 Hours, Once a Day, and Once a Week; you can also configure the frequency manually.

Select a frequency that aligns with how often the source content changes.


Once all configurations are complete, click Save to activate the crawler. Saving applies the settings and starts the crawl process based on the defined parameters.
