Extracting Website Content for Knowledge Base

This guide shows you how to use external web scraping tools to extract content from websites and import it into your SipPulse AI Knowledge Base. We recommend using specialized tools designed specifically for web data extraction.

Why Use External Scraping Tools?

External web scraping tools offer several advantages:

  • Specialized functionality: Tools like Web Scraper.io and Octoparse are purpose-built for data extraction
  • Visual interfaces: Point-and-click selection of content without coding
  • Export flexibility: Direct export to CSV, JSON, or Excel formats
  • Scheduling: Many tools support automated periodic scraping
  • Better control: Fine-grained control over what content to extract

1. Web Scraper.io (Free Chrome Extension)

Best for: Non-technical users who want a free, simple solution

  • Website: https://webscraper.io/
  • Cost: Free (Chrome extension with unlimited local use)
  • Export formats: CSV, XLSX
  • Difficulty: Beginner-friendly

Why We Recommend It

Web Scraper.io requires no coding, works directly in your browser, and exports to a CSV format that can be imported directly into the Knowledge Base.

2. Octoparse

Best for: Users who need AI-assisted scraping with more features

  • Website: https://www.octoparse.com/
  • Cost: Free tier available, paid plans for advanced features
  • Export formats: CSV, Excel, JSON, Google Sheets
  • Difficulty: Beginner-friendly with AI auto-detection

3. ParseHub

Best for: Complex websites with dynamic content

  • Website: https://www.parsehub.com/
  • Cost: Free tier (5 projects), paid plans available
  • Export formats: CSV, JSON, Excel
  • Difficulty: Intermediate

Tutorial: Scraping the SipPulse AI Documentation

Let's walk through a practical example: extracting content from the SipPulse AI documentation at https://docs.sippulse.ai to create a knowledge base that an agent can use to answer questions about the platform.

Step 1: Install Web Scraper.io

  1. Open Google Chrome
  2. Go to the Web Scraper.io extension page in the Chrome Web Store
  3. Click Add to Chrome

Step 2: Create a Sitemap for SipPulse Docs

  1. Navigate to https://docs.sippulse.ai
  2. Press F12 to open Chrome DevTools
  3. Click the Web Scraper tab
  4. Click Create new sitemap > Create Sitemap
  5. Configure:
    • Sitemap name: sippulse-docs
    • Start URL: https://docs.sippulse.ai

Step 3: Create a Navigation Link Selector

  1. Click Add new selector
  2. Configure the first selector to capture sidebar navigation:
    • Id: nav-links
    • Type: Link
    • Selector: Click Select, then click on sidebar links (like "Get Started", "Agents", etc.)
    • Check Multiple (to capture all links)
  3. Click Done selecting and Save selector

Step 4: Create Content Selectors

  1. Click on nav-links selector, then Add new selector (child selector)
  2. Configure title selector:
    • Id: title
    • Type: Text
    • Selector: Click Select, then click on the page title (h1)
  3. Add another child selector for content:
    • Id: content
    • Type: Text
    • Selector: .vp-doc (the main content container in VitePress)
  4. Add URL selector:
    • Id: page-url
    • Type: Text
    • Selector: _url_ (special selector for current URL)
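
Before running the full scrape, you can sanity-check these selectors with a few lines of Python. This is a quick sketch outside the Web Scraper.io workflow; it assumes the requests and beautifulsoup4 packages are installed and that .vp-doc is indeed the VitePress content container on the page you test.

python
# Quick sanity check for the Step 4 selectors (pip install requests beautifulsoup4).
import requests
from bs4 import BeautifulSoup

url = "https://docs.sippulse.ai"  # any page reachable from the sidebar works
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

title = soup.select_one("h1")          # same element as the title selector
content = soup.select_one(".vp-doc")   # same container as the content selector

print("title:", title.get_text(strip=True) if title else "not found")
print("content words:", len(content.get_text(" ", strip=True).split()) if content else "not found")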

Step 5: Run the Scraper

  1. Click Sitemap (sippulse-docs) > Scrape
  2. Set Request interval: 2000 (2 seconds between requests)
  3. Click Start scraping
  4. Wait for completion (the scraper will navigate through all pages)

Step 6: Export and Clean

  1. Click Sitemap (sippulse-docs) > Export data as CSV
  2. Open the CSV in Google Sheets or Excel
  3. Clean the data:
    • Remove rows with empty content
    • Remove duplicate URLs
    • Rename columns to: title, content, url
  4. Save as CSV (UTF-8 encoding)
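
If you prefer to script this cleanup, the same three operations take a few lines of pandas. This is a sketch only: the raw export's column names (nav-links, page-url, and so on) depend on the selector ids you chose in Steps 3 and 4, so adjust them to match your file.

python
# Clean the Web Scraper.io export with pandas (pip install pandas).
# Column names assume the selector ids used in this tutorial.
import pandas as pd

df = pd.read_csv("sippulse-docs.csv")

# Remove rows with empty content
df = df.dropna(subset=["content"])
df = df[df["content"].str.strip() != ""]

# Remove duplicate URLs
df = df.drop_duplicates(subset=["page-url"])

# Keep only the columns the Knowledge Base needs and rename them
df = df[["title", "content", "page-url"]].rename(columns={"page-url": "url"})

# Save as UTF-8 CSV, ready for import
df.to_csv("sippulse_docs_clean.csv", index=False, encoding="utf-8")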

Step 7: Import into Knowledge Base

  1. In SipPulse AI, go to Knowledge Base
  2. Click + Create Table > Upload File
  3. Configure:
    • Name: sippulse_docs_kb
    • Description: "SipPulse AI platform documentation for answering user questions about features and usage"
    • Embedding Model: text-embedding-3-large
  4. Upload your CSV
  5. Click Save

Step 8: Test Your Knowledge Base

  1. Click on the created table
  2. Click Query
  3. Test with questions like:
    • "How do I create an agent?"
    • "What models are available for text-to-speech?"
    • "How do I configure webhooks?"
  4. Verify the returned snippets are relevant

CSV Format Requirements

For best results, structure your CSV with these columns:

Column   | Required    | Description
content  | Yes         | The main text content to be vectorized
title    | Recommended | Page or section title
url      | Recommended | Source URL for reference
category | Optional    | Category or section name

Example CSV:

csv
title,content,url,category
"Getting Started","Welcome to our platform. This guide will help you...","https://docs.example.com/start","basics"
"API Authentication","All API requests require authentication using...","https://docs.example.com/api/auth","api"

JSON Format Alternative

You can also use JSON format for more structured data:

json
[
  {
    "title": "Getting Started",
    "content": "Welcome to our platform. This guide will help you...",
    "url": "https://docs.example.com/start",
    "category": "basics"
  },
  {
    "title": "API Authentication",
    "content": "All API requests require authentication using...",
    "url": "https://docs.example.com/api/auth",
    "category": "api"
  }
]
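
If your tool exports JSON in this shape, it is straightforward to flatten it into the CSV layout shown above. A minimal sketch, assuming the field names from the example and a file called docs_export.json:

python
# Convert a JSON export into a Knowledge Base-ready CSV (standard library only).
import csv
import json

with open("docs_export.json", encoding="utf-8") as f:
    records = json.load(f)

with open("docs_export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["title", "content", "url", "category"],
        extrasaction="ignore",  # drop any extra fields the scraper added
    )
    writer.writeheader()
    writer.writerows(records)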

Best Practices

Content Quality

  • Remove navigation elements: Exclude menus, footers, and sidebars from your selectors
  • Keep chunks reasonable: Aim for 500-2000 words per entry for optimal semantic search (a simple splitter is sketched after this list)
  • Include context: Add titles and URLs so the agent can reference sources
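
If some pages run far past that range, you can split them into multiple rows before import. A rough word-based splitter; the page dictionary below is just a placeholder for one row of your cleaned data:

python
# Split long pages into word-based chunks that stay near the suggested range.
def split_into_chunks(text, max_words=1500):
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

page = {"title": "Example page", "url": "https://docs.example.com/page", "content": "..."}
rows = [
    {"title": page["title"], "content": chunk, "url": page["url"]}
    for chunk in split_into_chunks(page["content"])
]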

Respecting Websites

  • Check robots.txt: Ensure the website allows scraping (see the sketch after this list)
  • Use delays: Set reasonable intervals between requests (2+ seconds)
  • Don't overload servers: Limit concurrent requests
  • Respect terms of service: Some websites prohibit automated scraping
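
The robots.txt check and the delay can both be handled with Python's standard library if you want to verify them programmatically. A minimal sketch; the page URLs are placeholders for whatever you plan to scrape:

python
# Check robots.txt permissions and pause between requests (standard library only).
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.sippulse.ai/robots.txt")
rp.read()

pages = ["https://docs.sippulse.ai/", "https://docs.sippulse.ai/guide"]
for page in pages:
    if rp.can_fetch("*", page):
        print("allowed:", page)
        time.sleep(2)  # mirror the 2-second request interval from Step 5
    else:
        print("disallowed by robots.txt:", page)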

Keeping Content Updated

Since scraped content is static, establish a routine to:

  1. Re-run your scraper periodically (weekly/monthly)
  2. Export new CSV files
  3. Delete and recreate the Knowledge Base table with fresh data
  4. Or use the Synchronize function after adding new rows
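
Before deciding between recreating the table and synchronizing, it can help to see how much actually changed between two scrapes. A sketch with pandas; the file names are placeholders, and both files are assumed to use the title/content/url columns described above:

python
# Compare an old and a new export to count new and changed pages (pip install pandas).
import pandas as pd

old = pd.read_csv("docs_old.csv")
new = pd.read_csv("docs_new.csv")

merged = new.merge(old, on="url", how="left", suffixes=("", "_old"))

added = merged[merged["content_old"].isna()]
changed = merged[merged["content_old"].notna() & (merged["content"] != merged["content_old"])]

print(len(added), "new pages,", len(changed), "changed pages")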

Troubleshooting

Q: The scraper isn't capturing all pages

Check your link selector depth and ensure you're following all navigation links. Increase the maximum pages limit if needed.

Q: Content has HTML tags or formatting issues

Most scrapers extract raw text, but some may include HTML. Clean the data in a spreadsheet before importing.
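
If tags do slip into the content column, you can strip them in bulk instead of by hand. A sketch assuming the cleaned CSV from Step 6 and the beautifulsoup4 package:

python
# Strip residual HTML tags from the content column (pip install pandas beautifulsoup4).
import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv("sippulse_docs_clean.csv")
df["content"] = df["content"].apply(
    lambda value: BeautifulSoup(str(value), "html.parser").get_text(" ", strip=True)
)
df.to_csv("sippulse_docs_clean.csv", index=False, encoding="utf-8")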

Q: The CSV import fails

Ensure your CSV:

  • Uses UTF-8 encoding
  • Has proper column headers
  • Doesn't exceed the maximum file size (check platform limits)
  • Has content in each row
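
A small pre-flight script can catch most of these issues before you upload. A sketch using only the standard library; check your plan's actual file-size limit and adjust accordingly:

python
# Pre-flight checks before uploading the CSV to the Knowledge Base.
import csv
import os

path = "sippulse_docs_clean.csv"
print("file size (MB):", round(os.path.getsize(path) / 1_000_000, 2))

# Reading with encoding="utf-8" raises UnicodeDecodeError if the file is not UTF-8
with open(path, encoding="utf-8") as f:
    reader = csv.DictReader(f)
    if not reader.fieldnames or "content" not in reader.fieldnames:
        print("missing required 'content' column")
    empty_rows = sum(1 for row in reader if not (row.get("content") or "").strip())

print("rows with empty content:", empty_rows)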

Next Steps

After importing your website content:

  1. Test semantic search: Use the Query function to verify content was indexed correctly
  2. Connect to an agent: Add the Knowledge Base as a tool in your agent configuration
  3. Validate responses: Test the agent with questions your documentation should answer

For a complete walkthrough of creating an agent with Knowledge Base, see our Support Agent Tutorial.