Extracting Website Content for Knowledge Base

This guide shows you how to use external web scraping tools to extract content from websites and import it into your SipPulse AI Knowledge Base. We recommend using specialized tools designed specifically for web data extraction.

Why Use External Scraping Tools?

External web scraping tools offer several advantages:

  • Specialized functionality: Tools like Web Scraper.io and Octoparse are purpose-built for data extraction
  • Visual interfaces: Point-and-click selection of content without coding
  • Export flexibility: Direct export to CSV, JSON, or Excel formats
  • Scheduling: Many tools support automated periodic scraping
  • Better control: Fine-grained control over what content to extract

1. Web Scraper.io (Free Chrome Extension)

Best for: Non-technical users who want a free, simple solution

  • Website: https://webscraper.io/
  • Cost: Free (Chrome extension with unlimited local use)
  • Export formats: CSV, XLSX
  • Difficulty: Beginner-friendly

Why We Recommend It

Web Scraper.io requires no coding, works directly in your browser, and exports to a CSV format that can be imported directly into the Knowledge Base.

2. Octoparse

Best for: Users who need AI-assisted scraping with more features

  • Website: https://www.octoparse.com/
  • Cost: Free tier available, paid plans for advanced features
  • Export formats: CSV, Excel, JSON, Google Sheets
  • Difficulty: Beginner-friendly with AI auto-detection

3. ParseHub

Best for: Complex websites with dynamic content

  • Website: https://www.parsehub.com/
  • Cost: Free tier (5 projects), paid plans available
  • Export formats: CSV, JSON, Excel
  • Difficulty: Intermediate

Tutorial: Scraping the SipPulse AI Documentation

Let's walk through a practical example: extracting content from the SipPulse AI documentation at https://docs.sippulse.ai to create a knowledge base that an agent can use to answer questions about the platform.

Step 1: Install Web Scraper.io

  1. Open Google Chrome
  2. Go to the Web Scraper.io extension page in the Chrome Web Store
  3. Click Add to Chrome

Step 2: Create a Sitemap for SipPulse Docs

  1. Navigate to https://docs.sippulse.ai
  2. Press F12 to open Chrome DevTools
  3. Click the Web Scraper tab
  4. Click Create new sitemap > Create Sitemap
  5. Configure:
    • Sitemap name: sippulse-docs
    • Start URL: https://docs.sippulse.ai

Step 3: Create a Navigation Link Selector

  1. Click Add new selector
  2. Configure the first selector to capture sidebar navigation:
    • Id: nav-links
    • Type: Link
    • Selector: Click Select, then click on sidebar links (like "Get Started", "Agents", etc.)
    • Check Multiple (to capture all links)
  3. Click Done selecting and Save selector

Step 4: Create Content Selectors

  1. Click on nav-links selector, then Add new selector (child selector)
  2. Configure title selector:
    • Id: title
    • Type: Text
    • Selector: Click Select, then click on the page title (h1)
  3. Add another child selector for content:
    • Id: content
    • Type: Text
    • Selector: .vp-doc (the main content container in VitePress)
  4. Add URL selector:
    • Id: page-url
    • Type: Text
    • Selector: _url_ (special selector for current URL)
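
Before running the full scrape, you can sanity-check these selectors with a few lines of Python. This is a quick sketch outside the Web Scraper.io workflow; it assumes the requests and beautifulsoup4 packages are installed and that .vp-doc is indeed the VitePress content container on the page you test.

python
# Quick sanity check for the Step 4 selectors (pip install requests beautifulsoup4).
import requests
from bs4 import BeautifulSoup

url = "https://docs.sippulse.ai"  # any page reachable from the sidebar works
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

title = soup.select_one("h1")          # same element as the title selector
content = soup.select_one(".vp-doc")   # same container as the content selector

print("title:", title.get_text(strip=True) if title else "not found")
print("content words:", len(content.get_text(" ", strip=True).split()) if content else "not found")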

Step 5: Run the Scraper

  1. Click Sitemap (sippulse-docs) > Scrape
  2. Set Request interval: 2000 (2 seconds between requests)
  3. Click Start scraping
  4. Wait for completion (the scraper will navigate through all pages)

Step 6: Export and Clean

  1. Click Sitemap (sippulse-docs) > Export data as CSV
  2. Open the CSV in Google Sheets or Excel
  3. Clean the data:
    • Remove rows with empty content
    • Remove duplicate URLs
    • Rename columns to: title, content, url
  4. Save as CSV (UTF-8 encoding)
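
If you prefer to script this cleanup, the same three operations take a few lines of pandas. This is a sketch only: the raw export's column names (nav-links, page-url, and so on) depend on the selector ids you chose in Steps 3 and 4, so adjust them to match your file.

python
# Clean the Web Scraper.io export with pandas (pip install pandas).
# Column names assume the selector ids used in this tutorial.
import pandas as pd

df = pd.read_csv("sippulse-docs.csv")

# Remove rows with empty content
df = df.dropna(subset=["content"])
df = df[df["content"].str.strip() != ""]

# Remove duplicate URLs
df = df.drop_duplicates(subset=["page-url"])

# Keep only the columns the Knowledge Base needs and rename them
df = df[["title", "content", "page-url"]].rename(columns={"page-url": "url"})

# Save as UTF-8 CSV, ready for import
df.to_csv("sippulse_docs_clean.csv", index=False, encoding="utf-8")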

Step 7: Import into Knowledge Base

  1. In SipPulse AI, go to Knowledge Base
  2. Click + Create Table > Upload File
  3. Configure:
    • Name: sippulse_docs_kb
    • Description: "SipPulse AI platform documentation for answering user questions about features and usage"
    • Embedding Model: text-embedding-3-large
  4. Upload your CSV
  5. Click Save

Step 8: Test Your Knowledge Base

  1. Click on the created table
  2. Click Query
  3. Test with questions like:
    • "How do I create an agent?"
    • "What models are available for text-to-speech?"
    • "How do I configure webhooks?"
  4. Verify the returned snippets are relevant

CSV Format Requirements

For best results, structure your CSV with these columns:

Column   | Required    | Description
content  | Yes         | The main text content to be vectorized
title    | Recommended | Page or section title
url      | Recommended | Source URL for reference
category | Optional    | Category or section name

Example CSV:

csv
title,content,url,category
"Getting Started","Welcome to our platform. This guide will help you...","https://docs.example.com/start","basics"
"API Authentication","All API requests require authentication using...","https://docs.example.com/api/auth","api"

JSON Format Alternative

You can also use JSON format for more structured data:

json
[
  {
    "title": "Getting Started",
    "content": "Welcome to our platform. This guide will help you...",
    "url": "https://docs.example.com/start",
    "category": "basics"
  },
  {
    "title": "API Authentication",
    "content": "All API requests require authentication using...",
    "url": "https://docs.example.com/api/auth",
    "category": "api"
  }
]
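
If your tool exports JSON in this shape, it is straightforward to flatten it into the CSV layout shown above. A minimal sketch, assuming the field names from the example and a file called docs_export.json:

python
# Convert a JSON export into a Knowledge Base-ready CSV (standard library only).
import csv
import json

with open("docs_export.json", encoding="utf-8") as f:
    records = json.load(f)

with open("docs_export.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["title", "content", "url", "category"],
        extrasaction="ignore",  # drop any extra fields the scraper added
    )
    writer.writeheader()
    writer.writerows(records)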

Best Practices

Content Quality

  • Remove navigation elements: Exclude menus, footers, and sidebars from your selectors
  • Keep chunks reasonable: Aim for 500-2000 words per entry for optimal semantic search (a simple splitter is sketched after this list)
  • Include context: Add titles and URLs so the agent can reference sources
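
If some pages run far past that range, you can split them into multiple rows before import. A rough word-based splitter; the page dictionary below is just a placeholder for one row of your cleaned data:

python
# Split long pages into word-based chunks that stay near the suggested range.
def split_into_chunks(text, max_words=1500):
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

page = {"title": "Example page", "url": "https://docs.example.com/page", "content": "..."}
rows = [
    {"title": page["title"], "content": chunk, "url": page["url"]}
    for chunk in split_into_chunks(page["content"])
]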

Respecting Websites

  • Check robots.txt: Ensure the website allows scraping (see the sketch after this list)
  • Use delays: Set reasonable intervals between requests (2+ seconds)
  • Don't overload servers: Limit concurrent requests
  • Respect terms of service: Some websites prohibit automated scraping
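
The robots.txt check and the delay can both be handled with Python's standard library if you want to verify them programmatically. A minimal sketch; the page URLs are placeholders for whatever you plan to scrape:

python
# Check robots.txt permissions and pause between requests (standard library only).
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://docs.sippulse.ai/robots.txt")
rp.read()

pages = ["https://docs.sippulse.ai/", "https://docs.sippulse.ai/guide"]
for page in pages:
    if rp.can_fetch("*", page):
        print("allowed:", page)
        time.sleep(2)  # mirror the 2-second request interval from Step 5
    else:
        print("disallowed by robots.txt:", page)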

Keeping Content Updated

Since scraped content is static, establish a routine to:

  1. Re-run your scraper periodically (weekly/monthly)
  2. Export new CSV files
  3. Delete and recreate the Knowledge Base table with fresh data
  4. Or use the Synchronize function after adding new rows
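
Before deciding between recreating the table and synchronizing, it can help to see how much actually changed between two scrapes. A sketch with pandas; the file names are placeholders, and both files are assumed to use the title/content/url columns described above:

python
# Compare an old and a new export to count new and changed pages (pip install pandas).
import pandas as pd

old = pd.read_csv("docs_old.csv")
new = pd.read_csv("docs_new.csv")

merged = new.merge(old, on="url", how="left", suffixes=("", "_old"))

added = merged[merged["content_old"].isna()]
changed = merged[merged["content_old"].notna() & (merged["content"] != merged["content_old"])]

print(len(added), "new pages,", len(changed), "changed pages")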

Troubleshooting

Q: The scraper isn't capturing all pages

Check your link selector depth and ensure you're following all navigation links. Increase the maximum pages limit if needed.

Q: Content has HTML tags or formatting issues

Most scrapers extract raw text, but some may include HTML. Clean the data in a spreadsheet before importing.
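
If tags do slip into the content column, you can strip them in bulk instead of by hand. A sketch assuming the cleaned CSV from Step 6 and the beautifulsoup4 package:

python
# Strip residual HTML tags from the content column (pip install pandas beautifulsoup4).
import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv("sippulse_docs_clean.csv")
df["content"] = df["content"].apply(
    lambda value: BeautifulSoup(str(value), "html.parser").get_text(" ", strip=True)
)
df.to_csv("sippulse_docs_clean.csv", index=False, encoding="utf-8")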

Q: The CSV import fails

Ensure your CSV:

  • Uses UTF-8 encoding
  • Has proper column headers
  • Doesn't exceed the maximum file size (check platform limits)
  • Has content in each row
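
A small pre-flight script can catch most of these issues before you upload. A sketch using only the standard library; check your plan's actual file-size limit and adjust accordingly:

python
# Pre-flight checks before uploading the CSV to the Knowledge Base.
import csv
import os

path = "sippulse_docs_clean.csv"
print("file size (MB):", round(os.path.getsize(path) / 1_000_000, 2))

# Reading with encoding="utf-8" raises UnicodeDecodeError if the file is not UTF-8
with open(path, encoding="utf-8") as f:
    reader = csv.DictReader(f)
    if not reader.fieldnames or "content" not in reader.fieldnames:
        print("missing required 'content' column")
    empty_rows = sum(1 for row in reader if not (row.get("content") or "").strip())

print("rows with empty content:", empty_rows)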

Next Steps

After importing your website content:

  1. Test semantic search: Use the Query function to verify content was indexed correctly
  2. Connect to an agent: Add the Knowledge Base as a tool in your agent configuration
  3. Validate responses: Test the agent with questions your documentation should answer

For a complete walkthrough of creating an agent with Knowledge Base, see our Support Agent Tutorial.