Extracting Website Content for Knowledge Base
This guide shows you how to use external web scraping tools to extract content from websites and import it into your SipPulse AI Knowledge Base. We recommend specialized tools designed specifically for web data extraction.
Why Use External Scraping Tools?
External web scraping tools offer several advantages:
- Specialized functionality: Tools like Web Scraper.io and Octoparse are purpose-built for data extraction
- Visual interfaces: Point-and-click selection of content without coding
- Export flexibility: Direct export to CSV, JSON, or Excel formats
- Scheduling: Many tools support automated periodic scraping
- Better control: Fine-grained control over what content to extract
Recommended Tools
1. Web Scraper.io (Free Chrome Extension)
Best for: Non-technical users who want a free, simple solution
- Website: https://webscraper.io/
- Cost: Free (Chrome extension with unlimited local use)
- Export formats: CSV, XLSX
- Difficulty: Beginner-friendly
Why We Recommend It
Web Scraper.io requires no coding, works directly in your browser, and exports to CSV format that imports directly into Knowledge Base.
2. Octoparse
Best for: Users who need AI-assisted scraping with more features
- Website: https://www.octoparse.com/
- Cost: Free tier available, paid plans for advanced features
- Export formats: CSV, Excel, JSON, Google Sheets
- Difficulty: Beginner-friendly with AI auto-detection
3. ParseHub
Best for: Complex websites with dynamic content
- Website: https://www.parsehub.com/
- Cost: Free tier (5 projects), paid plans available
- Export formats: CSV, JSON, Excel
- Difficulty: Intermediate
Tutorial: Scraping the SipPulse AI Documentation
Let's walk through a practical example: extracting content from the SipPulse AI documentation at https://docs.sippulse.ai to create a knowledge base that an agent can use to answer questions about the platform.
Step 1: Install Web Scraper.io
- Open Google Chrome
- Go to the Web Scraper.io extension page in the Chrome Web Store
- Click Add to Chrome
Step 2: Create a Sitemap for SipPulse Docs
- Navigate to `https://docs.sippulse.ai`
- Press F12 to open Chrome DevTools
- Click the Web Scraper tab
- Click Create new sitemap > Create Sitemap
- Configure:
  - Sitemap name: `sippulse-docs`
  - Start URL: `https://docs.sippulse.ai`
Step 3: Configure Navigation Links
- Click Add new selector
- Configure the first selector to capture the sidebar navigation:
  - Id: `nav-links`
  - Type: `Link`
  - Selector: Click Select, then click on sidebar links (like "Get Started", "Agents", etc.)
  - Check Multiple (to capture all links)
- Click Done selecting and Save selector
Step 4: Create Content Selectors
- Click on the `nav-links` selector, then Add new selector (child selector)
- Configure the title selector:
  - Id: `title`
  - Type: `Text`
  - Selector: Click Select, then click on the page title (`h1`)
- Add another child selector for the content:
  - Id: `content`
  - Type: `Text`
  - Selector: `.vp-doc` (the main content container in VitePress)
- Add a URL selector:
  - Id: `page-url`
  - Type: `Text`
  - Selector: `_url_` (special selector for the current URL)
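Alternatively, Web Scraper.io can import an entire sitemap as JSON (Create new sitemap > Import Sitemap), which is handy for versioning your configuration. The sketch below generates a sitemap roughly equivalent to Steps 2-4; the selector schema follows typical Web Scraper.io sitemap exports, and the `.VPSidebar a` CSS selector for the sidebar links is an assumption to verify in DevTools:

```python
import json

# A sketch of a Web Scraper.io sitemap mirroring Steps 2-4. The schema
# ("_id", "startUrl", "selectors") follows typical sitemap exports; verify
# the CSS selectors against the live page before importing.
sitemap = {
    "_id": "sippulse-docs",
    "startUrl": ["https://docs.sippulse.ai"],
    "selectors": [
        {   # Step 3: follow every sidebar navigation link
            "id": "nav-links",
            "type": "SelectorLink",
            "parentSelectors": ["_root"],
            "selector": ".VPSidebar a",  # assumption: inspect the page to confirm
            "multiple": True,
        },
        {   # Step 4: page title
            "id": "title",
            "type": "SelectorText",
            "parentSelectors": ["nav-links"],
            "selector": "h1",
            "multiple": False,
        },
        {   # Step 4: main content container in VitePress
            "id": "content",
            "type": "SelectorText",
            "parentSelectors": ["nav-links"],
            "selector": ".vp-doc",
            "multiple": False,
        },
        # The page-url selector (_url_) from Step 4 is omitted here;
        # add it in the UI if your version supports it.
    ],
}

print(json.dumps(sitemap, indent=2))  # paste the output into Import Sitemap
```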
Step 5: Run the Scraper
- Click Sitemap (sippulse-docs) > Scrape
- Set Request interval: `2000` (2 seconds between requests)
- Click Start scraping
- Wait for completion (the scraper will navigate through all pages)
Step 6: Export and Clean
- Click Sitemap (sippulse-docs) > Export data as CSV
- Open the CSV in Google Sheets or Excel
- Clean the data:
  - Remove rows with empty `content`
  - Remove duplicate URLs
  - Rename the columns to `title`, `content`, `url`
- Save as CSV (UTF-8 encoding)
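If you prefer to script Step 6 instead of cleaning by hand, a few lines of pandas cover the same checklist. A minimal sketch; the file names and the raw column names (taken from the selector ids in Step 4) are assumptions to adjust against your actual export:

```python
import pandas as pd

# Load the raw Web Scraper.io export (file name is an assumption).
df = pd.read_csv("sippulse-docs.csv")

# Keep only the needed columns and rename them to the Knowledge Base schema;
# "page-url" comes from the selector id in Step 4 -- adjust if yours differs.
df = df.rename(columns={"page-url": "url"})[["title", "content", "url"]]

# Remove rows with empty content, then duplicate URLs.
df = df.dropna(subset=["content"])
df = df[df["content"].str.strip() != ""]
df = df.drop_duplicates(subset=["url"])

# Save as UTF-8 CSV, ready for import.
df.to_csv("sippulse-docs-clean.csv", index=False, encoding="utf-8")
```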
Step 7: Import into Knowledge Base
- In SipPulse AI, go to Knowledge Base
- Click + Create Table > Upload File
- Configure:
  - Name: `sippulse_docs_kb`
  - Description: "SipPulse AI platform documentation for answering user questions about features and usage"
  - Embedding Model: `text-embedding-3-large`
- Upload your CSV
- Click Save
Step 8: Test Your Knowledge Base
- Click on the created table
- Click Query
- Test with questions like:
- "How do I create an agent?"
- "What models are available for text-to-speech?"
- "How do I configure webhooks?"
- Verify the returned snippets are relevant
CSV Format Requirements
For best results, structure your CSV with these columns (a validation sketch follows the table):
| Column | Required | Description |
|---|---|---|
| `content` | Yes | The main text content to be vectorized |
| `title` | Recommended | Page or section title |
| `url` | Recommended | Source URL for reference |
| `category` | Optional | Category or section name |
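Before importing, you can sanity-check an exported file against this layout with a short script. A minimal sketch using Python's standard library; the file name is hypothetical:

```python
import csv

REQUIRED = {"content"}
RECOMMENDED = {"title", "url"}

# Check headers and flag empty content rows ("export.csv" is a placeholder).
with open("export.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    headers = set(reader.fieldnames or [])
    if REQUIRED - headers:
        raise SystemExit(f"Missing required column(s): {REQUIRED - headers}")
    for name in RECOMMENDED - headers:
        print(f"Warning: recommended column '{name}' not found")
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not (row.get("content") or "").strip():
            print(f"Row {i}: empty content")
```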
Example CSV:
```csv
title,content,url,category
"Getting Started","Welcome to our platform. This guide will help you...","https://docs.example.com/start","basics"
"API Authentication","All API requests require authentication using...","https://docs.example.com/api/auth","api"
```
JSON Format Alternative
You can also use JSON format for more structured data:
```json
[
  {
    "title": "Getting Started",
    "content": "Welcome to our platform. This guide will help you...",
    "url": "https://docs.example.com/start",
    "category": "basics"
  },
  {
    "title": "API Authentication",
    "content": "All API requests require authentication using...",
    "url": "https://docs.example.com/api/auth",
    "category": "api"
  }
]
```
Best Practices
Content Quality
- Remove navigation elements: Exclude menus, footers, and sidebars from your selectors
- Keep chunks reasonable: Aim for 500-2000 words per entry for optimal semantic search (see the splitting sketch after this list)
- Include context: Add titles and URLs so the agent can reference sources
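A simple way to stay inside that range is to split oversized entries before import. A minimal sketch that chunks by word count; the 1500-word limit and the helper name are illustrative:

```python
def chunk_words(text: str, max_words: int = 1500) -> list[str]:
    """Split text into chunks of at most max_words words (illustrative helper)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Example: a 4000-word page becomes chunks of 1500, 1500, and 1000 words.
chunks = chunk_words("word " * 4000)
print([len(c.split()) for c in chunks])
```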
Respecting Websites
- Check robots.txt: Ensure the website allows scraping (the sketch after this list automates the check)
- Use delays: Set reasonable intervals between requests (2+ seconds)
- Don't overload servers: Limit concurrent requests
- Respect terms of service: Some websites prohibit automated scraping
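If you script your own scraper rather than use an extension, the same courtesies can be enforced in code. A minimal sketch using only Python's standard library; the base URL, page list, and 2-second delay are assumptions:

```python
import time
import urllib.robotparser
from urllib.request import urlopen

BASE = "https://docs.example.com"  # assumption: replace with your target site

# Check robots.txt before fetching anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

for url in [f"{BASE}/start", f"{BASE}/api/auth"]:  # illustrative page list
    if not rp.can_fetch("*", url):
        print(f"Skipping {url}: disallowed by robots.txt")
        continue
    with urlopen(url) as resp:
        html = resp.read()
    print(f"Fetched {url} ({len(html)} bytes)")
    time.sleep(2)  # keep 2+ seconds between requests
```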
Keeping Content Updated
Since scraped content is static, establish a routine to:
- Re-run your scraper periodically (weekly/monthly)
- Export new CSV files
- Delete and recreate the Knowledge Base table with fresh data
- Or use the Synchronize function after adding new rows
Troubleshooting
Q: The scraper isn't capturing all pages
Check your link selector depth and ensure you're following all navigation links. Increase the maximum pages limit if needed.
Q: Content has HTML tags or formatting issues
Most scrapers extract raw text, but some may include HTML. Clean the data in a spreadsheet before importing.
Q: The CSV import fails
Ensure your CSV:
- Uses UTF-8 encoding (see the re-encoding sketch after this list)
- Has proper column headers
- Doesn't exceed the maximum file size (check platform limits)
- Has content in each row
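If the failure is encoding-related, re-saving the file as UTF-8 usually fixes it. A sketch; the source encoding is an assumption (try `latin-1` or `cp1252` for Excel exports on Windows):

```python
# Re-encode a CSV to UTF-8; the source encoding is an assumption.
with open("export.csv", encoding="latin-1") as src:
    data = src.read()

with open("export-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```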
Next Steps
After importing your website content:
- Test semantic search: Use the Query function to verify content was indexed correctly
- Connect to an agent: Add the Knowledge Base as a tool in your agent configuration
- Validate responses: Test the agent with questions your documentation should answer
For a complete walkthrough of creating an agent with Knowledge Base, see our Support Agent Tutorial.
