Step 4: Data Sources

Connect external knowledge bases to give your agent domain-specific context beyond its training data.


Supported Source Types

Type              Description                                                               Refresh
Website           Crawl and index web pages with configurable depth                         Auto
PDF / Document    Upload PDFs, Word docs, or text files                                     Manual
Git Repository    Index code, README files, and documentation                               Auto
CSV / JSON        Structured data files with column mapping                                 Auto
REST API          Fetch data from any HTTP endpoint                                         Auto
SQL Database      Query PostgreSQL, MySQL, SQLite, or MSSQL                                 Auto
Vector Database   Connect existing Pinecone, Weaviate, Qdrant, Supabase, or Chroma stores   Live
Cloud Storage     Index files from S3, GCS, Azure Blob, or Cloudflare R2                    Auto


Indexing & Chunking

Each data source supports the following configurable settings:

  • Chunking strategy — Fixed size, sentence, paragraph, semantic, recursive, or custom regex

  • Chunk size — Number of characters per chunk (default varies by type)

  • Chunk overlap — Character overlap between chunks for context continuity

  • Embedding model — Choose from OpenAI, Cohere, Voyage AI, or local models (BGE, E5)
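To make the interaction between chunk size and chunk overlap concrete, here is a minimal sketch of the fixed-size strategy. The function name chunk_text and its default values are assumptions chosen for illustration; they are not taken from the product, whose actual chunker may behave differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears intact in one of the two neighbours.
    Hypothetical helper; the defaults are illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

A larger overlap preserves more context across boundaries but produces more chunks for the same text, so it increases index size and embedding cost.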


Refresh Intervals

Data sources marked "Auto" support scheduled re-indexing. Set the refresh interval in minutes, or leave it at 0 to refresh manually only.
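The interval rule can be sketched as a small scheduling check. This is only an illustration of the "0 means manual-only" convention described above; the names next_refresh and is_due are hypothetical and do not correspond to any documented API.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_refresh(last_indexed: datetime, interval_minutes: int) -> Optional[datetime]:
    """Return when the source is next due, or None for manual-only refresh."""
    if interval_minutes < 0:
        raise ValueError("interval_minutes must be 0 or greater")
    if interval_minutes == 0:
        return None  # 0 disables scheduled re-indexing
    return last_indexed + timedelta(minutes=interval_minutes)


def is_due(last_indexed: datetime, interval_minutes: int, now: datetime) -> bool:
    """True when a scheduled re-index should run at time `now`."""
    due_at = next_refresh(last_indexed, interval_minutes)
    return due_at is not None and now >= due_at


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0)
    print(is_due(last, 60, datetime(2024, 1, 1, 13, 0)))
    print(is_due(last, 0, datetime(2024, 1, 2, 12, 0)))
```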


How It's Used

Indexed data is available to the agent via the RAG Pipeline tool. When the agent receives a query, it searches the connected data sources for relevant context and includes it in its reasoning.
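The retrieval step described above can be illustrated with a toy example. It is a sketch under strong assumptions: word counts stand in for a real embedding model, the chunks live in a plain list rather than an index, and the names embed, retrieve, and build_prompt are invented for illustration. The actual RAG Pipeline tool may work quite differently.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Prepend the retrieved context to the question passed to the agent."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    indexed_chunks = [
        "refunds are processed within five days",
        "our office is in berlin",
        "shipping takes two weeks",
    ]
    print(build_prompt("how are refunds processed", indexed_chunks, top_k=1))
```

In a real deployment the similarity search would run against vectors produced by the configured embedding model, but the flow is the same: embed the query, rank the indexed chunks, and pass the best matches to the agent as context.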
