Data Sources

Data sources connect external knowledge to your agents. Index websites, documents, APIs, databases, and more, then query them with semantic search at runtime.
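As a rough illustration of what semantic search means at query time, the sketch below ranks a handful of indexed chunks by cosine similarity to a query vector. Cosine similarity is a common choice for this kind of ranking, but this page does not say which metric ClawEngine uses, so treat that as an assumption. The vectors are made-up three-dimensional toy values, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of any ClawEngine API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "index": each chunk of text is paired with an invented embedding.
indexed = [
    ("Refund policy", [0.9, 0.1, 0.0]),
    ("API rate limits", [0.1, 0.9, 0.2]),
    ("Shipping times", [0.7, 0.2, 0.1]),
]

# Hypothetical query embedding that points along the first axis.
results = top_k([1.0, 0.0, 0.0], indexed, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In a real setup the vectors would come from one of the embedding models listed further down, and the query would be embedded with the same model as the indexed chunks.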

Supported Types

| Type | Description | Auto-Refresh |
| --- | --- | --- |
| Website | Crawl and index web pages with configurable depth and patterns | |
| PDF / Document | Upload and index PDFs, Word docs, and text files | Manual |
| Git Repository | Index code, READMEs, and documentation from any Git repo | |
| CSV / JSON | Structured data with column mapping for embedding | |
| REST API | Fetch and index data from any HTTP endpoint | |
| SQL Database | Query and index rows from PostgreSQL, MySQL, SQLite, or MSSQL | |
| Vector Database | Connect existing Pinecone, Weaviate, Qdrant, Supabase, or Chroma stores | Live |
| Cloud Storage | Index files from S3, GCS, Azure Blob, or Cloudflare R2 buckets | |

Chunking Strategies

When indexing text, ClawEngine splits content into chunks before generating embeddings:

| Strategy | Description |
| --- | --- |
| Fixed Size | Split into chunks of N characters |
| Sentence | Split on sentence boundaries |
| Paragraph | Split on paragraph breaks |
| Semantic | AI-powered splitting based on topic shifts |
| Recursive | Hierarchical splitting; best for code |
| Custom | Define your own regex pattern |
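To make the differences between these strategies concrete, here is a minimal sketch of five of them in plain Python. It is not ClawEngine's implementation: the function names are hypothetical, the sentence splitter is a naive punctuation heuristic, and the recursive splitter is only one plausible reading of "hierarchical splitting" (try coarse separators first, fall back to finer ones, then to fixed-size cuts). Semantic chunking is left out because it needs a model.

```python
import re


def chunk_fixed(text: str, size: int) -> list[str]:
    """Fixed Size: split into chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_sentences(text: str) -> list[str]:
    """Sentence: split after '.', '!' or '?' followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_paragraphs(text: str) -> list[str]:
    """Paragraph: split on blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def chunk_custom(text: str, pattern: str) -> list[str]:
    """Custom: split wherever the caller's regex matches.

    The pattern should not contain capturing groups, because re.split
    would then include the captured text in the result.
    """
    return [p.strip() for p in re.split(pattern, text) if p.strip()]


def chunk_recursive(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Recursive: split on the coarsest separator present, then recurse on
    any piece that is still longer than max_len using finer separators."""
    if len(text) <= max_len:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            chunks: list[str] = []
            for piece in text.split(sep):
                chunks.extend(chunk_recursive(piece, max_len, separators[i + 1:]))
            return chunks
    # No separator left to try: fall back to fixed-size cuts.
    return chunk_fixed(text, max_len)


sample = "First point. Second point!\n\nNew paragraph here? Yes."
print(chunk_sentences(sample))
print(chunk_paragraphs(sample))
print(chunk_recursive("aaaa bbbb\n\ncccc", 5))
```

This recursive variant never merges small neighbouring pieces back together, so it can produce many short chunks. A production splitter would usually pack pieces up to the size limit.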

Embedding Models

| Model | Provider | Dimensions | Cost per 1K tokens |
| --- | --- | --- | --- |
| text-embedding-3-small | OpenAI | 1,536 | $0.00002 |
| text-embedding-3-large | OpenAI | 3,072 | $0.00013 |
| embed-english-v3 | Cohere | 1,024 | $0.0001 |
| embed-multilingual-v3 | Cohere | 1,024 | $0.0001 |
| voyage-2 | Voyage AI | 1,024 | $0.0001 |
| BGE Small / Large | Local | 384 / 1,024 | Free |
| E5 Small / Large | Local | 384 / 1,024 | Free |
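The per-1K-token prices above translate into an indexing cost with simple arithmetic: tokens divided by 1,000, times the listed rate. The sketch below does that calculation and also estimates raw vector storage from the dimension counts. The helper names are hypothetical, and the storage figure assumes each dimension is stored as a 4-byte float32 with no index overhead, which this page does not confirm.

```python
# Prices copied from the table above (USD per 1K tokens). The local
# BGE and E5 models are listed as free and are omitted here.
COST_PER_1K_TOKENS = {
    "text-embedding-3-small": 0.00002,
    "text-embedding-3-large": 0.00013,
    "embed-english-v3": 0.0001,
    "embed-multilingual-v3": 0.0001,
    "voyage-2": 0.0001,
}


def estimate_embedding_cost(model: str, total_tokens: int) -> float:
    """Estimated USD cost to embed `total_tokens` tokens with `model`."""
    return total_tokens / 1000 * COST_PER_1K_TOKENS[model]


def estimate_vector_bytes(dimensions: int, num_chunks: int) -> int:
    """Raw storage for the vectors alone, ASSUMING float32 (4 bytes per value)."""
    return dimensions * 4 * num_chunks


# Example: a corpus of one million tokens split into 2,000 chunks.
tokens, chunks = 1_000_000, 2_000
small = estimate_embedding_cost("text-embedding-3-small", tokens)
large = estimate_embedding_cost("text-embedding-3-large", tokens)
print(f"small: ${small:.2f}, large: ${large:.2f}")
print(f"small vectors: {estimate_vector_bytes(1536, chunks) / 1e6:.1f} MB")
print(f"large vectors: {estimate_vector_bytes(3072, chunks) / 1e6:.1f} MB")
```

Doubling the dimension count doubles the raw vector storage, so the larger model costs more both at indexing time and at rest.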

Index Status

Each data source tracks its indexing state: pending → indexing → indexed. If indexing fails, the status shows failed with an error message. Stale sources (whose data has changed since the last index) show stale.
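The status lifecycle can be modelled as a small state machine. The sketch below encodes the five statuses named above and the transitions the text describes. Two edges are assumptions added for illustration and are marked as such: re-indexing a stale source and retrying a failed one. `IndexStatus` and `can_transition` are hypothetical names, not ClawEngine identifiers.

```python
from enum import Enum


class IndexStatus(str, Enum):
    PENDING = "pending"
    INDEXING = "indexing"
    INDEXED = "indexed"
    FAILED = "failed"
    STALE = "stale"


# Allowed next states for each status.
ALLOWED_TRANSITIONS = {
    IndexStatus.PENDING: {IndexStatus.INDEXING},
    IndexStatus.INDEXING: {IndexStatus.INDEXED, IndexStatus.FAILED},
    IndexStatus.INDEXED: {IndexStatus.STALE},
    # ASSUMPTION: a failed source can be retried.
    IndexStatus.FAILED: {IndexStatus.INDEXING},
    # ASSUMPTION: a stale source can be re-indexed.
    IndexStatus.STALE: {IndexStatus.INDEXING},
}


def can_transition(current: IndexStatus, target: IndexStatus) -> bool:
    """True if moving from `current` to `target` is allowed by the table above."""
    return target in ALLOWED_TRANSITIONS[current]


# Walk the happy path described in the text.
path = [IndexStatus.PENDING, IndexStatus.INDEXING, IndexStatus.INDEXED]
for current, target in zip(path, path[1:]):
    print(f"{current.value} -> {target.value}: {can_transition(current, target)}")
```

Client code that polls a source would typically wait until the status reaches indexed or failed before querying or reporting an error.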
