Data Sources
Data sources connect external knowledge to your agents. Index websites, documents, APIs, databases, and more, then query them with semantic search at runtime.
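To make "semantic search" concrete, here is a minimal sketch of ranking indexed chunks by cosine similarity to a query vector. It is an illustration only, not ClawEngine's implementation: the `search` function, the toy three-dimensional vectors, and the sample texts are all hypothetical, and a real index would use vectors produced by an embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec: list[float], index: list[tuple[str, list[float]]], top_k: int = 2) -> list[str]:
    """Return the texts of the top_k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical index: (chunk text, hand-made toy embedding).
index = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 days", [0.1, 0.9, 0.0]),
    ("API rate limits", [0.0, 0.1, 0.9]),
]

# A query vector that points mostly in the "refund" direction.
results = search([0.8, 0.2, 0.0], index)
print(results)
```

The query and the chunks must be embedded with the same model, since vectors from different models are not comparable.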
Supported Types
Website | Crawl and index web pages with configurable depth and patterns | ✅
PDF / Document | Upload and index PDFs, Word docs, and text files | Manual
Git Repository | Index code, READMEs, and documentation from any Git repo | ✅
CSV / JSON | Structured data with column mapping for embedding | ✅
REST API | Fetch and index data from any HTTP endpoint | ✅
SQL Database | Query and index rows from PostgreSQL, MySQL, SQLite, or MSSQL | ✅
Vector Database | Connect existing Pinecone, Weaviate, Qdrant, Supabase, or Chroma stores | Live
Cloud Storage | Index files from S3, GCS, Azure Blob, or Cloudflare R2 buckets | ✅
Chunking Strategies
When indexing text, ClawEngine splits content into chunks before generating embeddings:
Fixed Size | Split into chunks of N characters
Sentence | Split on sentence boundaries
Paragraph | Split on paragraph breaks
Semantic | AI-powered splitting based on topic shifts
Recursive | Hierarchical splitting, best for code
Custom | Define your own regex pattern
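The simpler strategies can be sketched in a few lines of Python. This is a rough illustration of the ideas, not ClawEngine's actual splitter: the function names are hypothetical, and the sentence rule is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re

def chunk_fixed(text: str, size: int) -> list[str]:
    """Fixed Size: consecutive slices of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_regex(text: str, pattern: str) -> list[str]:
    """Custom: split wherever `pattern` matches, dropping empty pieces."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

def chunk_sentences(text: str) -> list[str]:
    """Sentence: naive split on whitespace that follows '.', '!' or '?'."""
    return chunk_regex(text, r"(?<=[.!?])\s+")

def chunk_paragraphs(text: str) -> list[str]:
    """Paragraph: split on blank lines."""
    return chunk_regex(text, r"\n\s*\n")

sample = "Agents need context. Data sources provide it!\n\nChunks are embedded separately."

print(chunk_fixed("abcdefghij", 4))
print(chunk_sentences(sample))
print(chunk_paragraphs(sample))
```

Sentence and paragraph splitting here are just special cases of the regex splitter, which is why a custom pattern can cover formats the built-in strategies do not.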
Embedding Models
text-embedding-3-small | OpenAI | 1,536 dimensions | $0.00002
text-embedding-3-large | OpenAI | 3,072 dimensions | $0.00013
embed-english-v3 | Cohere | 1,024 dimensions | $0.0001
embed-multilingual-v3 | Cohere | 1,024 dimensions | $0.0001
voyage-2 | Voyage AI | 1,024 dimensions | $0.0001
BGE Small / Large | Local | 384 / 1,024 dimensions | Free
E5 Small / Large | Local | 384 / 1,024 dimensions | Free
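Vector dimensions drive storage needs, so it can help to estimate index size before choosing a model. The sketch below uses the dimensions listed above and assumes each vector is stored as uncompressed 32-bit floats with no metadata or index overhead; that storage format is an assumption, and `index_size_bytes` is a hypothetical helper, not part of any API.

```python
# Dimensions copied from the model list above (hosted models only).
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-english-v3": 1024,
    "embed-multilingual-v3": 1024,
    "voyage-2": 1024,
}

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as raw float32 values

def index_size_bytes(model: str, num_chunks: int) -> int:
    """Raw vector storage for `num_chunks` embeddings, excluding any overhead."""
    return DIMENSIONS[model] * BYTES_PER_FLOAT32 * num_chunks

for model in DIMENSIONS:
    megabytes = index_size_bytes(model, 10_000) / 1_000_000
    print(f"{model}: {megabytes:.2f} MB for 10,000 chunks")
```

Under these assumptions, doubling the dimensions doubles the raw storage, so text-embedding-3-large needs twice the space of text-embedding-3-small for the same number of chunks.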
Index Status
Each data source tracks its indexing state: pending → indexing → indexed. If indexing fails, the status shows failed with an error message. Stale sources (data changed since last index) show stale.
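One way to reason about these states is as a small state machine. The sketch below is a hedged illustration: the forward transitions come from the description above, while re-indexing from stale or failed is an assumption, and `IndexStatus` and `advance` are hypothetical names rather than ClawEngine identifiers.

```python
from enum import Enum

class IndexStatus(Enum):
    PENDING = "pending"
    INDEXING = "indexing"
    INDEXED = "indexed"
    FAILED = "failed"
    STALE = "stale"

# Allowed transitions between states.
ALLOWED = {
    IndexStatus.PENDING: {IndexStatus.INDEXING},
    IndexStatus.INDEXING: {IndexStatus.INDEXED, IndexStatus.FAILED},
    IndexStatus.INDEXED: {IndexStatus.STALE},
    IndexStatus.STALE: {IndexStatus.INDEXING},   # assumption: stale sources can be re-indexed
    IndexStatus.FAILED: {IndexStatus.INDEXING},  # assumption: failed sources can be retried
}

def advance(current: IndexStatus, target: IndexStatus) -> IndexStatus:
    """Return `target` if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target

status = IndexStatus.PENDING
status = advance(status, IndexStatus.INDEXING)
status = advance(status, IndexStatus.INDEXED)
print(status.value)
```

Modeling the states explicitly makes it easy for client code to reject impossible updates, such as jumping from pending straight to indexed.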