monoai.rag
RAG is a module that provides a high-level interface for performing semantic search queries against a vector database. It supports multiple vector database backends and embedding providers for flexible deployment scenarios.
1""" 2RAG is a module that provides a high-level interface for performing semantic search queries 3against a vector database. It supports multiple vector database backends and 4embedding providers for flexible deployment scenarios. 5""" 6 7from .rag import RAG 8from .vectordb import ChromaVectorDB 9from .documents_builder import DocumentsBuilder 10 11__all__ = ['RAG', 'ChromaVectorDB', 'DocumentsBuilder']
RAG
Retrieval-Augmented Generation (RAG) system for semantic search and document retrieval.
This class provides a high-level interface for performing semantic search queries against a vector database. It supports multiple vector database backends and embedding providers for flexible deployment scenarios.
The RAG system works by:
- Converting text queries into vector embeddings
- Searching the vector database for similar document embeddings
- Returning the most relevant documents based on semantic similarity
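The ranking in the last step reduces to comparing embedding vectors. A minimal sketch of that idea, independent of any particular backend (the toy vectors and the numpy usage are illustrative, not part of monoai):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output
query_vec = np.array([0.9, 0.1, 0.0, 0.2])
doc_vecs = {
    "doc_a": np.array([0.8, 0.2, 0.1, 0.1]),
    "doc_b": np.array([0.0, 0.9, 0.7, 0.0]),
}

# Rank documents by similarity to the query, most similar first
ranked = sorted(doc_vecs.items(),
                key=lambda kv: cosine_similarity(query_vec, kv[1]),
                reverse=True)
print(ranked[0][0])  # most relevant document id
```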
Attributes:
- _vectorizer (str): The embedding model used for vectorization
- _db (str): Name of the vector database
- _vector_db (ChromaVectorDB): The vector database backend
Examples:
Basic usage with default settings:
```python
# Initialize RAG with a database name
rag = RAG(database="my_documents")

# Perform a semantic search
results = rag.query("What is machine learning?", k=5)
```
Using a specific embedding provider:
```python
# Initialize with OpenAI embeddings
rag = RAG(
    database="my_documents",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)

# Search for relevant documents
results = rag.query("Explain neural networks", k=10)
```
Working with different vector databases:
```python
# Currently supports ChromaDB
rag = RAG(
    database="my_collection",
    vector_db="chroma",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)
```
Attach a RAG instance to a model so the model can use it automatically when answering questions:

```python
model = Model(provider="openai", model="gpt-4o-mini")
model._add_rag(RAG(database="my_documents", vector_db="chroma"))
```
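Putting the pieces together, a typical index-then-query flow looks roughly like this. It is a sketch that assumes a local notes.txt file and a configured OpenAI API key; it combines DocumentsBuilder and ChromaVectorDB (both documented below) with RAG:

```python
from monoai.rag import RAG, ChromaVectorDB, DocumentsBuilder

# 1. Turn a source file into chunks, metadata and ids
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=200, chunk_overlap=20)
documents, metadatas, ids = builder.from_file("notes.txt")

# 2. Store the chunks in a ChromaDB collection
db = ChromaVectorDB(
    name="my_documents",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-ada-002",
)
db.add(documents, metadatas, ids)

# 3. Query the same collection through the RAG interface
rag = RAG(database="my_documents", provider="openai",
          vectorizer="text-embedding-ada-002")
results = rag.query("What are the main topics in my notes?", k=5)
print(results["documents"][0])
```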
```python
def __init__(self,
             database: str,
             provider: Optional[str] = None,
             vectorizer: Optional[str] = None,
             vector_db: str = "chroma"):
    """Initialize the RAG system."""
    if provider:
        load_key(provider)

    self._vectorizer = vectorizer
    self._db = database

    if vector_db == "chroma":
        self._vector_db = ChromaVectorDB(
            name=database,
            vectorizer_provider=provider,
            vectorizer_model=vectorizer
        )
    else:
        raise ValueError(f"Vector database '{vector_db}' not supported. Currently only 'chroma' is supported.")
```
Initialize the RAG system.
Parameters:
- database (str): Name of the vector database/collection to use for storage and retrieval. It will be created if it doesn't exist.
- provider (str, optional): The embedding provider to use (e.g., "openai", "anthropic", "cohere"). If provided, the corresponding API key will be loaded automatically. If None, the system will use default embedding settings.
- vectorizer (str, optional): The specific embedding model to use for vectorization, e.g. "text-embedding-ada-002", "text-embedding-3-small", "embed-english-v3.0". If None, the provider's default model will be used.
- vector_db (str, default="chroma"): The vector database backend to use. Currently only "chroma" (ChromaDB) is supported; it is the default and recommended for most use cases.
Raises:
- ValueError: If an unsupported vector database is specified.
Examples:
```python
# Minimal initialization
rag = RAG("my_documents")

# With OpenAI embeddings
rag = RAG(
    database="research_papers",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)

# With Cohere embeddings
rag = RAG(
    database="articles",
    provider="cohere",
    vectorizer="embed-english-v3.0"
)
```
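When provider is passed, the constructor loads the matching API key via the internal load_key helper. A minimal sketch, assuming keys are supplied through environment variables with the conventional provider names (the exact lookup performed by load_key may differ in your setup):

```python
import os

# Make the provider key available before constructing the RAG instance
os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder, use your real key

rag = RAG(
    database="research_papers",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)
```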
```python
def query(self, query: str, k: int = 10) -> Dict[str, Any]:
    """Perform a semantic search query against the vector database."""
    return self._vector_db.query(query, k)
```
Perform a semantic search query against the vector database.
This method converts the input query into a vector embedding and searches the database for the most semantically similar documents.
Parameters:
- query (str): The text query to search for. It will be converted to a vector embedding and used to find similar documents.
- k (int, default=10): The number of most relevant documents to return. Higher values return more results but may include less relevant documents.
Returns:
Dict[str, Any]: A dictionary containing the search results with the following structure:
- 'ids' (List[List[str]]): Document IDs of the retrieved documents
- 'documents' (List[List[str]]): The actual document content
- 'metadatas' (List[List[Dict]]): Metadata for each document
- 'distances' (List[List[float]]): Distance scores (lower = more similar)
Examples:
```python
# Basic query
results = rag.query("What is artificial intelligence?")

# Query with more results
results = rag.query("Machine learning algorithms", k=20)

# Accessing results
for i, (doc_id, document, metadata, distance) in enumerate(zip(
    results['ids'][0],
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
)):
    print(f"Result {i+1}:")
    print(f"  ID: {doc_id}")
    print(f"  Content: {document[:100]}...")
    print(f"  Similarity: {1 - distance:.3f}")
    print(f"  Metadata: {metadata}")
    print()
```
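Because distances come back alongside the documents, a common follow-up is to drop weak matches before using them. A small sketch; the 0.6 cutoff is illustrative and depends on the collection's distance metric and your data:

```python
results = rag.query("Machine learning algorithms", k=20)

MAX_DISTANCE = 0.6  # illustrative cutoff; tune for your data and metric
relevant = [
    (doc, meta)
    for doc, meta, dist in zip(results['documents'][0],
                               results['metadatas'][0],
                               results['distances'][0])
    if dist <= MAX_DISTANCE
]
print(f"{len(relevant)} of {len(results['documents'][0])} results kept")
```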
Notes:
- The query is embedded automatically with the same model used for the stored documents before the search is performed
- Results are returned in order of relevance (most similar first)
- Distance scores depend on the collection's distance metric (lower = more similar); with cosine distance, 0 means identical and 2 means completely opposite
- If fewer than k documents exist in the database, all available documents are returned
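The retrieved chunks are typically pasted into a prompt as context for a model. A minimal sketch of that step; the question and prompt wording are illustrative, and how the prompt is then sent to a model is left open, since the Model call interface is not documented here:

```python
results = rag.query("What is machine learning?", k=3)

# Join the top chunks into a single context block for the prompt
context = "\n\n".join(results["documents"][0])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What is machine learning?"
)
# `prompt` can now be sent to whichever chat/completion interface you use
```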
ChromaVectorDB
ChromaDB implementation of the vector database interface.
This class provides a concrete implementation of the vector database using ChromaDB as the backend. ChromaDB is an open-source embedding database that supports persistent storage and efficient similarity search.
Features:
- Persistent storage of document embeddings
- Efficient similarity search with configurable result count
- Metadata storage for each document
- Automatic collection creation if it doesn't exist
- Support for custom embedding models via LiteLLM
Attributes:
- _client (chromadb.PersistentClient): ChromaDB client instance
- _collection (chromadb.Collection): Active collection for operations
Examples:
Basic usage:
```python
# Initialize with a new collection
vector_db = ChromaVectorDB(name="my_documents")

# Add documents
documents = ["Document 1 content", "Document 2 content"]
metadatas = [{"source": "file1.txt"}, {"source": "file2.txt"}]
ids = ["doc1", "doc2"]

vector_db.add(documents, metadatas, ids)

# Search for similar documents
results = vector_db.query("search query", k=5)
```
Using a specific embedding model:
```python
# Initialize with OpenAI embeddings
vector_db = ChromaVectorDB(
    name="research_papers",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-ada-002"
)
```
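Because the backend is ChromaDB's persistent client, everything added through this class survives process restarts. The data lands in ChromaDB's default persistence directory (typically ./chroma in the working directory; treat the exact path as an assumption). A quick way to inspect what has been persisted, using chromadb directly:

```python
import chromadb

client = chromadb.PersistentClient()   # same default on-disk location the wrapper uses
print(client.list_collections())       # collections persisted under the default directory
print(client.get_collection("my_documents").count())  # number of stored documents
```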
```python
def __init__(self, name: Optional[str] = None,
             vectorizer_provider: Optional[str] = None,
             vectorizer_model: Optional[str] = None):
    """Initialize the ChromaDB vector database."""
    super().__init__(name, vectorizer_provider, vectorizer_model)
    self._client = chromadb.PersistentClient()
    if name:
        try:
            self._collection = self._client.get_collection(name)
        except chromadb.errors.NotFoundError:
            self._collection = self._client.create_collection(name)
```
Initialize the ChromaDB vector database.
Parameters:
- name (str, optional): Name of the ChromaDB collection. If provided, the collection will be created if it doesn't exist, or connected to if it does.
- vectorizer_provider (str, optional): The embedding provider to use for vectorization, e.g. "openai", "anthropic", "cohere".
- vectorizer_model (str, optional): The specific embedding model to use, e.g. "text-embedding-ada-002", "text-embedding-3-small".
Examples:
```python
# Create new collection
vector_db = ChromaVectorDB("my_documents")

# Connect to existing collection with custom embeddings
vector_db = ChromaVectorDB(
    name="existing_collection",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-ada-002"
)
```
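Since the constructor falls back from get_collection to create_collection, reusing a collection name is safe: the first run creates it and later runs reconnect to the stored data. A small sketch of that round trip (collection name and content are illustrative):

```python
# First use: collection does not exist yet, so it is created
db = ChromaVectorDB(name="project_notes")
db.add(["Initial note about the project."], [{"source": "kickoff"}], ["note_001"])

# Later use (even in another process): the same name reconnects to existing data
db = ChromaVectorDB(name="project_notes")
hits = db.query("project kickoff", k=1)
print(hits["ids"][0])  # ['note_001']
```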
```python
def add(self, documents: List[str], metadatas: List[Dict], ids: List[str]) -> None:
    """Add documents to the ChromaDB collection."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas, and ids must have the same length")

    self._collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
```
Add documents to the ChromaDB collection.
This method adds documents along with their metadata and IDs to the ChromaDB collection. The documents are automatically converted to embeddings using the configured embedding model.
Parameters:
- documents (List[str]): List of text documents to add to the database. Each document will be converted to a vector embedding.
- metadatas (List[Dict]): List of metadata dictionaries for each document. Each metadata dict can contain any key-value pairs for document categorization and filtering.
- ids (List[str]): List of unique identifiers for each document. IDs must be unique within the collection.
Raises:
- ValueError: If the lengths of documents, metadatas, and ids don't match.
Examples:
```python
# Add documents with metadata
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers."
]

metadatas = [
    {"topic": "machine_learning", "source": "textbook", "year": 2023},
    {"topic": "deep_learning", "source": "research_paper", "year": 2023}
]

ids = ["doc_001", "doc_002"]

vector_db.add(documents, metadatas, ids)
```
Notes:
- All three lists must have the same length
- IDs must be unique within the collection
- Documents are automatically embedded using the configured model
- Metadata can be used for filtering during queries
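Two practical follow-ups to these notes. Unique IDs are easy to generate with uuid (the same approach DocumentsBuilder uses), and metadata filtering is available on the underlying ChromaDB collection through its where argument, which this wrapper's query does not expose; the filtered call below therefore goes through _collection and is a sketch, not part of the documented interface:

```python
import uuid

documents = ["First note.", "Second note."]
metadatas = [{"topic": "notes"}, {"topic": "notes"}]
ids = [str(uuid.uuid4()) for _ in documents]   # guaranteed-unique IDs
vector_db.add(documents, metadatas, ids)

# Metadata-filtered search using ChromaDB's native `where` clause
filtered = vector_db._collection.query(
    query_texts=["note"],
    n_results=5,
    where={"topic": "notes"}
)
```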
```python
def query(self, query: str, k: int = 10) -> Dict[str, Any]:
    """Search for similar documents in the ChromaDB collection."""
    results = self._collection.query(
        query_texts=query,
        n_results=k
    )
    return results
```
Search for similar documents in the ChromaDB collection.
This method performs semantic search by converting the query to an embedding and finding the most similar document embeddings in the collection.
Parameters:
- query (str): The text query to search for. It will be converted to a vector embedding and compared against stored documents.
- k (int, default=10): Number of most similar documents to return. Higher values return more results but may include less relevant documents.
Returns:
Dict[str, Any]: A dictionary containing search results with the following structure:
- 'ids' (List[List[str]]): Document IDs of retrieved documents
- 'documents' (List[List[str]]): The actual document content
- 'metadatas' (List[List[Dict]]): Metadata for each document
- 'distances' (List[List[float]]): Distance scores (lower = more similar)
Examples:
```python
# Basic search
results = vector_db.query("What is machine learning?", k=5)

# Access results
for i, (doc_id, document, metadata, distance) in enumerate(zip(
    results['ids'][0],
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
)):
    print(f"Result {i+1}:")
    print(f"  ID: {doc_id}")
    print(f"  Content: {document[:100]}...")
    print(f"  Similarity: {1 - distance:.3f}")
    print(f"  Metadata: {metadata}")
```
Notes:
- Results are returned in order of similarity (most similar first)
- Distance scores depend on the collection's distance metric (lower = more similar); with cosine distance, 0 means identical and 2 means completely opposite
- If fewer than k documents exist, all available documents are returned
- The query is automatically embedded using the same model as stored documents
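One caveat related to the distance notes: a ChromaDB collection created with defaults uses squared-L2 distance, not cosine. If cosine distances are wanted, the collection can be created up front with ChromaDB's hnsw:space setting and then reused by this wrapper by name; a sketch using the chromadb client directly (the wrapper itself always creates missing collections with defaults):

```python
import chromadb

client = chromadb.PersistentClient()
client.create_collection(
    name="cosine_documents",
    metadata={"hnsw:space": "cosine"}   # cosine distance instead of the default l2
)

# The wrapper will now reconnect to the pre-created cosine collection
vector_db = ChromaVectorDB(name="cosine_documents")
```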
DocumentsBuilder

A utility class for building document collections from various sources. It extracts text content from files, in-memory strings, web pages, Word documents and PDFs, splits the content into manageable chunks with configurable size and overlap, and prepares the data for storage in vector databases.

The DocumentsBuilder is designed to work seamlessly with the RAG system, producing output that can be used directly with vector database operations.

Features:
- File-based document extraction with UTF-8 encoding support
- Text string processing for in-memory content
- Web scraping with multiple engine options (requests, tavily, selenium)
- Word document extraction (.doc and .docx formats)
- PDF document extraction with metadata
- Multiple chunking strategies (word, sentence, paragraph, fixed, semantic, custom)
- Configurable chunk size and overlap parameters
- Rich metadata generation for each document chunk
- Unique ID generation for database storage

Attributes:
- _chunk_strategy (str): The chunking strategy to use
- _chunk_size (int): Maximum size of each chunk, interpreted per strategy
- _chunk_overlap (int): Overlap between consecutive chunks, interpreted per strategy

Examples:

```python
# Default settings (word-based chunking)
builder = DocumentsBuilder()

# Other chunking strategies
builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)   # 5 sentences per chunk
builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)  # 3 paragraphs per chunk
builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)    # 800 characters per chunk
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)      # 50 words per chunk

# Process a text file and add the result to a vector database
documents, metadatas, ids = builder.from_file("document.txt")
vector_db.add(documents, metadatas, ids)

# Process text strings directly
documents, metadatas, ids = builder.from_str("This is a long text that needs to be processed...",
                                              source_name="user_input")

# Web scraping with different engines
documents, metadatas, ids = builder.from_url("https://example.com")                             # requests
documents, metadatas, ids = builder.from_url("https://example.com", engine="tavily", deep=True)
documents, metadatas, ids = builder.from_url("https://spa-example.com", engine="selenium")

# Word documents (.doc and .docx)
documents, metadatas, ids = builder.from_doc("document.docx")
documents, metadatas, ids = builder.from_doc("document.doc", extraction_method="docx2txt")

# PDF documents, optionally restricted to a page range
documents, metadatas, ids = builder.from_pdf("document.pdf")
documents, metadatas, ids = builder.from_pdf("document.pdf", page_range=(1, 10))  # pages 1-10
```

Notes:
- chunk_overlap should typically be 10-20% of chunk_size
- chunk_overlap must be less than chunk_size to prevent infinite loops
- Different strategies interpret chunk_size differently: word = words per chunk, sentence = sentences per chunk, paragraph = paragraphs per chunk, fixed and semantic = characters per chunk; with a custom function, chunk_size is passed through to it
- Very small chunks may lose context; very large chunks may be less focused for retrieval
- Fixed and semantic strategies always produce chunks of exactly chunk_size characters (except the last one)
- Providing custom_split_func automatically sets the strategy to "custom"; the function receives (text, chunk_size, chunk_overlap), must return a list of chunks, and handles its own overlap logic
- Word document processing requires the python-docx and python-docx2txt packages; PDF processing requires PyPDF2

```python
import os
import uuid
from typing import Dict, List, Optional, Tuple

# Document (python-docx), docx2txt, PyPDF2, WebScraping and the availability flags
# DOCX_AVAILABLE / DOCX2TXT_AVAILABLE / PDF_AVAILABLE are defined at module level
# and are not part of this excerpt.


class DocumentsBuilder:
    """Build (documents, metadatas, ids) triples from files, strings, URLs, Word documents and PDFs."""

    def __init__(self, chunk_strategy: str = "word", chunk_size: int = 1000,
                 chunk_overlap: int = 0, custom_split_func: Optional[callable] = None):
        """
        Initialize the DocumentsBuilder with chunking parameters.

        chunk_strategy: "word", "sentence", "paragraph", "fixed", "semantic" or "custom".
        chunk_size / chunk_overlap: chunk size and overlap, interpreted in the unit of the
            chosen strategy (words, sentences, paragraphs or characters).
        custom_split_func: optional callable (text, chunk_size, chunk_overlap) -> List[str];
            providing it automatically switches the strategy to "custom".
        """
        # If custom_split_func is provided, automatically set strategy to "custom"
        if custom_split_func is not None:
            chunk_strategy = "custom"

        self._chunk_strategy = chunk_strategy
        self._chunk_size = chunk_size
        self._chunk_overlap = chunk_overlap
        self._custom_split_func = custom_split_func

        # Validate parameters to prevent infinite loops
        if chunk_overlap >= chunk_size:
            raise ValueError(
                f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size}) "
                "to prevent infinite loops. Recommended: chunk_overlap should be 10-20% of chunk_size."
            )

        if chunk_size <= 0:
            raise ValueError(f"chunk_size must be positive, got {chunk_size}")

        if chunk_overlap < 0:
            raise ValueError(f"chunk_overlap must be non-negative, got {chunk_overlap}")

        # Validate custom split function
        if chunk_strategy == "custom" and custom_split_func is None:
            raise ValueError("custom_split_func must be provided when chunk_strategy='custom'")

        if custom_split_func is not None and not callable(custom_split_func):
            raise ValueError("custom_split_func must be callable")

    def from_file(self, file_path: str) -> Tuple[List[str], List[Dict], List[str]]:
        """Read a UTF-8 text file and split it into chunks with metadata and unique IDs."""
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        chunks = self._split_text(text)

        documents, metadatas, ids = [], [], []
        for i, chunk in enumerate(chunks):
            metadata = {
                'file_path': file_path,
                'file_name': os.path.basename(file_path),
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk)
            }
            documents.append(chunk)
            metadatas.append(metadata)
            ids.append(str(uuid.uuid4()))

        return documents, metadatas, ids

    def from_str(self, text: str, source_name: str = "text_string") -> Tuple[List[str], List[Dict], List[str]]:
        """Split an in-memory text string into chunks with metadata and unique IDs."""
        if not text or not text.strip():
            return [], [], []

        chunks = self._split_text(text)

        documents, metadatas, ids = [], [], []
        for i, chunk in enumerate(chunks):
            metadata = {
                'source_type': 'text_string',
                'source_name': source_name,
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk),
                'chunk_strategy': self._chunk_strategy
            }
            documents.append(chunk)
            metadatas.append(metadata)
            ids.append(str(uuid.uuid4()))

        return documents, metadatas, ids

    def from_doc(self, file_path: str, extraction_method: str = "auto") -> Tuple[List[str], List[Dict], List[str]]:
        """Extract text from a Word document (.doc/.docx); extraction_method is "auto", "docx" or "docx2txt"."""
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_extension = os.path.splitext(file_path)[1].lower()
        if file_extension not in ['.doc', '.docx']:
            raise ValueError(f"Unsupported file format: {file_extension}. Only .doc and .docx files are supported.")

        # Pick the extraction backend
        if extraction_method == "auto":
            extraction_method = "docx" if (file_extension == '.docx' and DOCX_AVAILABLE) else "docx2txt"

        if extraction_method == "docx":
            if not DOCX_AVAILABLE:
                raise ImportError("python-docx library is required for 'docx' extraction method. Install with: pip install python-docx")
            if file_extension != '.docx':
                raise ValueError("'docx' extraction method only supports .docx files")
            text, doc_properties = self._extract_with_docx(file_path)
        elif extraction_method == "docx2txt":
            if not DOCX2TXT_AVAILABLE:
                raise ImportError("docx2txt library is required for 'docx2txt' extraction method. Install with: pip install python-docx2txt")
            text, doc_properties = self._extract_with_docx2txt(file_path)
        else:
            raise ValueError(f"Unsupported extraction method: {extraction_method}")

        chunks = self._split_text(text)

        documents, metadatas, ids = [], [], []
        for i, chunk in enumerate(chunks):
            metadata = {
                'file_path': file_path,
                'file_name': os.path.basename(file_path),
                'document_format': file_extension[1:],  # "doc" or "docx"
                'extraction_method': extraction_method,
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk)
            }
            # Add document properties if available
            if doc_properties:
                metadata.update(doc_properties)
            documents.append(chunk)
            metadatas.append(metadata)
            ids.append(str(uuid.uuid4()))

        return documents, metadatas, ids

    def from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[List[str], List[Dict], List[str]]:
        """Extract text from a PDF (optionally a 1-indexed, inclusive page_range) and split it into chunks."""
        if not PDF_AVAILABLE:
            raise ImportError("PyPDF2 library is required for PDF processing. Install with: pip install PyPDF2")

        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_extension = os.path.splitext(file_path)[1].lower()
        if file_extension != '.pdf':
            raise ValueError(f"Unsupported file format: {file_extension}. Only .pdf files are supported.")

        # Extract text, PDF properties and page information, then chunk
        text, pdf_properties, page_info = self._extract_from_pdf(file_path, page_range)
        chunks = self._split_text(text)

        documents, metadatas, ids = [], [], []
        for i, chunk in enumerate(chunks):
            metadata = {
                'file_path': file_path,
                'file_name': os.path.basename(file_path),
                'document_format': 'pdf',
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk)
            }
            if pdf_properties:
                metadata.update(pdf_properties)
            if page_info:
                metadata.update(page_info)
            documents.append(chunk)
            metadatas.append(metadata)
            ids.append(str(uuid.uuid4()))

        return documents, metadatas, ids

    def from_url(self, url: str, engine: str = "requests", deep: bool = False) -> Tuple[List[str], List[Dict], List[str]]:
        """Scrape a URL (engine: "requests", "tavily" or "selenium") and split the extracted text into chunks."""
        scraper = WebScraping(engine=engine, deep=deep)
        result = scraper.scrape(url)

        if not result or not result.get("text"):
            raise ValueError(f"Failed to extract text content from URL: {url}")

        chunks = self._split_text(result["text"])

        documents, metadatas, ids = [], [], []
        for i, chunk in enumerate(chunks):
            metadata = {
                'url': url,
                'source_type': 'web_page',
                'scraping_engine': engine,
                'deep_extraction': deep,
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk)
            }
            documents.append(chunk)
            metadatas.append(metadata)
            ids.append(str(uuid.uuid4()))

        return documents, metadatas, ids

    def _extract_with_docx(self, file_path: str) -> Tuple[str, Dict]:
        """Extract text and core document properties from a .docx file using python-docx."""
        doc = Document(file_path)

        # Paragraph text
        text_parts = [p.text for p in doc.paragraphs if p.text.strip()]

        # Table text, one row per line with " | " between cells
        for table in doc.tables:
            for row in table.rows:
                row_text = [cell.text.strip() for cell in row.cells if cell.text.strip()]
                if row_text:
                    text_parts.append(" | ".join(row_text))

        text = "\n\n".join(text_parts)

        # Core document properties, when present
        properties = {}
        core_props = doc.core_properties
        if core_props.title:
            properties['document_title'] = core_props.title
        if core_props.author:
            properties['document_author'] = core_props.author
        if core_props.subject:
            properties['document_subject'] = core_props.subject
        if core_props.created:
            properties['document_created'] = str(core_props.created)
        if core_props.modified:
            properties['document_modified'] = str(core_props.modified)

        return text, properties

    def _extract_with_docx2txt(self, file_path: str) -> Tuple[str, Dict]:
        """Extract text from a .doc/.docx file using docx2txt (no document properties available)."""
        text = docx2txt.process(file_path)
        return text, {}

    def _extract_from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[str, Dict, Dict]:
        """Extract text, PDF properties and page information from a PDF file using PyPDF2."""
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            total_pages = len(pdf_reader.pages)

            # Determine the (1-indexed, inclusive) page range
            if page_range is None:
                start_page, end_page = 1, total_pages
            else:
                start_page, end_page = page_range
                if start_page < 1 or end_page > total_pages or start_page > end_page:
                    raise ValueError(f"Invalid page range: {page_range}. Pages must be between 1 and {total_pages}")

            # Extract text from the selected pages
            text_parts = []
            for page_num in range(start_page - 1, end_page):  # convert to 0-indexed
                page_text = pdf_reader.pages[page_num].extract_text()
                if page_text.strip():
                    text_parts.append(page_text)
            text = "\n\n".join(text_parts)

            # PDF document properties, when present
            properties = {}
            if pdf_reader.metadata:
                metadata = pdf_reader.metadata
                if '/Title' in metadata:
                    properties['pdf_title'] = metadata['/Title']
                if '/Author' in metadata:
                    properties['pdf_author'] = metadata['/Author']
                if '/Subject' in metadata:
                    properties['pdf_subject'] = metadata['/Subject']
                if '/Creator' in metadata:
                    properties['pdf_creator'] = metadata['/Creator']
                if '/Producer' in metadata:
                    properties['pdf_producer'] = metadata['/Producer']
                if '/CreationDate' in metadata:
                    properties['pdf_creation_date'] = str(metadata['/CreationDate'])
                if '/ModDate' in metadata:
                    properties['pdf_modification_date'] = str(metadata['/ModDate'])

            page_info = {
                'total_pages': total_pages,
                'extracted_pages_start': start_page,
                'extracted_pages_end': end_page,
                'extracted_pages_count': end_page - start_page + 1
            }

            return text, properties, page_info

    def _split_text(self, text: str) -> List[str]:
        """Split text into chunks using the configured strategy (word, sentence, paragraph,
        fixed, semantic or custom); empty chunks are filtered out."""
        ...
```
1020 1021 Examples: 1022 --------- 1023 ```python 1024 # Internal usage (called by from_file, from_doc, from_pdf, and from_url) 1025 chunks = builder._split_text("This is a long text that needs to be split...") 1026 print(f"Created {len(chunks)} chunks using {builder._chunk_strategy} strategy") 1027 ``` 1028 1029 Notes: 1030 ------ 1031 - Chunks are stripped of leading/trailing whitespace 1032 - Empty chunks are automatically filtered out 1033 - Different strategies have different characteristics: 1034 * word: Preserves word boundaries, good for general use 1035 * sentence: Preserves sentence context, good for Q&A 1036 * paragraph: Preserves paragraph context, good for document structure 1037 * fixed: Exact size control, may break words/sentences 1038 * semantic: Attempts to preserve semantic meaning 1039 """ 1040 if len(text) <= self._chunk_size: 1041 return [text] 1042 1043 if self._chunk_strategy == "word": 1044 return self._split_by_words(text) 1045 elif self._chunk_strategy == "sentence": 1046 return self._split_by_sentences(text) 1047 elif self._chunk_strategy == "paragraph": 1048 return self._split_by_paragraphs(text) 1049 elif self._chunk_strategy == "fixed": 1050 return self._split_fixed(text) 1051 elif self._chunk_strategy == "semantic": 1052 return self._split_semantic(text) 1053 elif self._chunk_strategy == "custom": 1054 return self._custom_split_func(text, self._chunk_size, self._chunk_overlap) 1055 else: 1056 raise ValueError(f"Unsupported chunk strategy: {self._chunk_strategy}") 1057 1058 def _split_by_words(self, text: str) -> List[str]: 1059 """ 1060 Split text by word boundaries while respecting word count. 1061 1062 This strategy splits text into chunks based on the number of words, 1063 trying to break at word boundaries when possible. 1064 """ 1065 # Split text into words 1066 words = text.split() 1067 1068 if len(words) <= self._chunk_size: 1069 return [text] 1070 1071 chunks = [] 1072 start_word = 0 1073 1074 while start_word < len(words): 1075 # Calculate end word position for current chunk 1076 end_word = start_word + self._chunk_size 1077 1078 # Extract words for this chunk 1079 chunk_words = words[start_word:end_word] 1080 chunk = ' '.join(chunk_words) 1081 1082 if chunk.strip(): # Only add non-empty chunks 1083 chunks.append(chunk) 1084 1085 # Calculate next start position with overlap 1086 new_start_word = end_word - self._chunk_overlap 1087 1088 # Ensure we always advance to prevent infinite loops 1089 if new_start_word <= start_word: 1090 new_start_word = start_word + 1 1091 1092 start_word = new_start_word 1093 1094 # Safety check to prevent infinite loops 1095 if start_word >= len(words): 1096 break 1097 1098 return chunks 1099 1100 def _split_by_sentences(self, text: str) -> List[str]: 1101 """ 1102 Split text by sentence boundaries while respecting sentence count. 1103 1104 This strategy splits text into chunks based on the number of sentences, 1105 preserving sentence integrity. 
1106 """ 1107 # Define sentence endings 1108 sentence_endings = ['.', '!', '?', '\n\n'] 1109 1110 # Split text into sentences 1111 sentences = [] 1112 last_pos = 0 1113 1114 for i, char in enumerate(text): 1115 if char in sentence_endings: 1116 sentence = text[last_pos:i+1].strip() 1117 if sentence: 1118 sentences.append(sentence) 1119 last_pos = i + 1 1120 1121 # Add the last sentence if it doesn't end with punctuation 1122 if last_pos < len(text): 1123 last_sentence = text[last_pos:].strip() 1124 if last_sentence: 1125 sentences.append(last_sentence) 1126 1127 if len(sentences) <= self._chunk_size: 1128 return [text] 1129 1130 chunks = [] 1131 start_sentence = 0 1132 1133 while start_sentence < len(sentences): 1134 # Calculate end sentence position for current chunk 1135 end_sentence = start_sentence + self._chunk_size 1136 1137 # Extract sentences for this chunk 1138 chunk_sentences = sentences[start_sentence:end_sentence] 1139 chunk = ' '.join(chunk_sentences) 1140 1141 if chunk.strip(): # Only add non-empty chunks 1142 chunks.append(chunk) 1143 1144 # Calculate next start position with overlap 1145 new_start_sentence = end_sentence - self._chunk_overlap 1146 1147 # Ensure we always advance to prevent infinite loops 1148 if new_start_sentence <= start_sentence: 1149 new_start_sentence = start_sentence + 1 1150 1151 start_sentence = new_start_sentence 1152 1153 # Safety check to prevent infinite loops 1154 if start_sentence >= len(sentences): 1155 break 1156 1157 return chunks 1158 1159 def _split_by_paragraphs(self, text: str) -> List[str]: 1160 """ 1161 Split text by paragraph boundaries while respecting paragraph count. 1162 1163 This strategy splits text into chunks based on the number of paragraphs, 1164 preserving paragraph integrity. 1165 """ 1166 # Split by paragraph boundaries (double newlines) 1167 paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()] 1168 1169 if len(paragraphs) <= self._chunk_size: 1170 return [text] 1171 1172 chunks = [] 1173 start_paragraph = 0 1174 1175 while start_paragraph < len(paragraphs): 1176 # Calculate end paragraph position for current chunk 1177 end_paragraph = start_paragraph + self._chunk_size 1178 1179 # Extract paragraphs for this chunk 1180 chunk_paragraphs = paragraphs[start_paragraph:end_paragraph] 1181 chunk = '\n\n'.join(chunk_paragraphs) 1182 1183 if chunk.strip(): # Only add non-empty chunks 1184 chunks.append(chunk) 1185 1186 # Calculate next start position with overlap 1187 new_start_paragraph = end_paragraph - self._chunk_overlap 1188 1189 # Ensure we always advance to prevent infinite loops 1190 if new_start_paragraph <= start_paragraph: 1191 new_start_paragraph = start_paragraph + 1 1192 1193 start_paragraph = new_start_paragraph 1194 1195 # Safety check to prevent infinite loops 1196 if start_paragraph >= len(paragraphs): 1197 break 1198 1199 return chunks 1200 1201 def _split_fixed(self, text: str) -> List[str]: 1202 """ 1203 Split text into fixed-size chunks without considering boundaries. 1204 1205 This strategy creates chunks of exactly chunk_size characters 1206 (except possibly the last chunk) without trying to preserve 1207 word or sentence boundaries. 
1208 """ 1209 chunks = [] 1210 start = 0 1211 1212 while start < len(text): 1213 end = start + self._chunk_size 1214 chunk = text[start:end].strip() 1215 1216 if chunk: # Only add non-empty chunks 1217 chunks.append(chunk) 1218 1219 # Calculate next start position with overlap 1220 new_start = end - self._chunk_overlap 1221 1222 # Ensure we always advance to prevent infinite loops 1223 if new_start <= start: 1224 new_start = start + 1 1225 1226 start = new_start 1227 1228 # Safety check to prevent infinite loops 1229 if start >= len(text): 1230 break 1231 1232 return chunks 1233 1234 def _split_semantic(self, text: str) -> List[str]: 1235 """ 1236 Split text by semantic boundaries. 1237 1238 This strategy attempts to break text at semantic boundaries like 1239 headers, section breaks, and other structural elements while 1240 respecting the chunk size. 1241 """ 1242 # Define semantic break patterns 1243 semantic_patterns = [ 1244 '\n# ', '\n## ', '\n### ', '\n#### ', # Markdown headers 1245 '\n1. ', '\n2. ', '\n3. ', '\n4. ', '\n5. ', # Numbered lists 1246 '\n• ', '\n- ', '\n* ', # Bullet points 1247 '\n\n', # Paragraph breaks 1248 '\n---\n', '\n___\n', # Horizontal rules 1249 '\n\nChapter ', '\n\nSection ', '\n\nPart ', # Document sections 1250 ] 1251 1252 chunks = [] 1253 current_chunk = "" 1254 1255 # Split text by semantic patterns 1256 parts = [text] 1257 for pattern in semantic_patterns: 1258 new_parts = [] 1259 for part in parts: 1260 if pattern in part: 1261 split_parts = part.split(pattern) 1262 for i, split_part in enumerate(split_parts): 1263 if i > 0: # Add the pattern back to all parts except the first 1264 split_part = pattern + split_part 1265 if split_part.strip(): 1266 new_parts.append(split_part) 1267 else: 1268 new_parts.append(part) 1269 parts = new_parts 1270 1271 # Group parts into chunks 1272 for part in parts: 1273 # If adding this part would exceed chunk size, start a new chunk 1274 if len(current_chunk) + len(part) > self._chunk_size and current_chunk: 1275 chunks.append(current_chunk.strip()) 1276 # Start new chunk with overlap 1277 overlap_start = max(0, len(current_chunk) - self._chunk_overlap) 1278 current_chunk = current_chunk[overlap_start:] + part 1279 else: 1280 current_chunk += part 1281 1282 # Add the last chunk 1283 if current_chunk.strip(): 1284 chunks.append(current_chunk.strip()) 1285 1286 return chunks
A utility class for building document collections from various sources.
This class provides methods to extract text content from files and web pages, split the content into manageable chunks with configurable size and overlap, and prepare the data for storage in vector databases.
The DocumentsBuilder is designed to work seamlessly with the RAG system, producing output that can be directly used with vector database operations.
Features:
- File-based document extraction with UTF-8 encoding support
- Text string processing for in-memory content
- Web scraping with multiple engine options (requests, tavily, selenium)
- Word document extraction (.doc and .docx formats)
- PDF document extraction with metadata
- Multiple chunking strategies (word, sentence, paragraph, fixed, semantic)
- Configurable chunk size and overlap parameters
- Rich metadata generation for each document chunk
- Unique ID generation for database storage
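The pieces above combine into a simple ingestion-and-query pipeline. The following is a minimal sketch; it assumes the package-level imports (monoai.rag) and that ChromaVectorDB accepts the name/vectorizer_provider/vectorizer_model constructor arguments and the add(documents, metadatas, ids) call used in the examples in this documentation. Adapt the names to your setup.

```python
from monoai.rag import RAG, ChromaVectorDB, DocumentsBuilder

# Build chunks from a local text file (word-based chunking by default)
builder = DocumentsBuilder(chunk_size=200, chunk_overlap=20)
documents, metadatas, ids = builder.from_file("article.txt")

# Store the chunks in a Chroma-backed collection
# (constructor arguments and add() signature assumed, as noted above)
vector_db = ChromaVectorDB(
    name="my_documents",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-3-small",
)
vector_db.add(documents, metadatas, ids)

# Query the same collection through the RAG interface
rag = RAG(database="my_documents", provider="openai", vectorizer="text-embedding-3-small")
results = rag.query("What does the article say about chunking?", k=5)
```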
Attributes:
- _chunk_strategy (str): The chunking strategy to use
- _chunk_size (int): Maximum size of each chunk; interpreted as a number of words, sentences, paragraphs, or characters depending on the strategy
- _chunk_overlap (int): Overlap between consecutive chunks, expressed in the same unit as _chunk_size
Examples:
Basic usage with file processing:
# Initialize with default chunk settings (word-based)
builder = DocumentsBuilder()
# Process a text file
documents, metadatas, ids = builder.from_file("document.txt")
# Add to vector database
vector_db.add(documents, metadatas, ids)
Text string processing:
# Process text strings directly
text_content = "This is a long text that needs to be processed..."
documents, metadatas, ids = builder.from_str(text_content)
# Process with custom source name
documents, metadatas, ids = builder.from_str(
    text_content,
    source_name="user_input"
)
Different chunking strategies:
# Default settings (word-based chunking)
builder = DocumentsBuilder()
# Sentence-based chunking (5 sentences per chunk)
builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)
# Paragraph-based chunking (3 paragraphs per chunk)
builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)
# Fixed-size chunks (800 characters per chunk)
builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)
# Word-based chunks (50 words per chunk)
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)
Web scraping with different engines:
# Basic web scraping
documents, metadatas, ids = builder.from_url("https://example.com")
# Advanced scraping with Tavily
documents, metadatas, ids = builder.from_url(
    "https://example.com",
    engine="tavily",
    deep=True
)
# JavaScript-heavy sites with Selenium
documents, metadatas, ids = builder.from_url(
    "https://spa-example.com",
    engine="selenium"
)
Word document processing:
# Process Word documents
documents, metadatas, ids = builder.from_doc("document.docx")
documents, metadatas, ids = builder.from_doc("document.doc")
# Process with custom extraction method
documents, metadatas, ids = builder.from_doc(
    "document.docx",
    extraction_method="docx2txt"
)
PDF document processing:
# Process PDF documents
documents, metadatas, ids = builder.from_pdf("document.pdf")
# Process with page range
documents, metadatas, ids = builder.from_pdf(
    "document.pdf",
    page_range=(1, 10)  # Extract pages 1-10
)
Notes:
- chunk_overlap should typically be 10-20% of chunk_size
- chunk_overlap must be less than chunk_size to prevent infinite loops
- Different strategies interpret chunk_size differently:
- word: chunk_size = number of words per chunk
- sentence: chunk_size = number of sentences per chunk
- paragraph: chunk_size = number of paragraphs per chunk
- fixed: chunk_size = number of characters per chunk
- semantic: chunk_size = number of characters per chunk
- Very small chunks may lose context
- Very large chunks may be less focused for retrieval
- The fixed strategy produces chunks of exactly chunk_size characters (except possibly the last); semantic chunks stay close to chunk_size but may be shorter or slightly longer, depending on where the semantic boundaries fall (see the sketch after these notes)
- Word document processing requires python-docx and python-docx2txt packages
- PDF processing requires PyPDF2 package
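As a quick illustration of the notes above, the sketch below chunks a synthetic text with the word strategy and shows the overlap constraint being enforced at construction time. This is a minimal sketch; exact chunk counts depend on the input text.

```python
from monoai.rag import DocumentsBuilder

text = " ".join(f"word{i}" for i in range(100))

# 20 words per chunk, 5 words of overlap between consecutive chunks
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=20, chunk_overlap=5)
docs, metas, ids = builder.from_str(text, source_name="demo")
print(len(docs))                                    # 7 chunks for this input
print(docs[0].split()[-5:] == docs[1].split()[:5])  # True: chunks overlap by 5 words

# chunk_overlap >= chunk_size is rejected to avoid infinite loops
try:
    DocumentsBuilder(chunk_size=10, chunk_overlap=10)
except ValueError as exc:
    print(exc)
```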
164 def __init__(self, chunk_strategy: str = "word", chunk_size: int = 1000, chunk_overlap: int = 0, custom_split_func: Optional[callable] = None): 165 """ 166 Initialize the DocumentsBuilder with chunking parameters. 167 168 Parameters: 169 ----------- 170 chunk_strategy : str, default="word" 171 The strategy to use for text chunking: 172 - "word": Break at word boundaries (spaces and newlines) when possible 173 - "sentence": Break at sentence boundaries (periods, exclamation marks, question marks) 174 - "paragraph": Break at paragraph boundaries (double newlines) 175 - "fixed": Break at exact character count without considering boundaries 176 - "semantic": Break at semantic boundaries (headers, sections, etc.) 177 - "custom": Use the provided custom_split_func for chunking 178 179 chunk_size : int, default=1000 180 The size limit for each chunk, interpreted differently based on strategy: 181 - "word": Maximum number of words per chunk 182 - "sentence": Maximum number of sentences per chunk 183 - "paragraph": Maximum number of paragraphs per chunk 184 - "fixed": Maximum number of characters per chunk 185 - "semantic": Maximum number of characters per chunk 186 - "custom": Passed to custom_split_func as a parameter 187 188 chunk_overlap : int, default=0 189 The overlap between consecutive chunks, interpreted based on strategy: 190 - "word": Number of words to overlap 191 - "sentence": Number of sentences to overlap 192 - "paragraph": Number of paragraphs to overlap 193 - "fixed": Number of characters to overlap 194 - "semantic": Number of characters to overlap 195 - "custom": Passed to custom_split_func as a parameter 196 197 custom_split_func : callable, optional 198 Custom function to use for text splitting. If provided, automatically sets chunk_strategy to "custom" 199 regardless of the chunk_strategy parameter value. 200 The function should have the signature: func(text: str, chunk_size: int, chunk_overlap: int) -> List[str] 201 and return a list of text chunks. 
202 203 Raises: 204 ------- 205 ValueError 206 If chunk_overlap >= chunk_size (would cause infinite loops) 207 If chunk_size <= 0 208 If chunk_overlap < 0 209 If chunk_strategy="custom" but no custom_split_func is provided 210 211 Examples: 212 --------- 213 ```python 214 # Default settings (word-based chunking) 215 builder = DocumentsBuilder() 216 217 # Sentence-based chunking (5 sentences per chunk) 218 builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5) 219 220 # Paragraph-based chunking (3 paragraphs per chunk) 221 builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3) 222 223 # Fixed-size chunks (800 characters per chunk) 224 builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800) 225 226 # Word-based chunks (50 words per chunk) 227 builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50) 228 229 # Custom chunking function 230 def my_custom_split(text, chunk_size, chunk_overlap): 231 # Split by lines and then by chunk_size 232 lines = text.split('\n') 233 chunks = [] 234 for i in range(0, len(lines), chunk_size - chunk_overlap): 235 chunk_lines = lines[i:i + chunk_size] 236 chunks.append('\n'.join(chunk_lines)) 237 return chunks 238 239 # Strategy automatically set to "custom" when custom_split_func is provided 240 builder = DocumentsBuilder( 241 chunk_size=100, 242 chunk_overlap=10, 243 custom_split_func=my_custom_split 244 ) 245 246 # Or explicitly set strategy (will be overridden to "custom") 247 builder = DocumentsBuilder( 248 chunk_strategy="word", # This will be ignored 249 chunk_size=100, 250 chunk_overlap=10, 251 custom_split_func=my_custom_split # Strategy becomes "custom" 252 ) 253 ``` 254 255 Notes: 256 ------ 257 - chunk_overlap should typically be 10-20% of chunk_size 258 - chunk_overlap must be less than chunk_size to prevent infinite loops 259 - Different strategies interpret chunk_size differently: 260 * word: chunk_size = number of words per chunk 261 * sentence: chunk_size = number of sentences per chunk 262 * paragraph: chunk_size = number of paragraphs per chunk 263 * fixed: chunk_size = number of characters per chunk 264 * semantic: chunk_size = number of characters per chunk 265 * custom: chunk_size is passed to custom_split_func 266 - Very small chunks may lose context 267 - Very large chunks may be less focused for retrieval 268 - Fixed and semantic strategies always produce chunks of exactly chunk_size (except the last one) 269 - Custom functions should handle their own overlap logic 270 - Custom functions can implement any splitting logic: 271 * Split by specific delimiters (e.g., "---", "###") 272 * Split by regex patterns 273 * Split by semantic boundaries using NLP libraries 274 * Split by document structure (headers, sections, etc.) 275 * Combine multiple strategies 276 - When custom_split_func is provided, chunk_strategy is automatically set to "custom" 277 regardless of the chunk_strategy parameter value 278 """ 279 # If custom_split_func is provided, automatically set strategy to "custom" 280 if custom_split_func is not None: 281 chunk_strategy = "custom" 282 283 self._chunk_strategy = chunk_strategy 284 self._chunk_size = chunk_size 285 self._chunk_overlap = chunk_overlap 286 self._custom_split_func = custom_split_func 287 288 # Validate parameters to prevent infinite loops 289 if chunk_overlap >= chunk_size: 290 raise ValueError( 291 f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size}) " 292 "to prevent infinite loops. Recommended: chunk_overlap should be 10-20% of chunk_size." 
293 ) 294 295 if chunk_size <= 0: 296 raise ValueError(f"chunk_size must be positive, got {chunk_size}") 297 298 if chunk_overlap < 0: 299 raise ValueError(f"chunk_overlap must be non-negative, got {chunk_overlap}") 300 301 # Validate custom split function 302 if chunk_strategy == "custom" and custom_split_func is None: 303 raise ValueError("custom_split_func must be provided when chunk_strategy='custom'") 304 305 if custom_split_func is not None and not callable(custom_split_func): 306 raise ValueError("custom_split_func must be callable")
Initialize the DocumentsBuilder with chunking parameters.
Parameters:
-----------
chunk_strategy : str, default="word"
The strategy to use for text chunking:
- "word": Break at word boundaries (spaces and newlines) when possible
- "sentence": Break at sentence boundaries (periods, exclamation marks, question marks)
- "paragraph": Break at paragraph boundaries (double newlines)
- "fixed": Break at exact character count without considering boundaries
- "semantic": Break at semantic boundaries (headers, sections, etc.)
- "custom": Use the provided custom_split_func for chunking
chunk_size : int, default=1000
The size limit for each chunk, interpreted differently based on strategy:
- "word": Maximum number of words per chunk
- "sentence": Maximum number of sentences per chunk
- "paragraph": Maximum number of paragraphs per chunk
- "fixed": Maximum number of characters per chunk
- "semantic": Maximum number of characters per chunk
- "custom": Passed to custom_split_func as a parameter
chunk_overlap : int, default=0
The overlap between consecutive chunks, interpreted based on strategy:
- "word": Number of words to overlap
- "sentence": Number of sentences to overlap
- "paragraph": Number of paragraphs to overlap
- "fixed": Number of characters to overlap
- "semantic": Number of characters to overlap
- "custom": Passed to custom_split_func as a parameter
custom_split_func : callable, optional
Custom function to use for text splitting. If provided, automatically sets chunk_strategy to "custom"
regardless of the chunk_strategy parameter value.
The function should have the signature: func(text: str, chunk_size: int, chunk_overlap: int) -> List[str]
and return a list of text chunks.
Raises:
-------
ValueError
If chunk_overlap >= chunk_size (would cause infinite loops)
If chunk_size <= 0
If chunk_overlap < 0
If chunk_strategy="custom" but no custom_split_func is provided
Examples:
---------
# Default settings (word-based chunking)
builder = DocumentsBuilder()
# Sentence-based chunking (5 sentences per chunk)
builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)
# Paragraph-based chunking (3 paragraphs per chunk)
builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)
# Fixed-size chunks (800 characters per chunk)
builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)
# Word-based chunks (50 words per chunk)
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)
# Custom chunking function
def my_custom_split(text, chunk_size, chunk_overlap):
    # Split by lines and then by chunk_size
    lines = text.split('\n')
    chunks = []
    for i in range(0, len(lines), chunk_size - chunk_overlap):
        chunk_lines = lines[i:i + chunk_size]
        chunks.append('\n'.join(chunk_lines))
    return chunks

# Strategy automatically set to "custom" when custom_split_func is provided
builder = DocumentsBuilder(
    chunk_size=100,
    chunk_overlap=10,
    custom_split_func=my_custom_split
)

# Or explicitly set strategy (will be overridden to "custom")
builder = DocumentsBuilder(
    chunk_strategy="word",  # This will be ignored
    chunk_size=100,
    chunk_overlap=10,
    custom_split_func=my_custom_split  # Strategy becomes "custom"
)
Notes:
------
- chunk_overlap should typically be 10-20% of chunk_size
- chunk_overlap must be less than chunk_size to prevent infinite loops
- Different strategies interpret chunk_size differently:
* word: chunk_size = number of words per chunk
* sentence: chunk_size = number of sentences per chunk
* paragraph: chunk_size = number of paragraphs per chunk
* fixed: chunk_size = number of characters per chunk
* semantic: chunk_size = number of characters per chunk
* custom: chunk_size is passed to custom_split_func
- Very small chunks may lose context
- Very large chunks may be less focused for retrieval
- The fixed strategy produces chunks of exactly chunk_size characters (except possibly the last); semantic chunks stay close to chunk_size but may be shorter or slightly longer, depending on where the semantic boundaries fall
- Custom functions should handle their own overlap logic
- Custom functions can implement any splitting logic:
* Split by specific delimiters (e.g., "---", "###")
* Split by regex patterns
* Split by semantic boundaries using NLP libraries
* Split by document structure (headers, sections, etc.)
* Combine multiple strategies
- When custom_split_func is provided, chunk_strategy is automatically set to "custom"
regardless of the chunk_strategy parameter value
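The notes above mention regex-based custom splitters. The hedged sketch below shows one possible custom_split_func that splits on Markdown headers; the function is illustrative and not part of the library. Note that texts no longer than chunk_size are returned as a single chunk before the custom function is called.

```python
import re
from monoai.rag import DocumentsBuilder

def split_on_headers(text, chunk_size, chunk_overlap):
    # Split before every Markdown header line; chunk_size/chunk_overlap are
    # ignored here, since custom functions handle their own sizing and overlap.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]

# Strategy becomes "custom" automatically; chunk_size kept small so short
# texts are still routed through the custom splitter
builder = DocumentsBuilder(chunk_size=10, custom_split_func=split_on_headers)
docs, metas, ids = builder.from_str("# Intro\nHello\n\n# Usage\nDetails", source_name="readme")
print(docs)  # ['# Intro\nHello', '# Usage\nDetails']
```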
308 def from_file(self, file_path: str) -> Tuple[List[str], List[Dict], List[str]]: 309 """ 310 Read a file and split it into chunks with specified size and overlap. 311 312 This method reads a text file from the filesystem, splits its content 313 into chunks according to the configured parameters, and generates 314 metadata and unique IDs for each chunk. 315 316 Parameters: 317 ----------- 318 file_path : str 319 Path to the text file to read. The file must exist and be 320 readable. UTF-8 encoding is assumed. 321 322 Returns: 323 -------- 324 Tuple[List[str], List[Dict], List[str]] 325 A tuple containing: 326 - List of document chunks (strings): The text content split into chunks 327 - List of metadata dictionaries: Metadata for each chunk including 328 file information and chunk details 329 - List of unique IDs: UUID strings for each chunk 330 331 Raises: 332 ------- 333 FileNotFoundError 334 If the specified file does not exist or is not accessible. 335 336 UnicodeDecodeError 337 If the file cannot be decoded as UTF-8. 338 339 Examples: 340 --------- 341 ```python 342 # Process a single file 343 documents, metadatas, ids = builder.from_file("article.txt") 344 345 # Access metadata information 346 for i, metadata in enumerate(metadatas): 347 print(f"Chunk {i+1}:") 348 print(f" File: {metadata['file_name']}") 349 print(f" Size: {metadata['chunk_size']} characters") 350 print(f" Position: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}") 351 ``` 352 353 Notes: 354 ------ 355 - File is read entirely into memory before processing 356 - Empty files will return empty lists 357 - File path is stored in metadata for traceability 358 - Chunk indexing starts at 0 359 """ 360 if not os.path.exists(file_path): 361 raise FileNotFoundError(f"File not found: {file_path}") 362 363 # Read the file content 364 with open(file_path, 'r', encoding='utf-8') as file: 365 text = file.read() 366 367 # Split text into chunks 368 chunks = self._split_text(text) 369 370 # Generate metadata and IDs for each chunk 371 documents = [] 372 metadatas = [] 373 ids = [] 374 375 for i, chunk in enumerate(chunks): 376 # Generate unique ID 377 chunk_id = str(uuid.uuid4()) 378 379 # Create metadata 380 metadata = { 381 'file_path': file_path, 382 'file_name': os.path.basename(file_path), 383 'chunk_index': i, 384 'total_chunks': len(chunks), 385 'chunk_size': len(chunk) 386 } 387 388 documents.append(chunk) 389 metadatas.append(metadata) 390 ids.append(chunk_id) 391 392 return documents, metadatas, ids
Read a file and split it into chunks with specified size and overlap.
This method reads a text file from the filesystem, splits its content into chunks according to the configured parameters, and generates metadata and unique IDs for each chunk.
Parameters:
file_path : str Path to the text file to read. The file must exist and be readable. UTF-8 encoding is assumed.
Returns:
Tuple[List[str], List[Dict], List[str]]
A tuple containing:
- List of document chunks (strings): The text content split into chunks
- List of metadata dictionaries: Metadata for each chunk, including file information and chunk details
- List of unique IDs: UUID strings for each chunk
Raises:
FileNotFoundError If the specified file does not exist or is not accessible.
UnicodeDecodeError If the file cannot be decoded as UTF-8.
Examples:
# Process a single file
documents, metadatas, ids = builder.from_file("article.txt")
# Access metadata information
for i, metadata in enumerate(metadatas):
    print(f"Chunk {i+1}:")
    print(f"  File: {metadata['file_name']}")
    print(f"  Size: {metadata['chunk_size']} characters")
    print(f"  Position: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")
Notes:
- File is read entirely into memory before processing
- Empty files will return empty lists
- File path is stored in metadata for traceability
- Chunk indexing starts at 0
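Because from_file processes one file at a time, ingesting a folder is a matter of looping and concatenating the returned lists. A minimal sketch, assuming a hypothetical corpus/ directory of UTF-8 .txt files:

```python
import os
from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3, chunk_overlap=1)

all_documents, all_metadatas, all_ids = [], [], []
for name in os.listdir("corpus"):  # hypothetical folder of UTF-8 .txt files
    if not name.endswith(".txt"):
        continue
    docs, metas, ids = builder.from_file(os.path.join("corpus", name))
    all_documents.extend(docs)
    all_metadatas.extend(metas)
    all_ids.extend(ids)

print(f"{len(all_documents)} chunks from "
      f"{len(set(m['file_name'] for m in all_metadatas))} files")
```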
394 def from_str(self, text: str, source_name: str = "text_string") -> Tuple[List[str], List[Dict], List[str]]: 395 """ 396 Process a text string and split it into chunks with specified size and overlap. 397 398 This method takes a text string directly and processes it using the same 399 chunking logic as file processing. It's useful when you already have 400 text content in memory and want to prepare it for vector database storage. 401 402 Parameters: 403 ----------- 404 text : str 405 The text content to process and split into chunks. 406 407 source_name : str, default="text_string" 408 A descriptive name for the text source. This will be included 409 in the metadata for traceability and identification. 410 411 Returns: 412 -------- 413 Tuple[List[str], List[Dict], List[str]] 414 A tuple containing: 415 - List of document chunks (strings): The text content split into chunks 416 - List of metadata dictionaries: Metadata for each chunk including 417 source information and chunk details 418 - List of unique IDs: UUID strings for each chunk 419 420 Examples: 421 --------- 422 ```python 423 # Process a simple text string 424 text_content = "This is a long text that needs to be processed..." 425 documents, metadatas, ids = builder.from_str(text_content) 426 427 # Process with custom source name 428 documents, metadatas, ids = builder.from_str( 429 text_content, 430 source_name="user_input" 431 ) 432 433 # Process multiple text strings 434 text_parts = [ 435 "First part of the document...", 436 "Second part of the document...", 437 "Third part of the document..." 438 ] 439 440 all_documents = [] 441 all_metadatas = [] 442 all_ids = [] 443 444 for i, text_part in enumerate(text_parts): 445 documents, metadatas, ids = builder.from_str( 446 text_part, 447 source_name=f"document_part_{i+1}" 448 ) 449 all_documents.extend(documents) 450 all_metadatas.extend(metadatas) 451 all_ids.extend(ids) 452 ``` 453 454 Notes: 455 ------ 456 - Uses the same chunking strategy and parameters as other methods 457 - Empty strings will return empty lists 458 - Source name is stored in metadata for identification 459 - Useful for processing text from APIs, user input, or generated content 460 """ 461 if not text or not text.strip(): 462 return [], [], [] 463 464 # Split text into chunks 465 chunks = self._split_text(text) 466 467 # Generate metadata and IDs for each chunk 468 documents = [] 469 metadatas = [] 470 ids = [] 471 472 for i, chunk in enumerate(chunks): 473 # Generate unique ID 474 chunk_id = str(uuid.uuid4()) 475 476 # Create metadata 477 metadata = { 478 'source_type': 'text_string', 479 'source_name': source_name, 480 'chunk_index': i, 481 'total_chunks': len(chunks), 482 'chunk_size': len(chunk), 483 'chunk_strategy': self._chunk_strategy 484 } 485 486 documents.append(chunk) 487 metadatas.append(metadata) 488 ids.append(chunk_id) 489 490 return documents, metadatas, ids
Process a text string and split it into chunks with specified size and overlap.
This method takes a text string directly and processes it using the same chunking logic as file processing. It's useful when you already have text content in memory and want to prepare it for vector database storage.
Parameters:
text : str The text content to process and split into chunks.
source_name : str, default="text_string" A descriptive name for the text source. This will be included in the metadata for traceability and identification.
Returns:
Tuple[List[str], List[Dict], List[str]]
A tuple containing:
- List of document chunks (strings): The text content split into chunks
- List of metadata dictionaries: Metadata for each chunk, including source information and chunk details
- List of unique IDs: UUID strings for each chunk
Examples:
# Process a simple text string
text_content = "This is a long text that needs to be processed..."
documents, metadatas, ids = builder.from_str(text_content)
# Process with custom source name
documents, metadatas, ids = builder.from_str(
    text_content,
    source_name="user_input"
)
# Process multiple text strings
text_parts = [
    "First part of the document...",
    "Second part of the document...",
    "Third part of the document..."
]
all_documents = []
all_metadatas = []
all_ids = []
for i, text_part in enumerate(text_parts):
    documents, metadatas, ids = builder.from_str(
        text_part,
        source_name=f"document_part_{i+1}"
    )
    all_documents.extend(documents)
    all_metadatas.extend(metadatas)
    all_ids.extend(ids)
Notes:
- Uses the same chunking strategy and parameters as other methods
- Empty strings will return empty lists
- Source name is stored in metadata for identification
- Useful for processing text from APIs, user input, or generated content
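A short sketch of the behaviour described above (empty input and metadata tagging); the source_name and text are illustrative:

```python
from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=2, chunk_overlap=0)

# Empty or whitespace-only input returns three empty lists
assert builder.from_str("   ") == ([], [], [])

# Non-empty input is chunked and tagged with the source name and strategy
docs, metas, ids = builder.from_str(
    "First sentence. Second sentence. Third sentence.",
    source_name="api_response",
)
print(metas[0]["source_name"], metas[0]["chunk_strategy"])  # api_response sentence
```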
492 def from_doc(self, file_path: str, extraction_method: str = "auto") -> Tuple[List[str], List[Dict], List[str]]: 493 """ 494 Extract text from Word documents (.doc and .docx files) and split into chunks. 495 496 This method supports both .doc and .docx formats using different extraction 497 methods. For .docx files, it can use either python-docx or docx2txt libraries. 498 For .doc files, it uses docx2txt which can handle the older format. 499 500 Parameters: 501 ----------- 502 file_path : str 503 Path to the Word document (.doc or .docx file). The file must exist 504 and be readable. 505 506 extraction_method : str, default="auto" 507 The method to use for text extraction: 508 - "auto": Automatically choose the best method based on file extension 509 - "docx": Use python-docx library (only for .docx files) 510 - "docx2txt": Use docx2txt library (works for both .doc and .docx) 511 512 Returns: 513 -------- 514 Tuple[List[str], List[Dict], List[str]] 515 A tuple containing: 516 - List of document chunks (strings): The extracted text split into chunks 517 - List of metadata dictionaries: Metadata for each chunk including 518 file information, document properties, and chunk details 519 - List of unique IDs: UUID strings for each chunk 520 521 Raises: 522 ------- 523 FileNotFoundError 524 If the specified file does not exist or is not accessible. 525 526 ValueError 527 If the file is not a supported Word document format or if the 528 required extraction method is not available. 529 530 ImportError 531 If the required libraries for the chosen extraction method are not installed. 532 533 Examples: 534 --------- 535 ```python 536 # Process a .docx file with automatic method selection 537 documents, metadatas, ids = builder.from_doc("document.docx") 538 539 # Process a .doc file 540 documents, metadatas, ids = builder.from_doc("document.doc") 541 542 # Force specific extraction method 543 documents, metadatas, ids = builder.from_doc( 544 "document.docx", 545 extraction_method="docx2txt" 546 ) 547 548 # Access document metadata 549 for metadata in metadatas: 550 print(f"File: {metadata['file_name']}") 551 print(f"Format: {metadata['document_format']}") 552 print(f"Extraction method: {metadata['extraction_method']}") 553 ``` 554 555 Notes: 556 ------ 557 - .docx files are the modern Word format (Office 2007+) 558 - .doc files are the legacy Word format (Office 97-2003) 559 - python-docx provides better structure preservation for .docx files 560 - docx2txt works with both formats but may lose some formatting 561 - Document properties (title, author, etc.) are extracted when available 562 - Images and complex formatting are not preserved in the extracted text 563 """ 564 if not os.path.exists(file_path): 565 raise FileNotFoundError(f"File not found: {file_path}") 566 567 # Determine file extension and validate 568 file_extension = os.path.splitext(file_path)[1].lower() 569 if file_extension not in ['.doc', '.docx']: 570 raise ValueError(f"Unsupported file format: {file_extension}. Only .doc and .docx files are supported.") 571 572 # Determine extraction method 573 if extraction_method == "auto": 574 if file_extension == '.docx' and DOCX_AVAILABLE: 575 extraction_method = "docx" 576 else: 577 extraction_method = "docx2txt" 578 579 # Extract text based on method 580 if extraction_method == "docx": 581 if not DOCX_AVAILABLE: 582 raise ImportError("python-docx library is required for 'docx' extraction method. 
Install with: pip install python-docx") 583 if file_extension != '.docx': 584 raise ValueError("'docx' extraction method only supports .docx files") 585 text, doc_properties = self._extract_with_docx(file_path) 586 elif extraction_method == "docx2txt": 587 if not DOCX2TXT_AVAILABLE: 588 raise ImportError("docx2txt library is required for 'docx2txt' extraction method. Install with: pip install python-docx2txt") 589 text, doc_properties = self._extract_with_docx2txt(file_path) 590 else: 591 raise ValueError(f"Unsupported extraction method: {extraction_method}") 592 593 # Split text into chunks 594 chunks = self._split_text(text) 595 596 # Generate metadata and IDs for each chunk 597 documents = [] 598 metadatas = [] 599 ids = [] 600 601 for i, chunk in enumerate(chunks): 602 # Generate unique ID 603 chunk_id = str(uuid.uuid4()) 604 605 # Create metadata 606 metadata = { 607 'file_path': file_path, 608 'file_name': os.path.basename(file_path), 609 'document_format': file_extension[1:], # Remove the dot 610 'extraction_method': extraction_method, 611 'chunk_index': i, 612 'total_chunks': len(chunks), 613 'chunk_size': len(chunk) 614 } 615 616 # Add document properties if available 617 if doc_properties: 618 metadata.update(doc_properties) 619 620 documents.append(chunk) 621 metadatas.append(metadata) 622 ids.append(chunk_id) 623 624 return documents, metadatas, ids
Extract text from Word documents (.doc and .docx files) and split into chunks.
This method supports both .doc and .docx formats using different extraction methods. For .docx files, it can use either python-docx or docx2txt libraries. For .doc files, it uses docx2txt which can handle the older format.
Parameters:
file_path : str Path to the Word document (.doc or .docx file). The file must exist and be readable.
extraction_method : str, default="auto"
The method to use for text extraction:
- "auto": Automatically choose the best method based on file extension
- "docx": Use the python-docx library (only for .docx files)
- "docx2txt": Use the docx2txt library (works for both .doc and .docx)
Returns:
Tuple[List[str], List[Dict], List[str]]
A tuple containing:
- List of document chunks (strings): The extracted text split into chunks
- List of metadata dictionaries: Metadata for each chunk, including file information, document properties, and chunk details
- List of unique IDs: UUID strings for each chunk
Raises:
FileNotFoundError If the specified file does not exist or is not accessible.
ValueError If the file is not a supported Word document format or if the required extraction method is not available.
ImportError If the required libraries for the chosen extraction method are not installed.
Examples:
# Process a .docx file with automatic method selection
documents, metadatas, ids = builder.from_doc("document.docx")
# Process a .doc file
documents, metadatas, ids = builder.from_doc("document.doc")
# Force specific extraction method
documents, metadatas, ids = builder.from_doc(
    "document.docx",
    extraction_method="docx2txt"
)
# Access document metadata
for metadata in metadatas:
    print(f"File: {metadata['file_name']}")
    print(f"Format: {metadata['document_format']}")
    print(f"Extraction method: {metadata['extraction_method']}")
Notes:
- .docx files are the modern Word format (Office 2007+)
- .doc files are the legacy Word format (Office 97-2003)
- python-docx provides better structure preservation for .docx files
- docx2txt works with both formats but may lose some formatting
- Document properties (title, author, etc.) are extracted when available
- Images and complex formatting are not preserved in the extracted text
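Since the extraction backend depends on which optional libraries are installed, it is often convenient to wrap from_doc in the error handling described above. A hedged sketch, using a placeholder report.docx path:

```python
from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder(chunk_strategy="word", chunk_size=150, chunk_overlap=15)

try:
    # "auto" prefers python-docx for .docx files and falls back to docx2txt
    docs, metas, ids = builder.from_doc("report.docx", extraction_method="auto")
except FileNotFoundError:
    print("Document not found")
except ImportError as exc:
    # Raised when the required extraction library is not installed
    print(f"Missing dependency: {exc}")
except ValueError as exc:
    # Raised for unsupported formats or an unavailable extraction method
    print(f"Cannot process document: {exc}")
else:
    # Document properties (title, author, ...) are merged into chunk metadata when available
    print(metas[0].get("document_title", "untitled"), metas[0]["extraction_method"])
```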
626 def from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[List[str], List[Dict], List[str]]: 627 """ 628 Extract text from PDF documents and split into chunks. 629 630 This method extracts text content from PDF files using PyPDF2 library. 631 It supports extracting all pages or a specific range of pages, and 632 preserves page information in the metadata. 633 634 Parameters: 635 ----------- 636 file_path : str 637 Path to the PDF file. The file must exist and be readable. 638 639 page_range : Tuple[int, int], optional 640 Range of pages to extract (start_page, end_page), where pages are 641 1-indexed. If None, all pages are extracted. 642 Example: (1, 5) extracts pages 1 through 5. 643 644 Returns: 645 -------- 646 Tuple[List[str], List[Dict], List[str]] 647 A tuple containing: 648 - List of document chunks (strings): The extracted text split into chunks 649 - List of metadata dictionaries: Metadata for each chunk including 650 file information, PDF properties, page information, and chunk details 651 - List of unique IDs: UUID strings for each chunk 652 653 Raises: 654 ------- 655 FileNotFoundError 656 If the specified file does not exist or is not accessible. 657 658 ValueError 659 If the file is not a valid PDF or if the page range is invalid. 660 661 ImportError 662 If PyPDF2 library is not installed. 663 664 Examples: 665 --------- 666 ```python 667 # Process entire PDF 668 documents, metadatas, ids = builder.from_pdf("document.pdf") 669 670 # Process specific page range 671 documents, metadatas, ids = builder.from_pdf( 672 "document.pdf", 673 page_range=(1, 10) # Pages 1-10 674 ) 675 676 # Process single page 677 documents, metadatas, ids = builder.from_pdf( 678 "document.pdf", 679 page_range=(5, 5) # Only page 5 680 ) 681 682 # Access PDF metadata 683 for metadata in metadatas: 684 print(f"File: {metadata['file_name']}") 685 print(f"Page: {metadata.get('page_number', 'N/A')}") 686 print(f"Total pages: {metadata.get('total_pages', 'N/A')}") 687 print(f"PDF title: {metadata.get('pdf_title', 'N/A')}") 688 ``` 689 690 Notes: 691 ------ 692 - Page numbers are 1-indexed (first page is page 1) 693 - Text extraction quality depends on the PDF structure 694 - Scanned PDFs may not extract text properly 695 - PDF metadata (title, author, etc.) is extracted when available 696 - Page information is preserved in chunk metadata 697 - Images and complex formatting are not preserved 698 """ 699 if not PDF_AVAILABLE: 700 raise ImportError("PyPDF2 library is required for PDF processing. Install with: pip install PyPDF2") 701 702 if not os.path.exists(file_path): 703 raise FileNotFoundError(f"File not found: {file_path}") 704 705 # Validate file extension 706 file_extension = os.path.splitext(file_path)[1].lower() 707 if file_extension != '.pdf': 708 raise ValueError(f"Unsupported file format: {file_extension}. 
Only .pdf files are supported.") 709 710 # Extract text and metadata from PDF 711 text, pdf_properties, page_info = self._extract_from_pdf(file_path, page_range) 712 713 # Split text into chunks 714 chunks = self._split_text(text) 715 716 # Generate metadata and IDs for each chunk 717 documents = [] 718 metadatas = [] 719 ids = [] 720 721 for i, chunk in enumerate(chunks): 722 # Generate unique ID 723 chunk_id = str(uuid.uuid4()) 724 725 # Create metadata 726 metadata = { 727 'file_path': file_path, 728 'file_name': os.path.basename(file_path), 729 'document_format': 'pdf', 730 'chunk_index': i, 731 'total_chunks': len(chunks), 732 'chunk_size': len(chunk) 733 } 734 735 # Add PDF properties if available 736 if pdf_properties: 737 metadata.update(pdf_properties) 738 739 # Add page information if available 740 if page_info: 741 metadata.update(page_info) 742 743 documents.append(chunk) 744 metadatas.append(metadata) 745 ids.append(chunk_id) 746 747 return documents, metadatas, ids
Extract text from PDF documents and split into chunks.
This method extracts text content from PDF files using PyPDF2 library. It supports extracting all pages or a specific range of pages, and preserves page information in the metadata.
Parameters:
file_path : str Path to the PDF file. The file must exist and be readable.
page_range : Tuple[int, int], optional Range of pages to extract (start_page, end_page), where pages are 1-indexed. If None, all pages are extracted. Example: (1, 5) extracts pages 1 through 5.
Returns:
Tuple[List[str], List[Dict], List[str]]
A tuple containing:
- List of document chunks (strings): The extracted text split into chunks
- List of metadata dictionaries: Metadata for each chunk, including file information, PDF properties, page information, and chunk details
- List of unique IDs: UUID strings for each chunk
Raises:
FileNotFoundError If the specified file does not exist or is not accessible.
ValueError If the file is not a valid PDF or if the page range is invalid.
ImportError If PyPDF2 library is not installed.
Examples:
# Process entire PDF
documents, metadatas, ids = builder.from_pdf("document.pdf")
# Process specific page range
documents, metadatas, ids = builder.from_pdf(
    "document.pdf",
    page_range=(1, 10)  # Pages 1-10
)
# Process single page
documents, metadatas, ids = builder.from_pdf(
    "document.pdf",
    page_range=(5, 5)  # Only page 5
)
# Access PDF metadata
for metadata in metadatas:
    print(f"File: {metadata['file_name']}")
    print(f"Page: {metadata.get('page_number', 'N/A')}")
    print(f"Total pages: {metadata.get('total_pages', 'N/A')}")
    print(f"PDF title: {metadata.get('pdf_title', 'N/A')}")
Notes:
- Page numbers are 1-indexed (first page is page 1)
- Text extraction quality depends on the PDF structure
- Scanned PDFs may not extract text properly
- PDF metadata (title, author, etc.) is extracted when available
- Page information is preserved in chunk metadata
- Images and complex formatting are not preserved
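For very large PDFs it can be useful to ingest in page-range batches so each chunk's metadata records the page window it came from. A minimal sketch with a placeholder manual.pdf path and an assumed page count:

```python
from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800, chunk_overlap=80)

total_pages = 100  # assumed page count for this sketch
all_docs, all_metas, all_ids = [], [], []

# Ingest the PDF in 25-page windows; each chunk's metadata then carries
# extracted_pages_start / extracted_pages_end for the window it came from
for start in range(1, total_pages + 1, 25):
    end = min(start + 24, total_pages)
    docs, metas, ids = builder.from_pdf("manual.pdf", page_range=(start, end))
    all_docs.extend(docs)
    all_metas.extend(metas)
    all_ids.extend(ids)

print(all_metas[0]["extracted_pages_start"], all_metas[0]["extracted_pages_end"])  # 1 25
```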
749 def from_url(self, url: str, engine: str = "requests", deep: bool = False) -> Tuple[List[str], List[Dict], List[str]]: 750 """ 751 Scrape content from a URL and split it into chunks with specified size and overlap. 752 753 This method uses web scraping to extract text content from a webpage, 754 then processes the content using the same chunking logic as file processing. 755 Multiple scraping engines are supported for different types of websites. 756 757 Parameters: 758 ----------- 759 url : str 760 The URL to scrape. Must be a valid HTTP/HTTPS URL. 761 762 engine : str, default="requests" 763 The web scraping engine to use: 764 - "requests": Simple HTTP requests (fast, good for static content) 765 - "tavily": Advanced web scraping with better content extraction 766 - "selenium": Full browser automation (good for JavaScript-heavy sites) 767 768 deep : bool, default=False 769 If using the "tavily" engine, whether to use advanced extraction mode. 770 Deep extraction provides better content quality but is slower. 771 772 Returns: 773 -------- 774 Tuple[List[str], List[Dict], List[str]] 775 A tuple containing: 776 - List of document chunks (strings): The scraped text split into chunks 777 - List of metadata dictionaries: Metadata for each chunk including 778 URL information and scraping details 779 - List of unique IDs: UUID strings for each chunk 780 781 Raises: 782 ------- 783 ValueError 784 If the scraping fails or no text content is extracted. 785 786 Examples: 787 --------- 788 ```python 789 # Basic web scraping 790 documents, metadatas, ids = builder.from_url("https://example.com") 791 792 # Advanced scraping with Tavily 793 documents, metadatas, ids = builder.from_url( 794 "https://blog.example.com", 795 engine="tavily", 796 deep=True 797 ) 798 799 # JavaScript-heavy site with Selenium 800 documents, metadatas, ids = builder.from_url( 801 "https://spa.example.com", 802 engine="selenium" 803 ) 804 805 # Access scraping metadata 806 for metadata in metadatas: 807 print(f"Source: {metadata['url']}") 808 print(f"Engine: {metadata['scraping_engine']}") 809 print(f"Deep extraction: {metadata['deep_extraction']}") 810 ``` 811 812 Notes: 813 ------ 814 - Scraping may take time depending on the engine and website complexity 815 - Some websites may block automated scraping 816 - Selenium requires Chrome/Chromium to be installed 817 - Tavily requires an API key to be configured 818 """ 819 # Initialize WebScraping with specified engine 820 scraper = WebScraping(engine=engine, deep=deep) 821 822 # Scrape the URL 823 result = scraper.scrape(url) 824 825 if not result or not result.get("text"): 826 raise ValueError(f"Failed to extract text content from URL: {url}") 827 828 text = result["text"] 829 830 # Split text into chunks 831 chunks = self._split_text(text) 832 833 # Generate metadata and IDs for each chunk 834 documents = [] 835 metadatas = [] 836 ids = [] 837 838 for i, chunk in enumerate(chunks): 839 # Generate unique ID 840 chunk_id = str(uuid.uuid4()) 841 842 # Create metadata 843 metadata = { 844 'url': url, 845 'source_type': 'web_page', 846 'scraping_engine': engine, 847 'deep_extraction': deep, 848 'chunk_index': i, 849 'total_chunks': len(chunks), 850 'chunk_size': len(chunk) 851 } 852 853 documents.append(chunk) 854 metadatas.append(metadata) 855 ids.append(chunk_id) 856 857 return documents, metadatas, ids
Scrape content from a URL and split it into chunks with specified size and overlap.
This method uses web scraping to extract text content from a webpage, then processes the content using the same chunking logic as file processing. Multiple scraping engines are supported for different types of websites.
Parameters:
url : str The URL to scrape. Must be a valid HTTP/HTTPS URL.
engine : str, default="requests"
The web scraping engine to use:
- "requests": Simple HTTP requests (fast, good for static content)
- "tavily": Advanced web scraping with better content extraction
- "selenium": Full browser automation (good for JavaScript-heavy sites)
deep : bool, default=False If using the "tavily" engine, whether to use advanced extraction mode. Deep extraction provides better content quality but is slower.
Returns:
Tuple[List[str], List[Dict], List[str]]
A tuple containing:
- List of document chunks (strings): The scraped text split into chunks
- List of metadata dictionaries: Metadata for each chunk, including URL information and scraping details
- List of unique IDs: UUID strings for each chunk
Raises:
ValueError If the scraping fails or no text content is extracted.
Examples:
# Basic web scraping
documents, metadatas, ids = builder.from_url("https://example.com")
# Advanced scraping with Tavily
documents, metadatas, ids = builder.from_url(
    "https://blog.example.com",
    engine="tavily",
    deep=True
)
# JavaScript-heavy site with Selenium
documents, metadatas, ids = builder.from_url(
    "https://spa.example.com",
    engine="selenium"
)
# Access scraping metadata
for metadata in metadatas:
    print(f"Source: {metadata['url']}")
    print(f"Engine: {metadata['scraping_engine']}")
    print(f"Deep extraction: {metadata['deep_extraction']}")
Notes:
- Scraping may take time depending on the engine and website complexity
- Some websites may block automated scraping
- Selenium requires Chrome/Chromium to be installed
- Tavily requires an API key to be configured
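Finally, a hedged sketch of ingesting several pages while handling the ValueError raised when scraping fails; the URLs are placeholders:

```python
from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder(chunk_strategy="word", chunk_size=200, chunk_overlap=20)

urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api",
]

all_docs, all_metas, all_ids = [], [], []
for url in urls:
    try:
        docs, metas, ids = builder.from_url(url, engine="requests")
    except ValueError as exc:
        # Raised when no text content could be extracted (blocked page, empty body, ...)
        print(f"Skipping {url}: {exc}")
        continue
    all_docs.extend(docs)
    all_metas.extend(metas)
    all_ids.extend(ids)

print(f"Collected {len(all_docs)} chunks from "
      f"{len(set(m['url'] for m in all_metas))} pages")
```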