monoai.rag

This module provides a high-level interface for performing semantic search queries against a vector database. It supports multiple vector database backends and embedding providers for flexible deployment scenarios.
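A minimal end-to-end sketch of how the exported pieces fit together. It assumes the default ChromaDB backend and default chunking settings; the text, collection name, and query below are placeholders.

# Chunk some text, store the chunks, then query them through RAG
from monoai.rag import RAG, ChromaVectorDB, DocumentsBuilder

builder = DocumentsBuilder()
documents, metadatas, ids = builder.from_str(
    "Machine learning is a subset of artificial intelligence...",
    source_name="intro_notes"  # placeholder source label
)

vector_db = ChromaVectorDB(name="my_documents")
vector_db.add(documents, metadatas, ids)

rag = RAG(database="my_documents")
results = rag.query("What is machine learning?", k=3)
print(results["documents"][0])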

 1"""
 2RAG is a module that provides a high-level interface for performing semantic search queries
 3against a vector database. It supports multiple vector database backends and
 4embedding providers for flexible deployment scenarios.
 5"""
 6
 7from .rag import RAG
 8from .vectordb import ChromaVectorDB
 9from .documents_builder import DocumentsBuilder
10
11__all__ = ['RAG', 'ChromaVectorDB', 'DocumentsBuilder'] 
class RAG:
  7class RAG:
  8    """
  9    Retrieval-Augmented Generation (RAG) system for semantic search and document retrieval.
 10    
 11    This class provides a high-level interface for performing semantic search queries
 12    against a vector database. It supports multiple vector database backends and
 13    embedding providers for flexible deployment scenarios.
 14    
 15    The RAG system works by:
 16    1. Converting text queries into vector embeddings
 17    2. Searching the vector database for similar document embeddings
 18    3. Returning the most relevant documents based on semantic similarity
 19    
 20    Attributes:
 21        _vectorizer (str): The embedding model used for vectorization
 22        _db (str): Name of the vector database
 23        _vector_db (ChromaVectorDB): The vector database backend
 24    
 25    Examples:
 26    --------
 27     Basic usage with default settings:
 28        
 29    ```python
 30    # Initialize RAG with a database name
 31    rag = RAG(database="my_documents")
 32        
 33    # Perform a semantic search
 34    results = rag.query("What is machine learning?", k=5)
 35    ```
 36        
 37    Using with specific embedding provider:
 38        
 39    ```python
 40    # Initialize with OpenAI embeddings
 41    rag = RAG(
 42        database="my_documents",
 43        provider="openai",
 44        vectorizer="text-embedding-ada-002"
 45    )
 46        
 47    # Search for relevant documents
 48    results = rag.query("Explain neural networks", k=10)
 49    ```
 50        
 51    Working with different vector databases:
 52        
 53    ```python
 54    # Currently supports ChromaDB
 55    rag = RAG(
 56        database="my_collection",
 57        vector_db="chroma",
 58        provider="openai",
 59        vectorizer="text-embedding-ada-002"
 60    )
 61    ```
 62
 63    Add RAG to a model, so that the model can use the RAG automatically to answer questions:
 64    ```python
 65    model = Model(provider="openai", model="gpt-4o-mini")
 66    model._add_rag(RAG(database="my_documents", vector_db="chroma"))
 67    ```
 68
 69    """
 70
 71    def __init__(self, 
 72                database: str,
 73                 provider: Optional[str] = None,
 74                 vectorizer: Optional[str] = None, 
 75                 vector_db: str = "chroma"):
 76        """
 77        Initialize the RAG system.
 78        
 79        Parameters:
 80        -----------
 81        database : str
 82            Name of the vector database/collection to use for storage and retrieval.
 83            This will be created if it doesn't exist.
 84            
 85        provider : str, optional
 86            The embedding provider to use (e.g., "openai", "anthropic", "cohere").
 87            If provided, the corresponding API key will be loaded automatically.
 88            If None, the system will use default embedding settings.
 89            
 90        vectorizer : str, optional
 91            The specific embedding model to use for vectorization.
 92            Examples: "text-embedding-ada-002", "text-embedding-3-small", "embed-english-v3.0"
 93            If None, the provider's default model will be used.
 94            
 95        vector_db : str, default="chroma"
 96            The vector database backend to use. Currently supports:
 97            - "chroma": ChromaDB (default, recommended for most use cases)
 98            
 99        Raises:
100        -------
101        ValueError
102            If an unsupported vector database is specified.
103            
104        Examples:
105        ---------
106        ```python
107        # Minimal initialization
108        rag = RAG("my_documents")
109        
110        # With OpenAI embeddings
111        rag = RAG(
112            database="research_papers",
113            provider="openai",
114            vectorizer="text-embedding-ada-002"
115        )
116        
 117        # With Cohere embeddings
 118        rag = RAG(
 119            database="articles",
 120            provider="cohere",
 121            vectorizer="embed-english-v3.0"
 122        )
123        ```
124        """
125        if provider:
126            load_key(provider)
127            
128        self._vectorizer = vectorizer
129        self._db = database
130        
131        if vector_db == "chroma":
132            self._vector_db = ChromaVectorDB(
133                name=database, 
134                vectorizer_provider=provider, 
135                vectorizer_model=vectorizer
136            )
137        else:
138            raise ValueError(f"Vector database '{vector_db}' not supported. Currently only 'chroma' is supported.")
139        
140
141    def query(self, query: str, k: int = 10) -> Dict[str, Any]:
142        """
143        Perform a semantic search query against the vector database.
144        
145        This method converts the input query into a vector embedding and searches
146        the database for the most semantically similar documents.
147        
148        Parameters:
149        -----------
150        query : str
151            The text query to search for. This will be converted to a vector
152            embedding and used to find similar documents.
153            
154        k : int, default=10
155            The number of most relevant documents to return. Higher values
156            return more results but may include less relevant documents.
157            
158        Returns:
159        --------
160        Dict[str, Any]
161            A dictionary containing the search results with the following structure:
162            {
163                'ids': List[List[str]] - Document IDs of the retrieved documents,
164                'documents': List[List[str]] - The actual document content,
165                'metadatas': List[List[Dict]] - Metadata for each document,
166                'distances': List[List[float]] - Similarity scores (lower = more similar)
167            }
168            
169        Examples:
170        ---------
171        ```python
172        # Basic query
173        results = rag.query("What is artificial intelligence?")
174        
175        # Query with more results
176        results = rag.query("Machine learning algorithms", k=20)
177        
178        # Accessing results
179        for i, (doc_id, document, metadata, distance) in enumerate(zip(
180            results['ids'][0], 
181            results['documents'][0], 
182            results['metadatas'][0], 
183            results['distances'][0]
184        )):
185            print(f"Result {i+1}:")
186            print(f"  ID: {doc_id}")
187            print(f"  Content: {document[:100]}...")
188            print(f"  Similarity: {1 - distance:.3f}")
189            print(f"  Metadata: {metadata}")
190            print()
191        ```
192        
193        Notes:
194        ------
 195        - The query is automatically embedded using the same model as the stored documents
196        - Results are returned in order of relevance (most similar first)
197        - Distance scores are cosine distances (0 = identical, 2 = completely opposite)
198        - If fewer than k documents exist in the database, all available documents are returned
199        """
200        return self._vector_db.query(query, k)

Retrieval-Augmented Generation (RAG) system for semantic search and document retrieval.

This class provides a high-level interface for performing semantic search queries against a vector database. It supports multiple vector database backends and embedding providers for flexible deployment scenarios.

The RAG system works by:

  1. Converting text queries into vector embeddings
  2. Searching the vector database for similar document embeddings
  3. Returning the most relevant documents based on semantic similarity

Attributes:

  • _vectorizer (str): The embedding model used for vectorization
  • _db (str): Name of the vector database
  • _vector_db (ChromaVectorDB): The vector database backend

Examples:

Basic usage with default settings:

# Initialize RAG with a database name
rag = RAG(database="my_documents")

# Perform a semantic search
results = rag.query("What is machine learning?", k=5)

Using with specific embedding provider:

# Initialize with OpenAI embeddings
rag = RAG(
    database="my_documents",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)

# Search for relevant documents
results = rag.query("Explain neural networks", k=10)

Working with different vector databases:

# Currently supports ChromaDB
rag = RAG(
    database="my_collection",
    vector_db="chroma",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)

Add RAG to a model, so that the model can use the RAG automatically to answer questions:

model = Model(provider="openai", model="gpt-4o-mini")
model._add_rag(RAG(database="my_documents", vector_db="chroma"))
RAG( database: str, provider: Optional[str] = None, vectorizer: Optional[str] = None, vector_db: str = 'chroma')

Initialize the RAG system.

Parameters:

database : str
    Name of the vector database/collection to use for storage and retrieval. This will be created if it doesn't exist.

provider : str, optional
    The embedding provider to use (e.g., "openai", "anthropic", "cohere"). If provided, the corresponding API key will be loaded automatically. If None, the system will use default embedding settings.

vectorizer : str, optional
    The specific embedding model to use for vectorization. Examples: "text-embedding-ada-002", "text-embedding-3-small", "embed-english-v3.0". If None, the provider's default model will be used.

vector_db : str, default="chroma"
    The vector database backend to use. Currently supports:
      • "chroma": ChromaDB (default, recommended for most use cases)

Raises:

ValueError
    If an unsupported vector database is specified.

Examples:

# Minimal initialization
rag = RAG("my_documents")

# With OpenAI embeddings
rag = RAG(
    database="research_papers",
    provider="openai",
    vectorizer="text-embedding-ada-002"
)

# With Cohere embeddings
rag = RAG(
    database="articles",
    provider="cohere",
    vectorizer="embed-english-v3.0"
)
def query(self, query: str, k: int = 10) -> Dict[str, Any]:

Perform a semantic search query against the vector database.

This method converts the input query into a vector embedding and searches the database for the most semantically similar documents.

Parameters:

query : str
    The text query to search for. This will be converted to a vector embedding and used to find similar documents.

k : int, default=10
    The number of most relevant documents to return. Higher values return more results but may include less relevant documents.

Returns:

Dict[str, Any]
    A dictionary containing the search results with the following structure:

    {
        'ids': List[List[str]] - Document IDs of the retrieved documents,
        'documents': List[List[str]] - The actual document content,
        'metadatas': List[List[Dict]] - Metadata for each document,
        'distances': List[List[float]] - Similarity scores (lower = more similar)
    }

Examples:

# Basic query
results = rag.query("What is artificial intelligence?")

# Query with more results
results = rag.query("Machine learning algorithms", k=20)

# Accessing results
for i, (doc_id, document, metadata, distance) in enumerate(zip(
    results['ids'][0], 
    results['documents'][0], 
    results['metadatas'][0], 
    results['distances'][0]
)):
    print(f"Result {i+1}:")
    print(f"  ID: {doc_id}")
    print(f"  Content: {document[:100]}...")
    print(f"  Similarity: {1 - distance:.3f}")
    print(f"  Metadata: {metadata}")
    print()

Notes:

  • The query is automatically embedded using the same model as the stored documents
  • Results are returned in order of relevance (most similar first)
  • Distance scores are cosine distances (0 = identical, 2 = completely opposite)
  • If fewer than k documents exist in the database, all available documents are returned
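Given the result structure documented above, here is a short sketch (not part of the library) of one way to flatten the top results into a context string for a downstream prompt. It continues from the rag instance used in the examples above; the question text is a placeholder and only the documented result keys are used.

# Assemble retrieved chunks into a single context block for prompting
results = rag.query("What is machine learning?", k=3)

context = "\n\n".join(
    f"[{doc_id}] {document}"
    for doc_id, document in zip(results["ids"][0], results["documents"][0])
)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What is machine learning?"
)
print(prompt)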
class ChromaVectorDB(monoai.rag.vectordb._BaseVectorDB):
173class ChromaVectorDB(_BaseVectorDB):
174    """
175    ChromaDB implementation of the vector database interface.
176    
177    This class provides a concrete implementation of the vector database
178    using ChromaDB as the backend. ChromaDB is an open-source embedding
179    database that supports persistent storage and efficient similarity search.
180    
181    Features:
182    - Persistent storage of document embeddings
183    - Efficient similarity search with configurable result count
184    - Metadata storage for each document
185    - Automatic collection creation if it doesn't exist
186    - Support for custom embedding models via LiteLLM
187    
188    Attributes:
189        _client (chromadb.PersistentClient): ChromaDB client instance
190        _collection (chromadb.Collection): Active collection for operations
191    
192    Examples:
193    --------
194    Basic usage:
195    
196    ```python
197    # Initialize with a new collection
198    vector_db = ChromaVectorDB(name="my_documents")
199    
200    # Add documents
201    documents = ["Document 1 content", "Document 2 content"]
202    metadatas = [{"source": "file1.txt"}, {"source": "file2.txt"}]
203    ids = ["doc1", "doc2"]
204    
205    vector_db.add(documents, metadatas, ids)
206    
207    # Search for similar documents
208    results = vector_db.query("search query", k=5)
209    ```
210    
211    Using with specific embedding model:
212    
213    ```python
214    # Initialize with OpenAI embeddings
215    vector_db = ChromaVectorDB(
216        name="research_papers",
217        vectorizer_provider="openai",
218        vectorizer_model="text-embedding-ada-002"
219    )
220    ```
221    """
222
223    def __init__(self, name: Optional[str] = None, 
224                 vectorizer_provider: Optional[str] = None, 
225                 vectorizer_model: Optional[str] = None):
226        """
227        Initialize the ChromaDB vector database.
228        
229        Parameters:
230        -----------
231        name : str, optional
232            Name of the ChromaDB collection. If provided, the collection
233            will be created if it doesn't exist, or connected to if it does.
234            
235        vectorizer_provider : str, optional
236            The embedding provider to use for vectorization.
237            Examples: "openai", "anthropic", "cohere"
238            
239        vectorizer_model : str, optional
240            The specific embedding model to use.
241            Examples: "text-embedding-ada-002", "text-embedding-3-small"
242            
243        Examples:
244        ---------
245        ```python
246        # Create new collection
247        vector_db = ChromaVectorDB("my_documents")
248        
249        # Connect to existing collection with custom embeddings
250        vector_db = ChromaVectorDB(
251            name="existing_collection",
252            vectorizer_provider="openai",
253            vectorizer_model="text-embedding-ada-002"
254        )
255        ```
256        """
257        super().__init__(name, vectorizer_provider, vectorizer_model)
258        self._client = chromadb.PersistentClient()
259        if name:
260            try:
261                self._collection = self._client.get_collection(name)
262            except chromadb.errors.NotFoundError:
263                self._collection = self._client.create_collection(name)
264
265    def add(self, documents: List[str], metadatas: List[Dict], ids: List[str]) -> None:
266        """
267        Add documents to the ChromaDB collection.
268        
269        This method adds documents along with their metadata and IDs to the
270        ChromaDB collection. The documents are automatically converted to
271        embeddings using the configured embedding model.
272        
273        Parameters:
274        -----------
275        documents : List[str]
276            List of text documents to add to the database.
277            Each document will be converted to a vector embedding.
278            
279        metadatas : List[Dict]
280            List of metadata dictionaries for each document.
281            Each metadata dict can contain any key-value pairs for
282            document categorization and filtering.
283            
284        ids : List[str]
285            List of unique identifiers for each document.
286            IDs must be unique within the collection.
287            
288        Raises:
289        -------
290        ValueError
291            If the lengths of documents, metadatas, and ids don't match.
292            
293        Examples:
294        ---------
295        ```python
296        # Add documents with metadata
297        documents = [
298            "Machine learning is a subset of artificial intelligence.",
299            "Deep learning uses neural networks with multiple layers."
300        ]
301        
302        metadatas = [
303            {"topic": "machine_learning", "source": "textbook", "year": 2023},
304            {"topic": "deep_learning", "source": "research_paper", "year": 2023}
305        ]
306        
307        ids = ["doc_001", "doc_002"]
308        
309        vector_db.add(documents, metadatas, ids)
310        ```
311        
312        Notes:
313        ------
314        - All three lists must have the same length
315        - IDs must be unique within the collection
316        - Documents are automatically embedded using the configured model
317        - Metadata can be used for filtering during queries
318        """
319        if not (len(documents) == len(metadatas) == len(ids)):
320            raise ValueError("documents, metadatas, and ids must have the same length")
321            
322        self._collection.add(
323            documents=documents,
324            metadatas=metadatas,
325            ids=ids
326        )
327
328    def query(self, query: str, k: int = 10) -> Dict[str, Any]:
329        """
330        Search for similar documents in the ChromaDB collection.
331        
332        This method performs semantic search by converting the query to an
333        embedding and finding the most similar document embeddings in the
334        collection.
335        
336        Parameters:
337        -----------
338        query : str
339            The text query to search for. This will be converted to a
340            vector embedding and compared against stored documents.
341            
342        k : int, default=10
343            Number of most similar documents to return. Higher values
344            return more results but may include less relevant documents.
345            
346        Returns:
347        --------
348        Dict[str, Any]
349            A dictionary containing search results with the following structure:
350            {
351                'ids': List[List[str]] - Document IDs of retrieved documents,
352                'documents': List[List[str]] - The actual document content,
353                'metadatas': List[List[Dict]] - Metadata for each document,
354                'distances': List[List[float]] - Similarity scores (lower = more similar)
355            }
356            
357        Examples:
358        ---------
359        ```python
360        # Basic search
361        results = vector_db.query("What is machine learning?", k=5)
362        
363        # Access results
364        for i, (doc_id, document, metadata, distance) in enumerate(zip(
365            results['ids'][0], 
366            results['documents'][0], 
367            results['metadatas'][0], 
368            results['distances'][0]
369        )):
370            print(f"Result {i+1}:")
371            print(f"  ID: {doc_id}")
372            print(f"  Content: {document[:100]}...")
373            print(f"  Similarity: {1 - distance:.3f}")
374            print(f"  Metadata: {metadata}")
375        ```
376        
377        Notes:
378        ------
379        - Results are returned in order of similarity (most similar first)
380        - Distance scores are cosine distances (0 = identical, 2 = opposite)
381        - If fewer than k documents exist, all available documents are returned
382        - The query is automatically embedded using the same model as stored documents
383        """
384        results = self._collection.query(
385            query_texts=query,
386            n_results=k
387        )
388        return results

ChromaDB implementation of the vector database interface.

This class provides a concrete implementation of the vector database using ChromaDB as the backend. ChromaDB is an open-source embedding database that supports persistent storage and efficient similarity search.

Features:

  • Persistent storage of document embeddings
  • Efficient similarity search with configurable result count
  • Metadata storage for each document
  • Automatic collection creation if it doesn't exist
  • Support for custom embedding models via LiteLLM

Attributes:

  • _client (chromadb.PersistentClient): ChromaDB client instance
  • _collection (chromadb.Collection): Active collection for operations

Examples:

Basic usage:

# Initialize with a new collection
vector_db = ChromaVectorDB(name="my_documents")

# Add documents
documents = ["Document 1 content", "Document 2 content"]
metadatas = [{"source": "file1.txt"}, {"source": "file2.txt"}]
ids = ["doc1", "doc2"]

vector_db.add(documents, metadatas, ids)

# Search for similar documents
results = vector_db.query("search query", k=5)

Using with specific embedding model:

# Initialize with OpenAI embeddings
vector_db = ChromaVectorDB(
    name="research_papers",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-ada-002"
)
ChromaVectorDB( name: Optional[str] = None, vectorizer_provider: Optional[str] = None, vectorizer_model: Optional[str] = None)

Initialize the ChromaDB vector database.

Parameters:

name : str, optional
    Name of the ChromaDB collection. If provided, the collection will be created if it doesn't exist, or connected to if it does.

vectorizer_provider : str, optional
    The embedding provider to use for vectorization. Examples: "openai", "anthropic", "cohere"

vectorizer_model : str, optional
    The specific embedding model to use. Examples: "text-embedding-ada-002", "text-embedding-3-small"

Examples:

# Create new collection
vector_db = ChromaVectorDB("my_documents")

# Connect to existing collection with custom embeddings
vector_db = ChromaVectorDB(
    name="existing_collection",
    vectorizer_provider="openai",
    vectorizer_model="text-embedding-ada-002"
)
def add( self, documents: List[str], metadatas: List[Dict], ids: List[str]) -> None:

Add documents to the ChromaDB collection.

This method adds documents along with their metadata and IDs to the ChromaDB collection. The documents are automatically converted to embeddings using the configured embedding model.

Parameters:

documents : List[str]
    List of text documents to add to the database. Each document will be converted to a vector embedding.

metadatas : List[Dict]
    List of metadata dictionaries for each document. Each metadata dict can contain any key-value pairs for document categorization and filtering.

ids : List[str]
    List of unique identifiers for each document. IDs must be unique within the collection.

Raises:

ValueError
    If the lengths of documents, metadatas, and ids don't match.

Examples:

# Add documents with metadata
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers."
]

metadatas = [
    {"topic": "machine_learning", "source": "textbook", "year": 2023},
    {"topic": "deep_learning", "source": "research_paper", "year": 2023}
]

ids = ["doc_001", "doc_002"]

vector_db.add(documents, metadatas, ids)

Notes:

  • All three lists must have the same length
  • IDs must be unique within the collection
  • Documents are automatically embedded using the configured model
  • Metadata can be used for filtering during queries
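Because add() expects parallel lists of documents, metadatas, and ids, the output of DocumentsBuilder (documented later on this page) can be passed straight through. A small sketch, continuing from the vector_db instance above and assuming a local text file whose path is a placeholder:

# Chunk a text file and store the chunks in the collection
builder = DocumentsBuilder()
documents, metadatas, ids = builder.from_file("notes.txt")  # placeholder path

vector_db.add(documents, metadatas, ids)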
def query(self, query: str, k: int = 10) -> Dict[str, Any]:

Search for similar documents in the ChromaDB collection.

This method performs semantic search by converting the query to an embedding and finding the most similar document embeddings in the collection.

Parameters:

query : str
    The text query to search for. This will be converted to a vector embedding and compared against stored documents.

k : int, default=10
    Number of most similar documents to return. Higher values return more results but may include less relevant documents.

Returns:

Dict[str, Any]
    A dictionary containing search results with the following structure:

    {
        'ids': List[List[str]] - Document IDs of retrieved documents,
        'documents': List[List[str]] - The actual document content,
        'metadatas': List[List[Dict]] - Metadata for each document,
        'distances': List[List[float]] - Similarity scores (lower = more similar)
    }

Examples:

# Basic search
results = vector_db.query("What is machine learning?", k=5)

# Access results
for i, (doc_id, document, metadata, distance) in enumerate(zip(
    results['ids'][0], 
    results['documents'][0], 
    results['metadatas'][0], 
    results['distances'][0]
)):
    print(f"Result {i+1}:")
    print(f"  ID: {doc_id}")
    print(f"  Content: {document[:100]}...")
    print(f"  Similarity: {1 - distance:.3f}")
    print(f"  Metadata: {metadata}")

Notes:

  • Results are returned in order of similarity (most similar first)
  • Distance scores are cosine distances (0 = identical, 2 = opposite)
  • If fewer than k documents exist, all available documents are returned
  • The query is automatically embedded using the same model as stored documents
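Building on the distance semantics noted above, a small sketch (not part of the library) of filtering results by a minimum cosine similarity, continuing from the vector_db instance above; the threshold value is arbitrary and should be tuned per collection.

# Keep only results whose similarity (1 - cosine distance) clears a threshold
results = vector_db.query("What is machine learning?", k=10)

MIN_SIMILARITY = 0.75  # arbitrary cut-off
kept = [
    (doc_id, document)
    for doc_id, document, distance in zip(
        results["ids"][0], results["documents"][0], results["distances"][0]
    )
    if 1 - distance >= MIN_SIMILARITY
]
print(f"{len(kept)} of {len(results['ids'][0])} results passed the threshold")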
class DocumentsBuilder:
  25class DocumentsBuilder:
  26    """
  27    A utility class for building document collections from various sources.
  28    
  29    This class provides methods to extract text content from files and web pages,
  30    split the content into manageable chunks with configurable size and overlap,
  31    and prepare the data for storage in vector databases.
  32    
  33    The DocumentsBuilder is designed to work seamlessly with the RAG system,
  34    producing output that can be directly used with vector database operations.
  35    
  36    Features:
  37    - File-based document extraction with UTF-8 encoding support
  38    - Text string processing for in-memory content
  39    - Web scraping with multiple engine options (requests, tavily, selenium)
  40    - Word document extraction (.doc and .docx formats)
  41    - PDF document extraction with metadata
  42    - Multiple chunking strategies (word, sentence, paragraph, fixed, semantic)
  43    - Configurable chunk size and overlap parameters
  44    - Rich metadata generation for each document chunk
  45    - Unique ID generation for database storage
  46    
  47    Attributes:
  48        _chunk_strategy (str): The chunking strategy to use
  49        _chunk_size (int): Maximum size of each text chunk in characters
  50        _chunk_overlap (int): Number of characters to overlap between chunks
  51    
  52    Examples:
  53    --------
  54    Basic usage with file processing:
  55    
  56    ```python
  57    # Initialize with default chunk settings (word-based)
  58    builder = DocumentsBuilder()
  59    
  60    # Process a text file
  61    documents, metadatas, ids = builder.from_file("document.txt")
  62    
  63    # Add to vector database
  64    vector_db.add(documents, metadatas, ids)
  65    ```
  66    
  67    Text string processing:
  68    
  69    ```python
  70    # Process text strings directly
  71    text_content = "This is a long text that needs to be processed..."
  72    documents, metadatas, ids = builder.from_str(text_content)
  73    
  74    # Process with custom source name
  75    documents, metadatas, ids = builder.from_str(
  76        text_content,
  77        source_name="user_input"
  78    )
  79    ```
  80    
  81    Different chunking strategies:
  82    
  83    ```python
  84    # Default settings (word-based chunking)
  85    builder = DocumentsBuilder()
  86    
  87    # Sentence-based chunking (5 sentences per chunk)
  88    builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)
  89    
  90    # Paragraph-based chunking (3 paragraphs per chunk)
  91    builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)
  92    
  93    # Fixed-size chunks (800 characters per chunk)
  94    builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)
  95    
  96    # Word-based chunks (50 words per chunk)
  97    builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)
  98    ```
  99    
 100    Web scraping with different engines:
 101    
 102    ```python
 103    # Basic web scraping
 104    documents, metadatas, ids = builder.from_url("https://example.com")
 105    
 106    # Advanced scraping with Tavily
 107    documents, metadatas, ids = builder.from_url(
 108        "https://example.com",
 109        engine="tavily",
 110        deep=True
 111    )
 112    
 113    # JavaScript-heavy sites with Selenium
 114    documents, metadatas, ids = builder.from_url(
 115        "https://spa-example.com",
 116        engine="selenium"
 117    )
 118    ```
 119    
 120    Word document processing:
 121    
 122    ```python
 123    # Process Word documents
 124    documents, metadatas, ids = builder.from_doc("document.docx")
 125    documents, metadatas, ids = builder.from_doc("document.doc")
 126    
 127    # Process with custom extraction method
 128    documents, metadatas, ids = builder.from_doc(
 129        "document.docx",
 130        extraction_method="docx2txt"
 131    )
 132    ```
 133    
 134    PDF document processing:
 135    
 136    ```python
 137    # Process PDF documents
 138    documents, metadatas, ids = builder.from_pdf("document.pdf")
 139    
 140    # Process with page range
 141    documents, metadatas, ids = builder.from_pdf(
 142        "document.pdf",
 143        page_range=(1, 10)  # Extract pages 1-10
 144    )
 145    ```
 146    
 147    Notes:
 148    ------
 149    - chunk_overlap should typically be 10-20% of chunk_size
 150    - chunk_overlap must be less than chunk_size to prevent infinite loops
 151    - Different strategies interpret chunk_size differently:
 152      * word: chunk_size = number of words per chunk
 153      * sentence: chunk_size = number of sentences per chunk
 154      * paragraph: chunk_size = number of paragraphs per chunk
 155      * fixed: chunk_size = number of characters per chunk
 156      * semantic: chunk_size = number of characters per chunk
 157    - Very small chunks may lose context
 158    - Very large chunks may be less focused for retrieval
 159    - Fixed and semantic strategies always produce chunks of exactly chunk_size (except the last one)
 160    - Word document processing requires the python-docx and docx2txt packages
 161    - PDF processing requires PyPDF2 package
 162    """
 163
 164    def __init__(self, chunk_strategy: str = "word", chunk_size: int = 1000, chunk_overlap: int = 0, custom_split_func: Optional[callable] = None):
 165        """
 166        Initialize the DocumentsBuilder with chunking parameters.
 167        
 168        Parameters:
 169        -----------
 170        chunk_strategy : str, default="word"
 171            The strategy to use for text chunking:
 172            - "word": Break at word boundaries (spaces and newlines) when possible
 173            - "sentence": Break at sentence boundaries (periods, exclamation marks, question marks)
 174            - "paragraph": Break at paragraph boundaries (double newlines)
 175            - "fixed": Break at exact character count without considering boundaries
 176            - "semantic": Break at semantic boundaries (headers, sections, etc.)
 177            - "custom": Use the provided custom_split_func for chunking
 178            
 179        chunk_size : int, default=1000
 180            The size limit for each chunk, interpreted differently based on strategy:
 181            - "word": Maximum number of words per chunk
 182            - "sentence": Maximum number of sentences per chunk  
 183            - "paragraph": Maximum number of paragraphs per chunk
 184            - "fixed": Maximum number of characters per chunk
 185            - "semantic": Maximum number of characters per chunk
 186            - "custom": Passed to custom_split_func as a parameter
 187            
 188        chunk_overlap : int, default=0
 189            The overlap between consecutive chunks, interpreted based on strategy:
 190            - "word": Number of words to overlap
 191            - "sentence": Number of sentences to overlap
 192            - "paragraph": Number of paragraphs to overlap
 193            - "fixed": Number of characters to overlap
 194            - "semantic": Number of characters to overlap
 195            - "custom": Passed to custom_split_func as a parameter
 196            
 197        custom_split_func : callable, optional
 198            Custom function to use for text splitting. If provided, automatically sets chunk_strategy to "custom"
 199            regardless of the chunk_strategy parameter value.
 200            The function should have the signature: func(text: str, chunk_size: int, chunk_overlap: int) -> List[str]
 201            and return a list of text chunks.
 202            
 203        Raises:
 204        -------
 205        ValueError
 206            If chunk_overlap >= chunk_size (would cause infinite loops)
 207            If chunk_size <= 0
 208            If chunk_overlap < 0
 209            If chunk_strategy="custom" but no custom_split_func is provided
 210            
 211        Examples:
 212        ---------
 213        ```python
 214        # Default settings (word-based chunking)
 215        builder = DocumentsBuilder()
 216        
 217        # Sentence-based chunking (5 sentences per chunk)
 218        builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)
 219        
 220        # Paragraph-based chunking (3 paragraphs per chunk)
 221        builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)
 222        
 223        # Fixed-size chunks (800 characters per chunk)
 224        builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)
 225        
 226        # Word-based chunks (50 words per chunk)
 227        builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)
 228        
 229        # Custom chunking function
 230        def my_custom_split(text, chunk_size, chunk_overlap):
 231            # Split by lines and then by chunk_size
 232            lines = text.split('\n')
 233            chunks = []
 234            for i in range(0, len(lines), chunk_size - chunk_overlap):
 235                chunk_lines = lines[i:i + chunk_size]
 236                chunks.append('\n'.join(chunk_lines))
 237            return chunks
 238        
 239        # Strategy automatically set to "custom" when custom_split_func is provided
 240        builder = DocumentsBuilder(
 241            chunk_size=100,
 242            chunk_overlap=10,
 243            custom_split_func=my_custom_split
 244        )
 245        
 246        # Or explicitly set strategy (will be overridden to "custom")
 247        builder = DocumentsBuilder(
 248            chunk_strategy="word",  # This will be ignored
 249            chunk_size=100,
 250            chunk_overlap=10,
 251            custom_split_func=my_custom_split  # Strategy becomes "custom"
 252        )
 253        ```
 254        
 255        Notes:
 256        ------
 257        - chunk_overlap should typically be 10-20% of chunk_size
 258        - chunk_overlap must be less than chunk_size to prevent infinite loops
 259        - Different strategies interpret chunk_size differently:
 260          * word: chunk_size = number of words per chunk
 261          * sentence: chunk_size = number of sentences per chunk
 262          * paragraph: chunk_size = number of paragraphs per chunk
 263          * fixed: chunk_size = number of characters per chunk
 264          * semantic: chunk_size = number of characters per chunk
 265          * custom: chunk_size is passed to custom_split_func
 266        - Very small chunks may lose context
 267        - Very large chunks may be less focused for retrieval
 268        - Fixed and semantic strategies always produce chunks of exactly chunk_size (except the last one)
 269        - Custom functions should handle their own overlap logic
 270        - Custom functions can implement any splitting logic:
 271          * Split by specific delimiters (e.g., "---", "###")
 272          * Split by regex patterns
 273          * Split by semantic boundaries using NLP libraries
 274          * Split by document structure (headers, sections, etc.)
 275          * Combine multiple strategies
 276        - When custom_split_func is provided, chunk_strategy is automatically set to "custom"
 277          regardless of the chunk_strategy parameter value
 278        """
 279        # If custom_split_func is provided, automatically set strategy to "custom"
 280        if custom_split_func is not None:
 281            chunk_strategy = "custom"
 282        
 283        self._chunk_strategy = chunk_strategy
 284        self._chunk_size = chunk_size
 285        self._chunk_overlap = chunk_overlap
 286        self._custom_split_func = custom_split_func
 287        
 288        # Validate parameters to prevent infinite loops
 289        if chunk_overlap >= chunk_size:
 290            raise ValueError(
 291                f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size}) "
 292                "to prevent infinite loops. Recommended: chunk_overlap should be 10-20% of chunk_size."
 293            )
 294        
 295        if chunk_size <= 0:
 296            raise ValueError(f"chunk_size must be positive, got {chunk_size}")
 297        
 298        if chunk_overlap < 0:
 299            raise ValueError(f"chunk_overlap must be non-negative, got {chunk_overlap}")
 300        
 301        # Validate custom split function
 302        if chunk_strategy == "custom" and custom_split_func is None:
 303            raise ValueError("custom_split_func must be provided when chunk_strategy='custom'")
 304        
 305        if custom_split_func is not None and not callable(custom_split_func):
 306            raise ValueError("custom_split_func must be callable")
 307
 308    def from_file(self, file_path: str) -> Tuple[List[str], List[Dict], List[str]]:
 309        """
 310        Read a file and split it into chunks with specified size and overlap.
 311        
 312        This method reads a text file from the filesystem, splits its content
 313        into chunks according to the configured parameters, and generates
 314        metadata and unique IDs for each chunk.
 315        
 316        Parameters:
 317        -----------
 318        file_path : str
 319            Path to the text file to read. The file must exist and be
 320            readable. UTF-8 encoding is assumed.
 321            
 322        Returns:
 323        --------
 324        Tuple[List[str], List[Dict], List[str]]
 325            A tuple containing:
 326            - List of document chunks (strings): The text content split into chunks
 327            - List of metadata dictionaries: Metadata for each chunk including
 328              file information and chunk details
 329            - List of unique IDs: UUID strings for each chunk
 330            
 331        Raises:
 332        -------
 333        FileNotFoundError
 334            If the specified file does not exist or is not accessible.
 335            
 336        UnicodeDecodeError
 337            If the file cannot be decoded as UTF-8.
 338            
 339        Examples:
 340        ---------
 341        ```python
 342        # Process a single file
 343        documents, metadatas, ids = builder.from_file("article.txt")
 344        
 345        # Access metadata information
 346        for i, metadata in enumerate(metadatas):
 347            print(f"Chunk {i+1}:")
 348            print(f"  File: {metadata['file_name']}")
 349            print(f"  Size: {metadata['chunk_size']} characters")
 350            print(f"  Position: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")
 351        ```
 352        
 353        Notes:
 354        ------
 355        - File is read entirely into memory before processing
 356        - Empty files will return empty lists
 357        - File path is stored in metadata for traceability
 358        - Chunk indexing starts at 0
 359        """
 360        if not os.path.exists(file_path):
 361            raise FileNotFoundError(f"File not found: {file_path}")
 362        
 363        # Read the file content
 364        with open(file_path, 'r', encoding='utf-8') as file:
 365            text = file.read()
 366        
 367        # Split text into chunks
 368        chunks = self._split_text(text)
 369        
 370        # Generate metadata and IDs for each chunk
 371        documents = []
 372        metadatas = []
 373        ids = []
 374        
 375        for i, chunk in enumerate(chunks):
 376            # Generate unique ID
 377            chunk_id = str(uuid.uuid4())
 378            
 379            # Create metadata
 380            metadata = {
 381                'file_path': file_path,
 382                'file_name': os.path.basename(file_path),
 383                'chunk_index': i,
 384                'total_chunks': len(chunks),
 385                'chunk_size': len(chunk)
 386            }
 387            
 388            documents.append(chunk)
 389            metadatas.append(metadata)
 390            ids.append(chunk_id)
 391        
 392        return documents, metadatas, ids
 393
 394    def from_str(self, text: str, source_name: str = "text_string") -> Tuple[List[str], List[Dict], List[str]]:
 395        """
 396        Process a text string and split it into chunks with specified size and overlap.
 397        
 398        This method takes a text string directly and processes it using the same
 399        chunking logic as file processing. It's useful when you already have
 400        text content in memory and want to prepare it for vector database storage.
 401        
 402        Parameters:
 403        -----------
 404        text : str
 405            The text content to process and split into chunks.
 406            
 407        source_name : str, default="text_string"
 408            A descriptive name for the text source. This will be included
 409            in the metadata for traceability and identification.
 410            
 411        Returns:
 412        --------
 413        Tuple[List[str], List[Dict], List[str]]
 414            A tuple containing:
 415            - List of document chunks (strings): The text content split into chunks
 416            - List of metadata dictionaries: Metadata for each chunk including
 417              source information and chunk details
 418            - List of unique IDs: UUID strings for each chunk
 419            
 420        Examples:
 421        ---------
 422        ```python
 423        # Process a simple text string
 424        text_content = "This is a long text that needs to be processed..."
 425        documents, metadatas, ids = builder.from_str(text_content)
 426        
 427        # Process with custom source name
 428        documents, metadatas, ids = builder.from_str(
 429            text_content,
 430            source_name="user_input"
 431        )
 432        
 433        # Process multiple text strings
 434        text_parts = [
 435            "First part of the document...",
 436            "Second part of the document...",
 437            "Third part of the document..."
 438        ]
 439        
 440        all_documents = []
 441        all_metadatas = []
 442        all_ids = []
 443        
 444        for i, text_part in enumerate(text_parts):
 445            documents, metadatas, ids = builder.from_str(
 446                text_part,
 447                source_name=f"document_part_{i+1}"
 448            )
 449            all_documents.extend(documents)
 450            all_metadatas.extend(metadatas)
 451            all_ids.extend(ids)
 452        ```
 453        
 454        Notes:
 455        ------
 456        - Uses the same chunking strategy and parameters as other methods
 457        - Empty strings will return empty lists
 458        - Source name is stored in metadata for identification
 459        - Useful for processing text from APIs, user input, or generated content
 460        """
 461        if not text or not text.strip():
 462            return [], [], []
 463        
 464        # Split text into chunks
 465        chunks = self._split_text(text)
 466        
 467        # Generate metadata and IDs for each chunk
 468        documents = []
 469        metadatas = []
 470        ids = []
 471        
 472        for i, chunk in enumerate(chunks):
 473            # Generate unique ID
 474            chunk_id = str(uuid.uuid4())
 475            
 476            # Create metadata
 477            metadata = {
 478                'source_type': 'text_string',
 479                'source_name': source_name,
 480                'chunk_index': i,
 481                'total_chunks': len(chunks),
 482                'chunk_size': len(chunk),
 483                'chunk_strategy': self._chunk_strategy
 484            }
 485            
 486            documents.append(chunk)
 487            metadatas.append(metadata)
 488            ids.append(chunk_id)
 489        
 490        return documents, metadatas, ids
 491
 492    def from_doc(self, file_path: str, extraction_method: str = "auto") -> Tuple[List[str], List[Dict], List[str]]:
 493        """
 494        Extract text from Word documents (.doc and .docx files) and split into chunks.
 495        
 496        This method supports both .doc and .docx formats using different extraction
 497        methods. For .docx files, it can use either python-docx or docx2txt libraries.
 498        For .doc files, it uses docx2txt which can handle the older format.
 499        
 500        Parameters:
 501        -----------
 502        file_path : str
 503            Path to the Word document (.doc or .docx file). The file must exist
 504            and be readable.
 505            
 506        extraction_method : str, default="auto"
 507            The method to use for text extraction:
 508            - "auto": Automatically choose the best method based on file extension
 509            - "docx": Use python-docx library (only for .docx files)
 510            - "docx2txt": Use docx2txt library (works for both .doc and .docx)
 511            
 512        Returns:
 513        --------
 514        Tuple[List[str], List[Dict], List[str]]
 515            A tuple containing:
 516            - List of document chunks (strings): The extracted text split into chunks
 517            - List of metadata dictionaries: Metadata for each chunk including
 518              file information, document properties, and chunk details
 519            - List of unique IDs: UUID strings for each chunk
 520            
 521        Raises:
 522        -------
 523        FileNotFoundError
 524            If the specified file does not exist or is not accessible.
 525            
 526        ValueError
 527            If the file is not a supported Word document format or if the
 528            required extraction method is not available.
 529            
 530        ImportError
 531            If the required libraries for the chosen extraction method are not installed.
 532            
 533        Examples:
 534        ---------
 535        ```python
 536        # Process a .docx file with automatic method selection
 537        documents, metadatas, ids = builder.from_doc("document.docx")
 538        
 539        # Process a .doc file
 540        documents, metadatas, ids = builder.from_doc("document.doc")
 541        
 542        # Force specific extraction method
 543        documents, metadatas, ids = builder.from_doc(
 544            "document.docx",
 545            extraction_method="docx2txt"
 546        )
 547        
 548        # Access document metadata
 549        for metadata in metadatas:
 550            print(f"File: {metadata['file_name']}")
 551            print(f"Format: {metadata['document_format']}")
 552            print(f"Extraction method: {metadata['extraction_method']}")
 553        ```
 554        
 555        Notes:
 556        ------
 557        - .docx files are the modern Word format (Office 2007+)
 558        - .doc files are the legacy Word format (Office 97-2003)
 559        - python-docx provides better structure preservation for .docx files
 560        - docx2txt works with both formats but may lose some formatting
 561        - Document properties (title, author, etc.) are extracted when available
 562        - Images and complex formatting are not preserved in the extracted text
 563        """
 564        if not os.path.exists(file_path):
 565            raise FileNotFoundError(f"File not found: {file_path}")
 566        
 567        # Determine file extension and validate
 568        file_extension = os.path.splitext(file_path)[1].lower()
 569        if file_extension not in ['.doc', '.docx']:
 570            raise ValueError(f"Unsupported file format: {file_extension}. Only .doc and .docx files are supported.")
 571        
 572        # Determine extraction method
 573        if extraction_method == "auto":
 574            if file_extension == '.docx' and DOCX_AVAILABLE:
 575                extraction_method = "docx"
 576            else:
 577                extraction_method = "docx2txt"
 578        
 579        # Extract text based on method
 580        if extraction_method == "docx":
 581            if not DOCX_AVAILABLE:
 582                raise ImportError("python-docx library is required for 'docx' extraction method. Install with: pip install python-docx")
 583            if file_extension != '.docx':
 584                raise ValueError("'docx' extraction method only supports .docx files")
 585            text, doc_properties = self._extract_with_docx(file_path)
 586        elif extraction_method == "docx2txt":
 587            if not DOCX2TXT_AVAILABLE:
 588            raise ImportError("docx2txt library is required for 'docx2txt' extraction method. Install with: pip install docx2txt")
 589            text, doc_properties = self._extract_with_docx2txt(file_path)
 590        else:
 591            raise ValueError(f"Unsupported extraction method: {extraction_method}")
 592        
 593        # Split text into chunks
 594        chunks = self._split_text(text)
 595        
 596        # Generate metadata and IDs for each chunk
 597        documents = []
 598        metadatas = []
 599        ids = []
 600        
 601        for i, chunk in enumerate(chunks):
 602            # Generate unique ID
 603            chunk_id = str(uuid.uuid4())
 604            
 605            # Create metadata
 606            metadata = {
 607                'file_path': file_path,
 608                'file_name': os.path.basename(file_path),
 609                'document_format': file_extension[1:],  # Remove the dot
 610                'extraction_method': extraction_method,
 611                'chunk_index': i,
 612                'total_chunks': len(chunks),
 613                'chunk_size': len(chunk)
 614            }
 615            
 616            # Add document properties if available
 617            if doc_properties:
 618                metadata.update(doc_properties)
 619            
 620            documents.append(chunk)
 621            metadatas.append(metadata)
 622            ids.append(chunk_id)
 623        
 624        return documents, metadatas, ids
 625
 626    def from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[List[str], List[Dict], List[str]]:
 627        """
 628        Extract text from PDF documents and split into chunks.
 629        
 630        This method extracts text content from PDF files using PyPDF2 library.
 631        It supports extracting all pages or a specific range of pages, and
 632        preserves page information in the metadata.
 633        
 634        Parameters:
 635        -----------
 636        file_path : str
 637            Path to the PDF file. The file must exist and be readable.
 638            
 639        page_range : Tuple[int, int], optional
 640            Range of pages to extract (start_page, end_page), where pages are
 641            1-indexed. If None, all pages are extracted.
 642            Example: (1, 5) extracts pages 1 through 5.
 643            
 644        Returns:
 645        --------
 646        Tuple[List[str], List[Dict], List[str]]
 647            A tuple containing:
 648            - List of document chunks (strings): The extracted text split into chunks
 649            - List of metadata dictionaries: Metadata for each chunk including
 650              file information, PDF properties, page information, and chunk details
 651            - List of unique IDs: UUID strings for each chunk
 652            
 653        Raises:
 654        -------
 655        FileNotFoundError
 656            If the specified file does not exist or is not accessible.
 657            
 658        ValueError
 659            If the file is not a valid PDF or if the page range is invalid.
 660            
 661        ImportError
 662            If PyPDF2 library is not installed.
 663            
 664        Examples:
 665        ---------
 666        ```python
 667        # Process entire PDF
 668        documents, metadatas, ids = builder.from_pdf("document.pdf")
 669        
 670        # Process specific page range
 671        documents, metadatas, ids = builder.from_pdf(
 672            "document.pdf",
 673            page_range=(1, 10)  # Pages 1-10
 674        )
 675        
 676        # Process single page
 677        documents, metadatas, ids = builder.from_pdf(
 678            "document.pdf",
 679            page_range=(5, 5)  # Only page 5
 680        )
 681        
 682        # Access PDF metadata
 683        for metadata in metadatas:
 684            print(f"File: {metadata['file_name']}")
 685            print(f"Page: {metadata.get('page_number', 'N/A')}")
 686            print(f"Total pages: {metadata.get('total_pages', 'N/A')}")
 687            print(f"PDF title: {metadata.get('pdf_title', 'N/A')}")
 688        ```
 689        
 690        Notes:
 691        ------
 692        - Page numbers are 1-indexed (first page is page 1)
 693        - Text extraction quality depends on the PDF structure
 694        - Scanned PDFs may not extract text properly
 695        - PDF metadata (title, author, etc.) is extracted when available
 696        - Page information is preserved in chunk metadata
 697        - Images and complex formatting are not preserved
 698        """
 699        if not PDF_AVAILABLE:
 700            raise ImportError("PyPDF2 library is required for PDF processing. Install with: pip install PyPDF2")
 701        
 702        if not os.path.exists(file_path):
 703            raise FileNotFoundError(f"File not found: {file_path}")
 704        
 705        # Validate file extension
 706        file_extension = os.path.splitext(file_path)[1].lower()
 707        if file_extension != '.pdf':
 708            raise ValueError(f"Unsupported file format: {file_extension}. Only .pdf files are supported.")
 709        
 710        # Extract text and metadata from PDF
 711        text, pdf_properties, page_info = self._extract_from_pdf(file_path, page_range)
 712        
 713        # Split text into chunks
 714        chunks = self._split_text(text)
 715        
 716        # Generate metadata and IDs for each chunk
 717        documents = []
 718        metadatas = []
 719        ids = []
 720        
 721        for i, chunk in enumerate(chunks):
 722            # Generate unique ID
 723            chunk_id = str(uuid.uuid4())
 724            
 725            # Create metadata
 726            metadata = {
 727                'file_path': file_path,
 728                'file_name': os.path.basename(file_path),
 729                'document_format': 'pdf',
 730                'chunk_index': i,
 731                'total_chunks': len(chunks),
 732                'chunk_size': len(chunk)
 733            }
 734            
 735            # Add PDF properties if available
 736            if pdf_properties:
 737                metadata.update(pdf_properties)
 738            
 739            # Add page information if available
 740            if page_info:
 741                metadata.update(page_info)
 742            
 743            documents.append(chunk)
 744            metadatas.append(metadata)
 745            ids.append(chunk_id)
 746        
 747        return documents, metadatas, ids
 748
 749    def from_url(self, url: str, engine: str = "requests", deep: bool = False) -> Tuple[List[str], List[Dict], List[str]]:
 750        """
 751        Scrape content from a URL and split it into chunks with specified size and overlap.
 752        
 753        This method uses web scraping to extract text content from a webpage,
 754        then processes the content using the same chunking logic as file processing.
 755        Multiple scraping engines are supported for different types of websites.
 756        
 757        Parameters:
 758        -----------
 759        url : str
 760            The URL to scrape. Must be a valid HTTP/HTTPS URL.
 761            
 762        engine : str, default="requests"
 763            The web scraping engine to use:
 764            - "requests": Simple HTTP requests (fast, good for static content)
 765            - "tavily": Advanced web scraping with better content extraction
 766            - "selenium": Full browser automation (good for JavaScript-heavy sites)
 767            
 768        deep : bool, default=False
 769            If using the "tavily" engine, whether to use advanced extraction mode.
 770            Deep extraction provides better content quality but is slower.
 771            
 772        Returns:
 773        --------
 774        Tuple[List[str], List[Dict], List[str]]
 775            A tuple containing:
 776            - List of document chunks (strings): The scraped text split into chunks
 777            - List of metadata dictionaries: Metadata for each chunk including
 778              URL information and scraping details
 779            - List of unique IDs: UUID strings for each chunk
 780            
 781        Raises:
 782        -------
 783        ValueError
 784            If the scraping fails or no text content is extracted.
 785            
 786        Examples:
 787        ---------
 788        ```python
 789        # Basic web scraping
 790        documents, metadatas, ids = builder.from_url("https://example.com")
 791        
 792        # Advanced scraping with Tavily
 793        documents, metadatas, ids = builder.from_url(
 794            "https://blog.example.com",
 795            engine="tavily",
 796            deep=True
 797        )
 798        
 799        # JavaScript-heavy site with Selenium
 800        documents, metadatas, ids = builder.from_url(
 801            "https://spa.example.com",
 802            engine="selenium"
 803        )
 804        
 805        # Access scraping metadata
 806        for metadata in metadatas:
 807            print(f"Source: {metadata['url']}")
 808            print(f"Engine: {metadata['scraping_engine']}")
 809            print(f"Deep extraction: {metadata['deep_extraction']}")
 810        ```
 811        
 812        Notes:
 813        ------
 814        - Scraping may take time depending on the engine and website complexity
 815        - Some websites may block automated scraping
 816        - Selenium requires Chrome/Chromium to be installed
 817        - Tavily requires an API key to be configured
 818        """
 819        # Initialize WebScraping with specified engine
 820        scraper = WebScraping(engine=engine, deep=deep)
 821        
 822        # Scrape the URL
 823        result = scraper.scrape(url)
 824        
 825        if not result or not result.get("text"):
 826            raise ValueError(f"Failed to extract text content from URL: {url}")
 827        
 828        text = result["text"]
 829        
 830        # Split text into chunks
 831        chunks = self._split_text(text)
 832        
 833        # Generate metadata and IDs for each chunk
 834        documents = []
 835        metadatas = []
 836        ids = []
 837        
 838        for i, chunk in enumerate(chunks):
 839            # Generate unique ID
 840            chunk_id = str(uuid.uuid4())
 841            
 842            # Create metadata
 843            metadata = {
 844                'url': url,
 845                'source_type': 'web_page',
 846                'scraping_engine': engine,
 847                'deep_extraction': deep,
 848                'chunk_index': i,
 849                'total_chunks': len(chunks),
 850                'chunk_size': len(chunk)
 851            }
 852            
 853            documents.append(chunk)
 854            metadatas.append(metadata)
 855            ids.append(chunk_id)
 856        
 857        return documents, metadatas, ids
 858    
 859    def _extract_with_docx(self, file_path: str) -> Tuple[str, Dict]:
 860        """
 861        Extract text from a .docx file using python-docx library.
 862        
 863        Parameters:
 864        -----------
 865        file_path : str
 866            Path to the .docx file
 867            
 868        Returns:
 869        --------
 870        Tuple[str, Dict]
 871            A tuple containing the extracted text and document properties
 872        """
 873        doc = Document(file_path)
 874        
 875        # Extract text from paragraphs
 876        text_parts = []
 877        for paragraph in doc.paragraphs:
 878            if paragraph.text.strip():
 879                text_parts.append(paragraph.text)
 880        
 881        # Extract text from tables
 882        for table in doc.tables:
 883            for row in table.rows:
 884                row_text = []
 885                for cell in row.cells:
 886                    if cell.text.strip():
 887                        row_text.append(cell.text.strip())
 888                if row_text:
 889                    text_parts.append(" | ".join(row_text))
 890        
 891        text = "\n\n".join(text_parts)
 892        
 893        # Extract document properties
 894        properties = {}
 895        core_props = doc.core_properties
 896        if core_props.title:
 897            properties['document_title'] = core_props.title
 898        if core_props.author:
 899            properties['document_author'] = core_props.author
 900        if core_props.subject:
 901            properties['document_subject'] = core_props.subject
 902        if core_props.created:
 903            properties['document_created'] = str(core_props.created)
 904        if core_props.modified:
 905            properties['document_modified'] = str(core_props.modified)
 906        
 907        return text, properties
 908    
 909    def _extract_with_docx2txt(self, file_path: str) -> Tuple[str, Dict]:
 910        """
 911        Extract text from a Word document (.doc or .docx) using docx2txt library.
 912        
 913        Parameters:
 914        -----------
 915        file_path : str
 916            Path to the Word document
 917            
 918        Returns:
 919        --------
 920        Tuple[str, Dict]
 921            A tuple containing the extracted text and document properties
 922        """
 923        text = docx2txt.process(file_path)
 924        
 925        # docx2txt doesn't provide document properties, so return empty dict
 926        properties = {}
 927        
 928        return text, properties
 929    
 930    def _extract_from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[str, Dict, Dict]:
 931        """
 932        Extract text and metadata from a PDF file using PyPDF2.
 933        
 934        Parameters:
 935        -----------
 936        file_path : str
 937            Path to the PDF file
 938            
 939        page_range : Tuple[int, int], optional
 940            Range of pages to extract (start_page, end_page), 1-indexed
 941            
 942        Returns:
 943        --------
 944        Tuple[str, Dict, Dict]
 945            A tuple containing the extracted text, PDF properties, and page information
 946        """
 947        with open(file_path, 'rb') as file:
 948            pdf_reader = PyPDF2.PdfReader(file)
 949            
 950            # Get total number of pages
 951            total_pages = len(pdf_reader.pages)
 952            
 953            # Determine page range
 954            if page_range is None:
 955                start_page = 1
 956                end_page = total_pages
 957            else:
 958                start_page, end_page = page_range
 959                # Validate page range
 960                if start_page < 1 or end_page > total_pages or start_page > end_page:
 961                    raise ValueError(f"Invalid page range: {page_range}. Pages must be between 1 and {total_pages}")
 962            
 963            # Extract text from specified pages
 964            text_parts = []
 965            for page_num in range(start_page - 1, end_page):  # Convert to 0-indexed
 966                page = pdf_reader.pages[page_num]
 967                page_text = page.extract_text()
 968                if page_text.strip():
 969                    text_parts.append(page_text)
 970            
 971            text = "\n\n".join(text_parts)
 972            
 973            # Extract PDF properties
 974            properties = {}
 975            if pdf_reader.metadata:
 976                metadata = pdf_reader.metadata
 977                if '/Title' in metadata:
 978                    properties['pdf_title'] = metadata['/Title']
 979                if '/Author' in metadata:
 980                    properties['pdf_author'] = metadata['/Author']
 981                if '/Subject' in metadata:
 982                    properties['pdf_subject'] = metadata['/Subject']
 983                if '/Creator' in metadata:
 984                    properties['pdf_creator'] = metadata['/Creator']
 985                if '/Producer' in metadata:
 986                    properties['pdf_producer'] = metadata['/Producer']
 987                if '/CreationDate' in metadata:
 988                    properties['pdf_creation_date'] = str(metadata['/CreationDate'])
 989                if '/ModDate' in metadata:
 990                    properties['pdf_modification_date'] = str(metadata['/ModDate'])
 991            
 992            # Add page information
 993            page_info = {
 994                'total_pages': total_pages,
 995                'extracted_pages_start': start_page,
 996                'extracted_pages_end': end_page,
 997                'extracted_pages_count': end_page - start_page + 1
 998            }
 999            
1000            return text, properties, page_info
1001    
1002    def _split_text(self, text: str) -> List[str]:
1003        """
1004        Split text into chunks using the specified chunking strategy.
1005        
1006        This private method implements different text chunking algorithms based
1007        on the configured chunk_strategy. It supports word, sentence, paragraph,
1008        fixed, semantic, and custom chunking strategies.
1009        
1010        Parameters:
1011        -----------
1012        text : str
1013            The text content to split into chunks.
1014            
1015        Returns:
1016        --------
1017        List[str]
1018            List of text chunks based on the selected strategy.
1019            Empty chunks are automatically filtered out.
1020            
1021        Examples:
1022        ---------
1023        ```python
1024        # Internal usage (called by from_file, from_str, from_doc, from_pdf, and from_url)
1025        chunks = builder._split_text("This is a long text that needs to be split...")
1026        print(f"Created {len(chunks)} chunks using {builder._chunk_strategy} strategy")
1027        ```
1028        
1029        Notes:
1030        ------
1031        - Chunks are stripped of leading/trailing whitespace
1032        - Empty chunks are automatically filtered out
1033        - Different strategies have different characteristics:
1034          * word: Preserves word boundaries, good for general use
1035          * sentence: Preserves sentence context, good for Q&A
1036          * paragraph: Preserves paragraph context, good for document structure
1037          * fixed: Exact size control, may break words/sentences
1038          * semantic: Attempts to preserve semantic meaning
1039        """
1040        if len(text) <= self._chunk_size:
1041            return [text]
1042        
1043        if self._chunk_strategy == "word":
1044            return self._split_by_words(text)
1045        elif self._chunk_strategy == "sentence":
1046            return self._split_by_sentences(text)
1047        elif self._chunk_strategy == "paragraph":
1048            return self._split_by_paragraphs(text)
1049        elif self._chunk_strategy == "fixed":
1050            return self._split_fixed(text)
1051        elif self._chunk_strategy == "semantic":
1052            return self._split_semantic(text)
1053        elif self._chunk_strategy == "custom":
1054            return self._custom_split_func(text, self._chunk_size, self._chunk_overlap)
1055        else:
1056            raise ValueError(f"Unsupported chunk strategy: {self._chunk_strategy}")
1057    
1058    def _split_by_words(self, text: str) -> List[str]:
1059        """
1060        Split text by word boundaries while respecting word count.
1061        
1062        This strategy splits text into chunks based on the number of words,
1063        trying to break at word boundaries when possible.
1064        """
1065        # Split text into words
1066        words = text.split()
1067        
1068        if len(words) <= self._chunk_size:
1069            return [text]
1070        
1071        chunks = []
1072        start_word = 0
1073        
1074        while start_word < len(words):
1075            # Calculate end word position for current chunk
1076            end_word = start_word + self._chunk_size
1077            
1078            # Extract words for this chunk
1079            chunk_words = words[start_word:end_word]
1080            chunk = ' '.join(chunk_words)
1081            
1082            if chunk.strip():  # Only add non-empty chunks
1083                chunks.append(chunk)
1084            
1085            # Calculate next start position with overlap
1086            new_start_word = end_word - self._chunk_overlap
1087            
1088            # Ensure we always advance to prevent infinite loops
1089            if new_start_word <= start_word:
1090                new_start_word = start_word + 1
1091            
1092            start_word = new_start_word
1093            
1094            # Safety check to prevent infinite loops
1095            if start_word >= len(words):
1096                break
1097        
1098        return chunks
1099    
1100    def _split_by_sentences(self, text: str) -> List[str]:
1101        """
1102        Split text by sentence boundaries while respecting sentence count.
1103        
1104        This strategy splits text into chunks based on the number of sentences,
1105        preserving sentence integrity.
1106        """
1107        # Define sentence endings
1108        sentence_endings = ['.', '!', '?']  # single characters only; the loop below compares one character at a time
1109        
1110        # Split text into sentences
1111        sentences = []
1112        last_pos = 0
1113        
1114        for i, char in enumerate(text):
1115            if char in sentence_endings:
1116                sentence = text[last_pos:i+1].strip()
1117                if sentence:
1118                    sentences.append(sentence)
1119                last_pos = i + 1
1120        
1121        # Add the last sentence if it doesn't end with punctuation
1122        if last_pos < len(text):
1123            last_sentence = text[last_pos:].strip()
1124            if last_sentence:
1125                sentences.append(last_sentence)
1126        
1127        if len(sentences) <= self._chunk_size:
1128            return [text]
1129        
1130        chunks = []
1131        start_sentence = 0
1132        
1133        while start_sentence < len(sentences):
1134            # Calculate end sentence position for current chunk
1135            end_sentence = start_sentence + self._chunk_size
1136            
1137            # Extract sentences for this chunk
1138            chunk_sentences = sentences[start_sentence:end_sentence]
1139            chunk = ' '.join(chunk_sentences)
1140            
1141            if chunk.strip():  # Only add non-empty chunks
1142                chunks.append(chunk)
1143            
1144            # Calculate next start position with overlap
1145            new_start_sentence = end_sentence - self._chunk_overlap
1146            
1147            # Ensure we always advance to prevent infinite loops
1148            if new_start_sentence <= start_sentence:
1149                new_start_sentence = start_sentence + 1
1150            
1151            start_sentence = new_start_sentence
1152            
1153            # Safety check to prevent infinite loops
1154            if start_sentence >= len(sentences):
1155                break
1156        
1157        return chunks
1158    
1159    def _split_by_paragraphs(self, text: str) -> List[str]:
1160        """
1161        Split text by paragraph boundaries while respecting paragraph count.
1162        
1163        This strategy splits text into chunks based on the number of paragraphs,
1164        preserving paragraph integrity.
1165        """
1166        # Split by paragraph boundaries (double newlines)
1167        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
1168        
1169        if len(paragraphs) <= self._chunk_size:
1170            return [text]
1171        
1172        chunks = []
1173        start_paragraph = 0
1174        
1175        while start_paragraph < len(paragraphs):
1176            # Calculate end paragraph position for current chunk
1177            end_paragraph = start_paragraph + self._chunk_size
1178            
1179            # Extract paragraphs for this chunk
1180            chunk_paragraphs = paragraphs[start_paragraph:end_paragraph]
1181            chunk = '\n\n'.join(chunk_paragraphs)
1182            
1183            if chunk.strip():  # Only add non-empty chunks
1184                chunks.append(chunk)
1185            
1186            # Calculate next start position with overlap
1187            new_start_paragraph = end_paragraph - self._chunk_overlap
1188            
1189            # Ensure we always advance to prevent infinite loops
1190            if new_start_paragraph <= start_paragraph:
1191                new_start_paragraph = start_paragraph + 1
1192            
1193            start_paragraph = new_start_paragraph
1194            
1195            # Safety check to prevent infinite loops
1196            if start_paragraph >= len(paragraphs):
1197                break
1198        
1199        return chunks
1200    
1201    def _split_fixed(self, text: str) -> List[str]:
1202        """
1203        Split text into fixed-size chunks without considering boundaries.
1204        
1205        This strategy creates chunks of exactly chunk_size characters
1206        (except possibly the last chunk) without trying to preserve
1207        word or sentence boundaries.
1208        """
1209        chunks = []
1210        start = 0
1211        
1212        while start < len(text):
1213            end = start + self._chunk_size
1214            chunk = text[start:end].strip()
1215            
1216            if chunk:  # Only add non-empty chunks
1217                chunks.append(chunk)
1218            
1219            # Calculate next start position with overlap
1220            new_start = end - self._chunk_overlap
1221            
1222            # Ensure we always advance to prevent infinite loops
1223            if new_start <= start:
1224                new_start = start + 1
1225            
1226            start = new_start
1227            
1228            # Safety check to prevent infinite loops
1229            if start >= len(text):
1230                break
1231        
1232        return chunks
1233    
1234    def _split_semantic(self, text: str) -> List[str]:
1235        """
1236        Split text by semantic boundaries.
1237        
1238        This strategy attempts to break text at semantic boundaries like
1239        headers, section breaks, and other structural elements while
1240        respecting the chunk size.
1241        """
1242        # Define semantic break patterns
1243        semantic_patterns = [
1244            '\n# ', '\n## ', '\n### ', '\n#### ',  # Markdown headers
1245            '\n1. ', '\n2. ', '\n3. ', '\n4. ', '\n5. ',  # Numbered lists
1246            '\n• ', '\n- ', '\n* ',  # Bullet points
1247            '\n\n',  # Paragraph breaks
1248            '\n---\n', '\n___\n',  # Horizontal rules
1249            '\n\nChapter ', '\n\nSection ', '\n\nPart ',  # Document sections
1250        ]
1251        
1252        chunks = []
1253        current_chunk = ""
1254        
1255        # Split text by semantic patterns
1256        parts = [text]
1257        for pattern in semantic_patterns:
1258            new_parts = []
1259            for part in parts:
1260                if pattern in part:
1261                    split_parts = part.split(pattern)
1262                    for i, split_part in enumerate(split_parts):
1263                        if i > 0:  # Add the pattern back to all parts except the first
1264                            split_part = pattern + split_part
1265                        if split_part.strip():
1266                            new_parts.append(split_part)
1267                else:
1268                    new_parts.append(part)
1269            parts = new_parts
1270        
1271        # Group parts into chunks
1272        for part in parts:
1273            # If adding this part would exceed chunk size, start a new chunk
1274            if len(current_chunk) + len(part) > self._chunk_size and current_chunk:
1275                chunks.append(current_chunk.strip())
1276                # Start new chunk with overlap
1277                overlap_start = max(0, len(current_chunk) - self._chunk_overlap)
1278                current_chunk = current_chunk[overlap_start:] + part
1279            else:
1280                current_chunk += part
1281        
1282        # Add the last chunk
1283        if current_chunk.strip():
1284            chunks.append(current_chunk.strip())
1285        
1286        return chunks

A utility class for building document collections from various sources.

This class provides methods to extract text content from plain-text files, raw strings, Word documents, PDFs, and web pages, split that content into manageable chunks with configurable size and overlap, and prepare the data for storage in vector databases.

The DocumentsBuilder is designed to work seamlessly with the RAG system, producing output that can be directly used with vector database operations.

Features:

  • File-based document extraction with UTF-8 encoding support
  • Text string processing for in-memory content
  • Web scraping with multiple engine options (requests, tavily, selenium)
  • Word document extraction (.doc and .docx formats)
  • PDF document extraction with metadata
  • Multiple chunking strategies (word, sentence, paragraph, fixed, semantic)
  • Configurable chunk size and overlap parameters
  • Rich metadata generation for each document chunk
  • Unique ID generation for database storage

Attributes:

  • _chunk_strategy (str): The chunking strategy to use
  • _chunk_size (int): Maximum size of each chunk (words, sentences, paragraphs, or characters, depending on the strategy)
  • _chunk_overlap (int): Overlap between consecutive chunks, expressed in the same unit as _chunk_size

Examples:

Basic usage with file processing:

# Initialize with default chunk settings (word-based)
builder = DocumentsBuilder()

# Process a text file
documents, metadatas, ids = builder.from_file("document.txt")

# Add to vector database
vector_db.add(documents, metadatas, ids)

Text string processing:

# Process text strings directly
text_content = "This is a long text that needs to be processed..."
documents, metadatas, ids = builder.from_str(text_content)

# Process with custom source name
documents, metadatas, ids = builder.from_str(
    text_content,
    source_name="user_input"
)

Different chunking strategies:

# Default settings (word-based chunking)
builder = DocumentsBuilder()

# Sentence-based chunking (5 sentences per chunk)
builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)

# Paragraph-based chunking (3 paragraphs per chunk)
builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)

# Fixed-size chunks (800 characters per chunk)
builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)

# Word-based chunks (50 words per chunk)
builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)

Web scraping with different engines:

# Basic web scraping
documents, metadatas, ids = builder.from_url("https://example.com")

# Advanced scraping with Tavily
documents, metadatas, ids = builder.from_url(
    "https://example.com",
    engine="tavily",
    deep=True
)

# JavaScript-heavy sites with Selenium
documents, metadatas, ids = builder.from_url(
    "https://spa-example.com",
    engine="selenium"
)

Word document processing:

# Process Word documents
documents, metadatas, ids = builder.from_doc("document.docx")
documents, metadatas, ids = builder.from_doc("document.doc")

# Process with custom extraction method
documents, metadatas, ids = builder.from_doc(
    "document.docx",
    extraction_method="docx2txt"
)

PDF document processing:

# Process PDF documents
documents, metadatas, ids = builder.from_pdf("document.pdf")

# Process with page range
documents, metadatas, ids = builder.from_pdf(
    "document.pdf",
    page_range=(1, 10)  # Extract pages 1-10
)

Notes:

  • chunk_overlap should typically be 10-20% of chunk_size
  • chunk_overlap must be less than chunk_size to prevent infinite loops
  • Different strategies interpret chunk_size differently (see the sketch after these notes):
    • word: chunk_size = number of words per chunk
    • sentence: chunk_size = number of sentences per chunk
    • paragraph: chunk_size = number of paragraphs per chunk
    • fixed: chunk_size = number of characters per chunk
    • semantic: chunk_size = number of characters per chunk
  • Very small chunks may lose context
  • Very large chunks may be less focused for retrieval
  • The fixed strategy produces chunks of exactly chunk_size characters (except possibly the last one); semantic chunks are at most approximately chunk_size characters
  • Word document processing requires the python-docx and docx2txt packages
  • PDF processing requires the PyPDF2 package
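
The following minimal sketch (the sample text and parameter values are illustrative assumptions, not part of the library) contrasts two of the interpretations listed above: word-based chunking, where chunk_size counts words, and fixed chunking, where it counts characters:

```python
from monoai.rag import DocumentsBuilder

# A short sample text; in practice this would come from a file, URL, or API response.
sample = "Alpha beta gamma delta. " * 30   # 120 words, roughly 720 characters

# "word" strategy: 30 words per chunk with a 5-word overlap
word_builder = DocumentsBuilder(chunk_strategy="word", chunk_size=30, chunk_overlap=5)
word_docs, _, _ = word_builder.from_str(sample, source_name="demo")

# "fixed" strategy: 200 characters per chunk with a 20-character overlap
fixed_builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=200, chunk_overlap=20)
fixed_docs, _, _ = fixed_builder.from_str(sample, source_name="demo")

print(len(word_docs), "word-based chunks;", len(fixed_docs), "fixed-size chunks")
```
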
DocumentsBuilder(chunk_strategy: str = 'word', chunk_size: int = 1000, chunk_overlap: int = 0, custom_split_func: Optional[Callable] = None)
164    def __init__(self, chunk_strategy: str = "word", chunk_size: int = 1000, chunk_overlap: int = 0, custom_split_func: Optional[callable] = None):
165        """
166        Initialize the DocumentsBuilder with chunking parameters.
167        
168        Parameters:
169        -----------
170        chunk_strategy : str, default="word"
171            The strategy to use for text chunking:
172            - "word": Break at word boundaries (spaces and newlines) when possible
173            - "sentence": Break at sentence boundaries (periods, exclamation marks, question marks)
174            - "paragraph": Break at paragraph boundaries (double newlines)
175            - "fixed": Break at exact character count without considering boundaries
176            - "semantic": Break at semantic boundaries (headers, sections, etc.)
177            - "custom": Use the provided custom_split_func for chunking
178            
179        chunk_size : int, default=1000
180            The size limit for each chunk, interpreted differently based on strategy:
181            - "word": Maximum number of words per chunk
182            - "sentence": Maximum number of sentences per chunk  
183            - "paragraph": Maximum number of paragraphs per chunk
184            - "fixed": Maximum number of characters per chunk
185            - "semantic": Maximum number of characters per chunk
186            - "custom": Passed to custom_split_func as a parameter
187            
188        chunk_overlap : int, default=0
189            The overlap between consecutive chunks, interpreted based on strategy:
190            - "word": Number of words to overlap
191            - "sentence": Number of sentences to overlap
192            - "paragraph": Number of paragraphs to overlap
193            - "fixed": Number of characters to overlap
194            - "semantic": Number of characters to overlap
195            - "custom": Passed to custom_split_func as a parameter
196            
197        custom_split_func : callable, optional
198            Custom function to use for text splitting. If provided, automatically sets chunk_strategy to "custom"
199            regardless of the chunk_strategy parameter value.
200            The function should have the signature: func(text: str, chunk_size: int, chunk_overlap: int) -> List[str]
201            and return a list of text chunks.
202            
203        Raises:
204        -------
205        ValueError
206            If chunk_overlap >= chunk_size (would cause infinite loops)
207            If chunk_size <= 0
208            If chunk_overlap < 0
209            If chunk_strategy="custom" but no custom_split_func is provided
210            
211        Examples:
212        ---------
213        ```python
214        # Default settings (word-based chunking)
215        builder = DocumentsBuilder()
216        
217        # Sentence-based chunking (5 sentences per chunk)
218        builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)
219        
220        # Paragraph-based chunking (3 paragraphs per chunk)
221        builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)
222        
223        # Fixed-size chunks (800 characters per chunk)
224        builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)
225        
226        # Word-based chunks (50 words per chunk)
227        builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)
228        
229        # Custom chunking function
230        def my_custom_split(text, chunk_size, chunk_overlap):
231            # Split by lines and then by chunk_size
232            lines = text.split('\n')
233            chunks = []
234            for i in range(0, len(lines), chunk_size - chunk_overlap):
235                chunk_lines = lines[i:i + chunk_size]
236                chunks.append('\n'.join(chunk_lines))
237            return chunks
238        
239        # Strategy automatically set to "custom" when custom_split_func is provided
240        builder = DocumentsBuilder(
241            chunk_size=100,
242            chunk_overlap=10,
243            custom_split_func=my_custom_split
244        )
245        
246        # Or explicitly set strategy (will be overridden to "custom")
247        builder = DocumentsBuilder(
248            chunk_strategy="word",  # This will be ignored
249            chunk_size=100,
250            chunk_overlap=10,
251            custom_split_func=my_custom_split  # Strategy becomes "custom"
252        )
253        ```
254        
255        Notes:
256        ------
257        - chunk_overlap should typically be 10-20% of chunk_size
258        - chunk_overlap must be less than chunk_size to prevent infinite loops
259        - Different strategies interpret chunk_size differently:
260          * word: chunk_size = number of words per chunk
261          * sentence: chunk_size = number of sentences per chunk
262          * paragraph: chunk_size = number of paragraphs per chunk
263          * fixed: chunk_size = number of characters per chunk
264          * semantic: chunk_size = number of characters per chunk
265          * custom: chunk_size is passed to custom_split_func
266        - Very small chunks may lose context
267        - Very large chunks may be less focused for retrieval
268        - The fixed strategy produces chunks of exactly chunk_size characters (except possibly the last one); semantic chunks are at most approximately chunk_size characters
269        - Custom functions should handle their own overlap logic
270        - Custom functions can implement any splitting logic:
271          * Split by specific delimiters (e.g., "---", "###")
272          * Split by regex patterns
273          * Split by semantic boundaries using NLP libraries
274          * Split by document structure (headers, sections, etc.)
275          * Combine multiple strategies
276        - When custom_split_func is provided, chunk_strategy is automatically set to "custom"
277          regardless of the chunk_strategy parameter value
278        """
279        # If custom_split_func is provided, automatically set strategy to "custom"
280        if custom_split_func is not None:
281            chunk_strategy = "custom"
282        
283        self._chunk_strategy = chunk_strategy
284        self._chunk_size = chunk_size
285        self._chunk_overlap = chunk_overlap
286        self._custom_split_func = custom_split_func
287        
288        # Validate parameters to prevent infinite loops
289        if chunk_overlap >= chunk_size:
290            raise ValueError(
291                f"chunk_overlap ({chunk_overlap}) must be less than chunk_size ({chunk_size}) "
292                "to prevent infinite loops. Recommended: chunk_overlap should be 10-20% of chunk_size."
293            )
294        
295        if chunk_size <= 0:
296            raise ValueError(f"chunk_size must be positive, got {chunk_size}")
297        
298        if chunk_overlap < 0:
299            raise ValueError(f"chunk_overlap must be non-negative, got {chunk_overlap}")
300        
301        # Validate custom split function
302        if chunk_strategy == "custom" and custom_split_func is None:
303            raise ValueError("custom_split_func must be provided when chunk_strategy='custom'")
304        
305        if custom_split_func is not None and not callable(custom_split_func):
306            raise ValueError("custom_split_func must be callable")

Initialize the DocumentsBuilder with chunking parameters.

    Parameters:
    -----------
    chunk_strategy : str, default="word"
        The strategy to use for text chunking:
        - "word": Break at word boundaries (spaces and newlines) when possible
        - "sentence": Break at sentence boundaries (periods, exclamation marks, question marks)
        - "paragraph": Break at paragraph boundaries (double newlines)
        - "fixed": Break at exact character count without considering boundaries
        - "semantic": Break at semantic boundaries (headers, sections, etc.)
        - "custom": Use the provided custom_split_func for chunking

    chunk_size : int, default=1000
        The size limit for each chunk, interpreted differently based on strategy:
        - "word": Maximum number of words per chunk
        - "sentence": Maximum number of sentences per chunk  
        - "paragraph": Maximum number of paragraphs per chunk
        - "fixed": Maximum number of characters per chunk
        - "semantic": Maximum number of characters per chunk
        - "custom": Passed to custom_split_func as a parameter

    chunk_overlap : int, default=0
        The overlap between consecutive chunks, interpreted based on strategy:
        - "word": Number of words to overlap
        - "sentence": Number of sentences to overlap
        - "paragraph": Number of paragraphs to overlap
        - "fixed": Number of characters to overlap
        - "semantic": Number of characters to overlap
        - "custom": Passed to custom_split_func as a parameter

    custom_split_func : callable, optional
        Custom function to use for text splitting. If provided, automatically sets chunk_strategy to "custom"
        regardless of the chunk_strategy parameter value.
        The function should have the signature: func(text: str, chunk_size: int, chunk_overlap: int) -> List[str]
        and return a list of text chunks.

    Raises:
    -------
    ValueError
        If chunk_overlap >= chunk_size (would cause infinite loops)
        If chunk_size <= 0
        If chunk_overlap < 0
        If chunk_strategy="custom" but no custom_split_func is provided

    Examples:
    ---------

        # Default settings (word-based chunking)
        builder = DocumentsBuilder()

        # Sentence-based chunking (5 sentences per chunk)
        builder = DocumentsBuilder(chunk_strategy="sentence", chunk_size=5)

        # Paragraph-based chunking (3 paragraphs per chunk)
        builder = DocumentsBuilder(chunk_strategy="paragraph", chunk_size=3)

        # Fixed-size chunks (800 characters per chunk)
        builder = DocumentsBuilder(chunk_strategy="fixed", chunk_size=800)

        # Word-based chunks (50 words per chunk)
        builder = DocumentsBuilder(chunk_strategy="word", chunk_size=50)

        # Custom chunking function
        def my_custom_split(text, chunk_size, chunk_overlap):
            # Split by lines and then by chunk_size
            lines = text.split('\n')
            chunks = []
            for i in range(0, len(lines), chunk_size - chunk_overlap):
                chunk_lines = lines[i:i + chunk_size]
                chunks.append('\n'.join(chunk_lines))
            return chunks

        # Strategy automatically set to "custom" when custom_split_func is provided
        builder = DocumentsBuilder(
            chunk_size=100,
            chunk_overlap=10,
            custom_split_func=my_custom_split
        )

        # Or explicitly set strategy (will be overridden to "custom")
        builder = DocumentsBuilder(
            chunk_strategy="word",  # This will be ignored
            chunk_size=100,
            chunk_overlap=10,
            custom_split_func=my_custom_split  # Strategy becomes "custom"
        )
    Notes:
    ------
    - chunk_overlap should typically be 10-20% of chunk_size
    - chunk_overlap must be less than chunk_size to prevent infinite loops
    - Different strategies interpret chunk_size differently:
      * word: chunk_size = number of words per chunk
      * sentence: chunk_size = number of sentences per chunk
      * paragraph: chunk_size = number of paragraphs per chunk
      * fixed: chunk_size = number of characters per chunk
      * semantic: chunk_size = number of characters per chunk
      * custom: chunk_size is passed to custom_split_func
    - Very small chunks may lose context
    - Very large chunks may be less focused for retrieval
    - The fixed strategy produces chunks of exactly chunk_size characters (except possibly the last one); semantic chunks are at most approximately chunk_size characters
    - Custom functions should handle their own overlap logic
    - Custom functions can implement any splitting logic (see the sketch below):
      * Split by specific delimiters (e.g., "---", "###")
      * Split by regex patterns
      * Split by semantic boundaries using NLP libraries
      * Split by document structure (headers, sections, etc.)
      * Combine multiple strategies
    - When custom_split_func is provided, chunk_strategy is automatically set to "custom" regardless of the chunk_strategy parameter value
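
As a sketch of the "custom" strategy described in the notes above (the Markdown-heading pattern, parameter values, and sample text are assumptions for illustration, not part of the library):

```python
import re
from typing import List

from monoai.rag import DocumentsBuilder

def split_on_headings(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split at Markdown headings, then merge adjacent sections up to roughly chunk_size characters."""
    # The lookahead keeps each heading attached to the section it introduces.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > chunk_size:
            chunks.append(current.strip())
            current = section
        else:
            current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks  # this sketch ignores chunk_overlap and adds no overlap of its own

# chunk_strategy is forced to "custom" because custom_split_func is provided
builder = DocumentsBuilder(chunk_size=800, chunk_overlap=0, custom_split_func=split_on_headings)
documents, metadatas, ids = builder.from_str("# Intro\nSome text...\n\n# Details\nMore text...")
```
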
def from_file(self, file_path: str) -> Tuple[List[str], List[Dict], List[str]]:
308    def from_file(self, file_path: str) -> Tuple[List[str], List[Dict], List[str]]:
309        """
310        Read a file and split it into chunks with specified size and overlap.
311        
312        This method reads a text file from the filesystem, splits its content
313        into chunks according to the configured parameters, and generates
314        metadata and unique IDs for each chunk.
315        
316        Parameters:
317        -----------
318        file_path : str
319            Path to the text file to read. The file must exist and be
320            readable. UTF-8 encoding is assumed.
321            
322        Returns:
323        --------
324        Tuple[List[str], List[Dict], List[str]]
325            A tuple containing:
326            - List of document chunks (strings): The text content split into chunks
327            - List of metadata dictionaries: Metadata for each chunk including
328              file information and chunk details
329            - List of unique IDs: UUID strings for each chunk
330            
331        Raises:
332        -------
333        FileNotFoundError
334            If the specified file does not exist or is not accessible.
335            
336        UnicodeDecodeError
337            If the file cannot be decoded as UTF-8.
338            
339        Examples:
340        ---------
341        ```python
342        # Process a single file
343        documents, metadatas, ids = builder.from_file("article.txt")
344        
345        # Access metadata information
346        for i, metadata in enumerate(metadatas):
347            print(f"Chunk {i+1}:")
348            print(f"  File: {metadata['file_name']}")
349            print(f"  Size: {metadata['chunk_size']} characters")
350            print(f"  Position: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")
351        ```
352        
353        Notes:
354        ------
355        - File is read entirely into memory before processing
356        - Empty files will return empty lists
357        - File path is stored in metadata for traceability
358        - Chunk indexing starts at 0
359        """
360        if not os.path.exists(file_path):
361            raise FileNotFoundError(f"File not found: {file_path}")
362        
363        # Read the file content
364        with open(file_path, 'r', encoding='utf-8') as file:
365            text = file.read()
366        
367        # Split text into chunks
368        chunks = self._split_text(text)
369        
370        # Generate metadata and IDs for each chunk
371        documents = []
372        metadatas = []
373        ids = []
374        
375        for i, chunk in enumerate(chunks):
376            # Generate unique ID
377            chunk_id = str(uuid.uuid4())
378            
379            # Create metadata
380            metadata = {
381                'file_path': file_path,
382                'file_name': os.path.basename(file_path),
383                'chunk_index': i,
384                'total_chunks': len(chunks),
385                'chunk_size': len(chunk)
386            }
387            
388            documents.append(chunk)
389            metadatas.append(metadata)
390            ids.append(chunk_id)
391        
392        return documents, metadatas, ids

Read a file and split it into chunks with specified size and overlap.

This method reads a text file from the filesystem, splits its content into chunks according to the configured parameters, and generates metadata and unique IDs for each chunk.

Parameters:

file_path : str
    Path to the text file to read. The file must exist and be readable. UTF-8 encoding is assumed.

Returns:

Tuple[List[str], List[Dict], List[str]]
    A tuple containing:
      • List of document chunks (strings): The text content split into chunks
      • List of metadata dictionaries: Metadata for each chunk including file information and chunk details
      • List of unique IDs: UUID strings for each chunk

Raises:

FileNotFoundError
    If the specified file does not exist or is not accessible.

UnicodeDecodeError
    If the file cannot be decoded as UTF-8.

Examples:

# Process a single file
documents, metadatas, ids = builder.from_file("article.txt")

# Access metadata information
for i, metadata in enumerate(metadatas):
    print(f"Chunk {i+1}:")
    print(f"  File: {metadata['file_name']}")
    print(f"  Size: {metadata['chunk_size']} characters")
    print(f"  Position: {metadata['chunk_index'] + 1}/{metadata['total_chunks']}")

Notes:

  • File is read entirely into memory before processing
  • Empty files will return empty lists
  • File path is stored in metadata for traceability
  • Chunk indexing starts at 0
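
A small usage sketch building on from_file (the "notes" directory is a placeholder, and vector_db stands for an already-initialized vector database exposing add(documents, metadatas, ids), as in the class-level example):

```python
import os

from monoai.rag import DocumentsBuilder

builder = DocumentsBuilder()
all_documents, all_metadatas, all_ids = [], [], []

# Collect chunks from every .txt file in the (placeholder) directory.
for name in sorted(os.listdir("notes")):
    if name.endswith(".txt"):
        documents, metadatas, ids = builder.from_file(os.path.join("notes", name))
        all_documents.extend(documents)
        all_metadatas.extend(metadatas)
        all_ids.extend(ids)

# vector_db.add(all_documents, all_metadatas, all_ids)
```
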
def from_str( self, text: str, source_name: str = 'text_string') -> Tuple[List[str], List[Dict], List[str]]:
394    def from_str(self, text: str, source_name: str = "text_string") -> Tuple[List[str], List[Dict], List[str]]:
395        """
396        Process a text string and split it into chunks with specified size and overlap.
397        
398        This method takes a text string directly and processes it using the same
399        chunking logic as file processing. It's useful when you already have
400        text content in memory and want to prepare it for vector database storage.
401        
402        Parameters:
403        -----------
404        text : str
405            The text content to process and split into chunks.
406            
407        source_name : str, default="text_string"
408            A descriptive name for the text source. This will be included
409            in the metadata for traceability and identification.
410            
411        Returns:
412        --------
413        Tuple[List[str], List[Dict], List[str]]
414            A tuple containing:
415            - List of document chunks (strings): The text content split into chunks
416            - List of metadata dictionaries: Metadata for each chunk including
417              source information and chunk details
418            - List of unique IDs: UUID strings for each chunk
419            
420        Examples:
421        ---------
422        ```python
423        # Process a simple text string
424        text_content = "This is a long text that needs to be processed..."
425        documents, metadatas, ids = builder.from_str(text_content)
426        
427        # Process with custom source name
428        documents, metadatas, ids = builder.from_str(
429            text_content,
430            source_name="user_input"
431        )
432        
433        # Process multiple text strings
434        text_parts = [
435            "First part of the document...",
436            "Second part of the document...",
437            "Third part of the document..."
438        ]
439        
440        all_documents = []
441        all_metadatas = []
442        all_ids = []
443        
444        for i, text_part in enumerate(text_parts):
445            documents, metadatas, ids = builder.from_str(
446                text_part,
447                source_name=f"document_part_{i+1}"
448            )
449            all_documents.extend(documents)
450            all_metadatas.extend(metadatas)
451            all_ids.extend(ids)
452        ```
453        
454        Notes:
455        ------
456        - Uses the same chunking strategy and parameters as other methods
457        - Empty strings will return empty lists
458        - Source name is stored in metadata for identification
459        - Useful for processing text from APIs, user input, or generated content
460        """
461        if not text or not text.strip():
462            return [], [], []
463        
464        # Split text into chunks
465        chunks = self._split_text(text)
466        
467        # Generate metadata and IDs for each chunk
468        documents = []
469        metadatas = []
470        ids = []
471        
472        for i, chunk in enumerate(chunks):
473            # Generate unique ID
474            chunk_id = str(uuid.uuid4())
475            
476            # Create metadata
477            metadata = {
478                'source_type': 'text_string',
479                'source_name': source_name,
480                'chunk_index': i,
481                'total_chunks': len(chunks),
482                'chunk_size': len(chunk),
483                'chunk_strategy': self._chunk_strategy
484            }
485            
486            documents.append(chunk)
487            metadatas.append(metadata)
488            ids.append(chunk_id)
489        
490        return documents, metadatas, ids
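Because the documents, metadata dictionaries, and IDs are returned as aligned lists, the output of `from_str` (or any other builder method) can be handed to a vector store in one call. A minimal sketch using the `chromadb` client directly; the default `DocumentsBuilder()` construction and the collection name are assumptions, and in practice you would go through this package's `ChromaVectorDB` wrapper instead.

```python
import chromadb

builder = DocumentsBuilder()  # assumed default construction
documents, metadatas, ids = builder.from_str(
    "This is a long text that needs to be processed...",
    source_name="user_input",
)

# Persist the chunks in a local ChromaDB collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("builder_demo")
collection.add(documents=documents, metadatas=metadatas, ids=ids)
```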

def from_doc(self, file_path: str, extraction_method: str = "auto") -> Tuple[List[str], List[Dict], List[str]]:
492    def from_doc(self, file_path: str, extraction_method: str = "auto") -> Tuple[List[str], List[Dict], List[str]]:
493        """
494        Extract text from Word documents (.doc and .docx files) and split into chunks.
495        
496        This method supports both .doc and .docx formats using different extraction
497        methods. For .docx files, it can use either python-docx or docx2txt libraries.
498        For .doc files, it uses docx2txt which can handle the older format.
499        
500        Parameters:
501        -----------
502        file_path : str
503            Path to the Word document (.doc or .docx file). The file must exist
504            and be readable.
505            
506        extraction_method : str, default="auto"
507            The method to use for text extraction:
508            - "auto": Automatically choose the best method based on file extension
509            - "docx": Use python-docx library (only for .docx files)
510            - "docx2txt": Use docx2txt library (works for both .doc and .docx)
511            
512        Returns:
513        --------
514        Tuple[List[str], List[Dict], List[str]]
515            A tuple containing:
516            - List of document chunks (strings): The extracted text split into chunks
517            - List of metadata dictionaries: Metadata for each chunk including
518              file information, document properties, and chunk details
519            - List of unique IDs: UUID strings for each chunk
520            
521        Raises:
522        -------
523        FileNotFoundError
524            If the specified file does not exist or is not accessible.
525            
526        ValueError
527            If the file is not a supported Word document format or if the
528            required extraction method is not available.
529            
530        ImportError
531            If the required libraries for the chosen extraction method are not installed.
532            
533        Examples:
534        ---------
535        ```python
536        # Process a .docx file with automatic method selection
537        documents, metadatas, ids = builder.from_doc("document.docx")
538        
539        # Process a .doc file
540        documents, metadatas, ids = builder.from_doc("document.doc")
541        
542        # Force specific extraction method
543        documents, metadatas, ids = builder.from_doc(
544            "document.docx",
545            extraction_method="docx2txt"
546        )
547        
548        # Access document metadata
549        for metadata in metadatas:
550            print(f"File: {metadata['file_name']}")
551            print(f"Format: {metadata['document_format']}")
552            print(f"Extraction method: {metadata['extraction_method']}")
553        ```
554        
555        Notes:
556        ------
557        - .docx files are the modern Word format (Office 2007+)
558        - .doc files are the legacy Word format (Office 97-2003)
559        - python-docx provides better structure preservation for .docx files
560        - docx2txt works with both formats but may lose some formatting
561        - Document properties (title, author, etc.) are extracted when available
562        - Images and complex formatting are not preserved in the extracted text
563        """
564        if not os.path.exists(file_path):
565            raise FileNotFoundError(f"File not found: {file_path}")
566        
567        # Determine file extension and validate
568        file_extension = os.path.splitext(file_path)[1].lower()
569        if file_extension not in ['.doc', '.docx']:
570            raise ValueError(f"Unsupported file format: {file_extension}. Only .doc and .docx files are supported.")
571        
572        # Determine extraction method
573        if extraction_method == "auto":
574            if file_extension == '.docx' and DOCX_AVAILABLE:
575                extraction_method = "docx"
576            else:
577                extraction_method = "docx2txt"
578        
579        # Extract text based on method
580        if extraction_method == "docx":
581            if not DOCX_AVAILABLE:
582                raise ImportError("python-docx library is required for 'docx' extraction method. Install with: pip install python-docx")
583            if file_extension != '.docx':
584                raise ValueError("'docx' extraction method only supports .docx files")
585            text, doc_properties = self._extract_with_docx(file_path)
586        elif extraction_method == "docx2txt":
587            if not DOCX2TXT_AVAILABLE:
588                raise ImportError("docx2txt library is required for 'docx2txt' extraction method. Install with: pip install docx2txt")
589            text, doc_properties = self._extract_with_docx2txt(file_path)
590        else:
591            raise ValueError(f"Unsupported extraction method: {extraction_method}")
592        
593        # Split text into chunks
594        chunks = self._split_text(text)
595        
596        # Generate metadata and IDs for each chunk
597        documents = []
598        metadatas = []
599        ids = []
600        
601        for i, chunk in enumerate(chunks):
602            # Generate unique ID
603            chunk_id = str(uuid.uuid4())
604            
605            # Create metadata
606            metadata = {
607                'file_path': file_path,
608                'file_name': os.path.basename(file_path),
609                'document_format': file_extension[1:],  # Remove the dot
610                'extraction_method': extraction_method,
611                'chunk_index': i,
612                'total_chunks': len(chunks),
613                'chunk_size': len(chunk)
614            }
615            
616            # Add document properties if available
617            if doc_properties:
618                metadata.update(doc_properties)
619            
620            documents.append(chunk)
621            metadatas.append(metadata)
622            ids.append(chunk_id)
623        
624        return documents, metadatas, ids
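The `_extract_with_docx` helper referenced above is defined elsewhere in the module. For orientation, extraction with python-docx typically looks like the sketch below; the `(text, properties)` return shape mirrors the call site, while the property key names are illustrative assumptions.

```python
from docx import Document

def extract_with_docx_sketch(file_path: str):
    """Illustrative .docx extraction: paragraph text plus core document properties."""
    doc = Document(file_path)
    # Join non-empty paragraphs into a single text block
    text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())

    props = doc.core_properties
    properties = {
        "doc_title": props.title or "",    # key names are illustrative
        "doc_author": props.author or "",
    }
    return text, properties
```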

def from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[List[str], List[Dict], List[str]]:
626    def from_pdf(self, file_path: str, page_range: Optional[Tuple[int, int]] = None) -> Tuple[List[str], List[Dict], List[str]]:
627        """
628        Extract text from PDF documents and split into chunks.
629        
630        This method extracts text content from PDF files using the PyPDF2 library.
631        It supports extracting all pages or a specific range of pages, and
632        preserves page information in the metadata.
633        
634        Parameters:
635        -----------
636        file_path : str
637            Path to the PDF file. The file must exist and be readable.
638            
639        page_range : Tuple[int, int], optional
640            Range of pages to extract (start_page, end_page), where pages are
641            1-indexed. If None, all pages are extracted.
642            Example: (1, 5) extracts pages 1 through 5.
643            
644        Returns:
645        --------
646        Tuple[List[str], List[Dict], List[str]]
647            A tuple containing:
648            - List of document chunks (strings): The extracted text split into chunks
649            - List of metadata dictionaries: Metadata for each chunk including
650              file information, PDF properties, page information, and chunk details
651            - List of unique IDs: UUID strings for each chunk
652            
653        Raises:
654        -------
655        FileNotFoundError
656            If the specified file does not exist or is not accessible.
657            
658        ValueError
659            If the file is not a valid PDF or if the page range is invalid.
660            
661        ImportError
662            If PyPDF2 library is not installed.
663            
664        Examples:
665        ---------
666        ```python
667        # Process entire PDF
668        documents, metadatas, ids = builder.from_pdf("document.pdf")
669        
670        # Process specific page range
671        documents, metadatas, ids = builder.from_pdf(
672            "document.pdf",
673            page_range=(1, 10)  # Pages 1-10
674        )
675        
676        # Process single page
677        documents, metadatas, ids = builder.from_pdf(
678            "document.pdf",
679            page_range=(5, 5)  # Only page 5
680        )
681        
682        # Access PDF metadata
683        for metadata in metadatas:
684            print(f"File: {metadata['file_name']}")
685            print(f"Page: {metadata.get('page_number', 'N/A')}")
686            print(f"Total pages: {metadata.get('total_pages', 'N/A')}")
687            print(f"PDF title: {metadata.get('pdf_title', 'N/A')}")
688        ```
689        
690        Notes:
691        ------
692        - Page numbers are 1-indexed (first page is page 1)
693        - Text extraction quality depends on the PDF structure
694        - Scanned PDFs may not extract text properly
695        - PDF metadata (title, author, etc.) is extracted when available
696        - Page information is preserved in chunk metadata
697        - Images and complex formatting are not preserved
698        """
699        if not PDF_AVAILABLE:
700            raise ImportError("PyPDF2 library is required for PDF processing. Install with: pip install PyPDF2")
701        
702        if not os.path.exists(file_path):
703            raise FileNotFoundError(f"File not found: {file_path}")
704        
705        # Validate file extension
706        file_extension = os.path.splitext(file_path)[1].lower()
707        if file_extension != '.pdf':
708            raise ValueError(f"Unsupported file format: {file_extension}. Only .pdf files are supported.")
709        
710        # Extract text and metadata from PDF
711        text, pdf_properties, page_info = self._extract_from_pdf(file_path, page_range)
712        
713        # Split text into chunks
714        chunks = self._split_text(text)
715        
716        # Generate metadata and IDs for each chunk
717        documents = []
718        metadatas = []
719        ids = []
720        
721        for i, chunk in enumerate(chunks):
722            # Generate unique ID
723            chunk_id = str(uuid.uuid4())
724            
725            # Create metadata
726            metadata = {
727                'file_path': file_path,
728                'file_name': os.path.basename(file_path),
729                'document_format': 'pdf',
730                'chunk_index': i,
731                'total_chunks': len(chunks),
732                'chunk_size': len(chunk)
733            }
734            
735            # Add PDF properties if available
736            if pdf_properties:
737                metadata.update(pdf_properties)
738            
739            # Add page information if available
740            if page_info:
741                metadata.update(page_info)
742            
743            documents.append(chunk)
744            metadatas.append(metadata)
745            ids.append(chunk_id)
746        
747        return documents, metadatas, ids
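Likewise, `_extract_from_pdf` is not shown in this listing. A sketch of the page-range handling with PyPDF2, keeping the `(text, properties, page_info)` return shape used at the call site; metadata key names other than `total_pages` and `pdf_title` are assumptions.

```python
from typing import Dict, Optional, Tuple
from PyPDF2 import PdfReader

def extract_from_pdf_sketch(file_path: str,
                            page_range: Optional[Tuple[int, int]] = None
                            ) -> Tuple[str, Dict, Dict]:
    """Illustrative PDF extraction restricted to a 1-indexed page range."""
    reader = PdfReader(file_path)
    total_pages = len(reader.pages)

    start, end = page_range if page_range else (1, total_pages)
    if not (1 <= start <= end <= total_pages):
        raise ValueError(f"Invalid page range {page_range} for a {total_pages}-page PDF")

    # Pages are 0-indexed internally; the public range is 1-indexed
    text = "\n".join(reader.pages[i].extract_text() or "" for i in range(start - 1, end))

    meta = reader.metadata
    properties = {"pdf_title": (meta.title or "") if meta else ""}
    page_info = {"page_start": start, "page_end": end, "total_pages": total_pages}
    return text, properties, page_info
```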

def from_url(self, url: str, engine: str = "requests", deep: bool = False) -> Tuple[List[str], List[Dict], List[str]]:
749    def from_url(self, url: str, engine: str = "requests", deep: bool = False) -> Tuple[List[str], List[Dict], List[str]]:
750        """
751        Scrape content from a URL and split it into chunks with specified size and overlap.
752        
753        This method uses web scraping to extract text content from a webpage,
754        then processes the content using the same chunking logic as file processing.
755        Multiple scraping engines are supported for different types of websites.
756        
757        Parameters:
758        -----------
759        url : str
760            The URL to scrape. Must be a valid HTTP/HTTPS URL.
761            
762        engine : str, default="requests"
763            The web scraping engine to use:
764            - "requests": Simple HTTP requests (fast, good for static content)
765            - "tavily": Advanced web scraping with better content extraction
766            - "selenium": Full browser automation (good for JavaScript-heavy sites)
767            
768        deep : bool, default=False
769            If using the "tavily" engine, whether to use advanced extraction mode.
770            Deep extraction provides better content quality but is slower.
771            
772        Returns:
773        --------
774        Tuple[List[str], List[Dict], List[str]]
775            A tuple containing:
776            - List of document chunks (strings): The scraped text split into chunks
777            - List of metadata dictionaries: Metadata for each chunk including
778              URL information and scraping details
779            - List of unique IDs: UUID strings for each chunk
780            
781        Raises:
782        -------
783        ValueError
784            If the scraping fails or no text content is extracted.
785            
786        Examples:
787        ---------
788        ```python
789        # Basic web scraping
790        documents, metadatas, ids = builder.from_url("https://example.com")
791        
792        # Advanced scraping with Tavily
793        documents, metadatas, ids = builder.from_url(
794            "https://blog.example.com",
795            engine="tavily",
796            deep=True
797        )
798        
799        # JavaScript-heavy site with Selenium
800        documents, metadatas, ids = builder.from_url(
801            "https://spa.example.com",
802            engine="selenium"
803        )
804        
805        # Access scraping metadata
806        for metadata in metadatas:
807            print(f"Source: {metadata['url']}")
808            print(f"Engine: {metadata['scraping_engine']}")
809            print(f"Deep extraction: {metadata['deep_extraction']}")
810        ```
811        
812        Notes:
813        ------
814        - Scraping may take time depending on the engine and website complexity
815        - Some websites may block automated scraping
816        - Selenium requires Chrome/Chromium to be installed
817        - Tavily requires an API key to be configured
818        """
819        # Initialize WebScraping with specified engine
820        scraper = WebScraping(engine=engine, deep=deep)
821        
822        # Scrape the URL
823        result = scraper.scrape(url)
824        
825        if not result or not result.get("text"):
826            raise ValueError(f"Failed to extract text content from URL: {url}")
827        
828        text = result["text"]
829        
830        # Split text into chunks
831        chunks = self._split_text(text)
832        
833        # Generate metadata and IDs for each chunk
834        documents = []
835        metadatas = []
836        ids = []
837        
838        for i, chunk in enumerate(chunks):
839            # Generate unique ID
840            chunk_id = str(uuid.uuid4())
841            
842            # Create metadata
843            metadata = {
844                'url': url,
845                'source_type': 'web_page',
846                'scraping_engine': engine,
847                'deep_extraction': deep,
848                'chunk_index': i,
849                'total_chunks': len(chunks),
850                'chunk_size': len(chunk)
851            }
852            
853            documents.append(chunk)
854            metadatas.append(metadata)
855            ids.append(chunk_id)
856        
857        return documents, metadatas, ids
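The `WebScraping` class lives elsewhere in the package; its `requests` engine presumably boils down to fetching the page and stripping markup. A rough sketch of that idea with `requests` and BeautifulSoup, returning the same `{"text": ...}` shape read by `from_url` above; this is an assumption about the engine's behavior, not its actual implementation.

```python
import requests
from bs4 import BeautifulSoup

def scrape_with_requests_sketch(url: str, timeout: float = 10.0) -> dict:
    """Illustrative static-page scrape: fetch HTML and keep only the visible text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()

    return {"text": soup.get_text(separator="\n", strip=True)}
```

For JavaScript-rendered pages this static approach returns little useful text, which is why the `selenium` engine exists.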
