Metadata-Version: 2.4
Name: prevectorchunks-core
Version: 0.1.25
Summary: A Python module that allows conversion of a document into chunks to be inserted into Pinecone vector database
Author-email: Zul Al-Kabir <zul.developer.2023@gmail.com>
Project-URL: Homepage, https://github.com/yourusername/mydep
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: packaging~=24.1
Requires-Dist: openai<3.0.0,>=2.6.0
Requires-Dist: python-dotenv~=1.0.1
Requires-Dist: PyJWT~=2.7.0
Requires-Dist: fastapi~=0.112.2
Requires-Dist: datasets~=4.1.0
Requires-Dist: pinecone~=7.3.0
Requires-Dist: pytesseract~=0.3.13
Requires-Dist: python-docx~=1.2.0
Requires-Dist: PyPDF2~=3.0.1
Requires-Dist: pillow~=11.3.0
Requires-Dist: torch~=2.6.0
Requires-Dist: torchvision~=0.21.0
Requires-Dist: torchaudio~=2.6.0
Requires-Dist: sentence-transformers~=5.1.1
Requires-Dist: py-gutenberg~=1.0.3
Requires-Dist: langchain-text-splitters~=0.3.11
Requires-Dist: langchain~=0.3
Requires-Dist: langchain_openai~=0.3.35
Requires-Dist: accelerate>=0.22.0
Requires-Dist: pathlib~=1.0.1
Requires-Dist: transformers~=4.57.0
Requires-Dist: imageio-ffmpeg~=0.6.0
Requires-Dist: opencv-python~=4.12.0.88
Requires-Dist: requests~=2.32.5
Requires-Dist: langchain-core~=0.3.78
Requires-Dist: pdf2image~=1.17.0
Requires-Dist: docx2pdf~=0.1.8
Requires-Dist: numpy~=2.2.6
Requires-Dist: scikit-learn~=1.7.2
Requires-Dist: PyMuPDF~=1.22.5
Dynamic: license-file

# 📚 PreVectorChunks

> A lightweight utility for **document chunking** and **vector database upserts** — designed for developers building **RAG (Retrieval-Augmented Generation)** solutions.

---

## ✨ Who Needs This Module?
Any developer working with:
- **RAG pipelines**
- **Vector Databases** (like Pinecone, Weaviate, etc.)
- **AI applications** requiring **similar content retrieval**

---

## 🎯 What Does This Module Do?
This module helps you:
- **Chunk documents** into smaller fragments using:
  - a pretrained Reinforcement Learning based model,
  - a pretrained Reinforcement Learning based model with proposition indexing,
  - standard word based chunking,
  - recursive character based chunking, or
  - simple character based chunking
- **Insert (upsert) fragments** into a vector database  
- **Fetch & update** existing chunks from a vector database  
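The recursive character strategy listed above can be pictured with a minimal standalone sketch: try the coarsest separator first, then recurse into any piece that is still too long. This is only an illustration of the idea (the package delegates the real work to `langchain-text-splitters`):

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=40):
    """Split on the coarsest separator first; recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in separators if s in text), None)
    if sep is None:  # no separator left: hard cut at chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, separators, chunk_size))
    return [c for c in chunks if c.strip()]
```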

---

## 📦 Installation
```bash
pip install prevectorchunks-core
```

Import the services module in your code:  
```python
from PreVectorChunks.services import chunk_documents_crud_vdb
```

**Use a `.env` file for API keys. IMPORTANT: provide at least your `OPENAI_API_KEY` in a `.env` file, plus any other keys your workflow requires:**
```
PINECONE_API_KEY=YOUR_API_KEY
OPENAI_API_KEY=YOUR_API_KEY
```
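A missing key typically only surfaces later as an API error, so a small fail-fast check at startup can help (run it after your `.env` has been loaded, e.g. via `python-dotenv`). `missing_keys` below is an illustrative helper, not part of this package:

```python
import os

REQUIRED = ("OPENAI_API_KEY",)  # minimum, per the note above
# PINECONE_API_KEY is additionally needed for the VDB functions.

def missing_keys(env=os.environ, required=REQUIRED):
    """Return the required keys that are unset or empty."""
    return [k for k in required if not env.get(k)]

# A populated environment passes; an empty one reports what is missing:
assert missing_keys({"OPENAI_API_KEY": "sk-test"}) == []
assert missing_keys({}) == ["OPENAI_API_KEY"]
```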

---

## 📄 Functions

### 1. `chunk_documents`
```python
chunk_documents(instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())
```
Splits the content of a document into smaller, manageable chunks. Five types of document chunking are supported, and each can optionally use an LLM to structure the chunked text (enabled by default; disable with `enableLLMTouchUp=False`):
- Chunking using a pretrained Reinforcement Learning based model
- Chunking using a pretrained Reinforcement Learning based model with proposition indexing
- Recursive character based chunking
- Standard word based chunking
- Simple character based chunking


**Parameters**
- `instructions` (*dict or str*): Additional rules or guidance for how the document should be split.
  - Example: `"split my content by biggest headings"`
- `file_path` (*str*): Path to the input file, or the file's content itself (text or binary). Default: `"content_playground/content.json"`.
- `splitter_config` (*SplitterConfig*, optional): Object that defines chunking behavior, e.g. `chunk_size`, `chunk_overlap`, `separators`, `split_type`. If none is provided, a standard split takes place. Examples:
  - `SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.RECURSIVE.value)` (`chunk_size` is measured in characters when `RECURSIVE` is used)
  - `SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.CHARACTER.value)` (`chunk_size` is measured in characters when `CHARACTER` is used)
  - `SplitterConfig(chunk_size=300, chunk_overlap=0, separators=["\n"], split_type=SplitType.STANDARD.value)` (`chunk_size` is measured in words when `STANDARD` is used)
  - `SplitterConfig(separators=["\n"], split_type=SplitType.R_PRETRAINED.value, min_rl_chunk_size=5, max_rl_chunk_size=50, enableLLMTouchUp=False)` (`min_rl_chunk_size` and `max_rl_chunk_size` are measured in sentences when `R_PRETRAINED` is used)
  - `SplitterConfig(separators=["\n"], split_type=SplitType.R_PRETRAINED_PROPOSITION.value, min_rl_chunk_size=5, max_rl_chunk_size=50, enableLLMTouchUp=False)` (likewise measured in sentences when `R_PRETRAINED_PROPOSITION` is used)

**Returns**
- A list of chunks, each containing a unique id, a meaningful title, and the chunked text

**Use Cases**
- Preparing text for LLM ingestion
- Splitting text by structure (headings, paragraphs)
- Vector database indexing
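To make the `chunk_size`/`chunk_overlap` semantics of the word based (`STANDARD`) mode concrete, here is a minimal standalone sketch (an illustration of the windowing idea, not the package's actual implementation):

```python
def word_chunks(text, chunk_size=100, chunk_overlap=0):
    """Split text into windows of `chunk_size` words overlapping by `chunk_overlap` words."""
    words = text.split()
    step = max(chunk_size - chunk_overlap, 1)  # how far each window advances
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - chunk_overlap, 1), step)]
```

With `chunk_overlap > 0`, the tail of each chunk is repeated at the head of the next, which helps preserve context across chunk boundaries during retrieval.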

---

### 2. `chunk_and_upsert_to_vdb`
```python
chunk_and_upsert_to_vdb(index_n, instructions, file_path="content_playground/content.json", splitter_config=SplitterConfig())
```
Splits a document into chunks (via `chunk_documents`) and **inserts them into a Vector Database**.

**Parameters**
- `index_n` (*str*): The name of the VDB index where chunks should be stored.
- `instructions` (*dict or str*): Rules for splitting content (same as `chunk_documents`).
- `file_path` (*str*): Path to the document file, or the file's content itself. Default: `"content_playground/content.json"`.
- `splitter_config` (*SplitterConfig*): Object that defines chunking behavior.

**Returns**
- Confirmation of successful insert into the VDB.

**Use Cases**
- Automated document preprocessing and storage for vector search
- Preparing embeddings for semantic search
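Under the hood, an upsert of this kind boils down to pairing each chunk with an embedding vector and metadata. A hand-rolled sketch of that shape, using a toy embedding function (the package computes real embeddings for you, and its exact record schema may differ):

```python
def to_upsert_payload(chunks, embed):
    """Map chunk dicts (id/title/text) to Pinecone-style upsert records."""
    return [
        {
            "id": c["id"],
            "values": embed(c["text"]),  # the embedding vector
            "metadata": {"title": c["title"], "text": c["text"]},
        }
        for c in chunks
    ]

def fake_embed(text):
    # Toy stand-in: real code would call an embedding model instead.
    return [float(len(text)), 0.0]

payload = to_upsert_payload(
    [{"id": "doc1#0", "title": "Intro", "text": "hello world"}], fake_embed
)
```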

---

### 3. `fetch_vdb_chunks_grouped_by_document_name`
```python
fetch_vdb_chunks_grouped_by_document_name(index_n)
```
Fetches existing chunks stored in the Vector Database, grouped by **document name**.

**Parameters**
- `index_n` (*str*): The name of the VDB index.

**Returns**
- A dictionary or list of chunks grouped by document name.

**Use Cases**
- Retrieving all chunks of a specific document
- Verifying what content has been ingested into the VDB
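The grouping this function performs can be pictured as a simple reduction over chunk metadata. The sketch below assumes each chunk's metadata carries a `document_name` field (an illustrative assumption, not necessarily the package's exact schema):

```python
from collections import defaultdict

def group_by_document(chunks):
    """Group chunk dicts by their metadata's document_name."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["metadata"]["document_name"]].append(chunk)
    return dict(grouped)

chunks = [
    {"id": "a#0", "metadata": {"document_name": "a.pdf"}},
    {"id": "b#0", "metadata": {"document_name": "b.pdf"}},
    {"id": "a#1", "metadata": {"document_name": "a.pdf"}},
]
grouped = group_by_document(chunks)  # two chunks under "a.pdf", one under "b.pdf"
```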

---

### 4. `update_vdb_chunks_grouped_by_document_name`
```python
update_vdb_chunks_grouped_by_document_name(index_n, dataset)
```
Updates existing chunks in the Vector Database by document name.

**Parameters**
- `index_n` (*str*): The name of the VDB index.  
- `dataset` (*dict or list*): The new data (chunks) to update existing entries.

**Returns**
- Confirmation of update status.

**Use Cases**
- Keeping VDB chunks up to date when documents change
- Re-ingesting revised or corrected content

---
### 5. `markdown_and_chunk_documents`
```python
from prevectorchunks_core.services.markdown_and_chunk_documents import MarkdownAndChunkDocuments

markdown_processor = MarkdownAndChunkDocuments()
mapped_chunks = markdown_processor.markdown_and_chunk_documents("example.pdf")
```

**Description**  
This function automatically:
1. Converts a document (PDF, DOCX, etc.) into images using `DocuToImageConverter`.
2. Extracts **Markdown and text** content from those images using `DocuToMarkdownExtractor` (powered by GPT).
3. Converts the extracted markdown text into **RL-based chunks** using `ChunkMapper` and `chunk_documents`.
4. Merges unmatched markdown segments into the final structured output.

**Parameters**
- `file_path` (*str*): Path to the document (PDF, DOCX, or image) you want to process.

**Returns**
- `mapped_chunks` (*list[dict]*): A list of markdown-based chunks with both markdown and chunked text content.

**Example**
```python
if __name__ == "__main__":
    markdown_processor = MarkdownAndChunkDocuments()
    mapped_chunks = markdown_processor.markdown_and_chunk_documents("421307-nz-au-top-loading-washer-guide-shorter.pdf")
    print(mapped_chunks)
```

**Use Cases**
- End-to-end document-to-markdown-to-chunks pipeline
- Automating preprocessing for RAG/LLM ingestion
- Extracting structured markdown for semantic search or content indexing

---

## 🚀 Example Workflow
```python
from prevectorchunks_core.config import SplitterConfig, SplitType  # SplitType is assumed to live alongside SplitterConfig
from PreVectorChunks.services import chunk_documents_crud_vdb

# Step 1: Chunk a document (recursive character based splitting)
splitter_config = SplitterConfig(chunk_size=150, chunk_overlap=0, separators=["\n"],
                                 split_type=SplitType.RECURSIVE.value)

chunks = chunk_documents(
    instructions="split my content by biggest headings",
    file_path="content_playground/content.json",
    splitter_config=splitter_config
)

# RL-based chunking with proposition indexing is bounded in sentences, not characters
splitter_config = SplitterConfig(separators=["\n"],
                                 split_type=SplitType.R_PRETRAINED_PROPOSITION.value,
                                 min_rl_chunk_size=5, max_rl_chunk_size=50,
                                 enableLLMTouchUp=False)

chunks = chunk_documents_crud_vdb.chunk_documents("extract", file_name=None,
                                                  file_path="content.txt",
                                                  splitter_config=splitter_config)

# Step 2: Insert chunks into VDB
chunk_and_upsert_to_vdb("my_index", instructions="split by headings", splitter_config=splitter_config)

# Step 3: Fetch stored chunks
docs = fetch_vdb_chunks_grouped_by_document_name("my_index")

# Step 4: Update chunks if needed
update_vdb_chunks_grouped_by_document_name("my_index", dataset=docs)
```

---

## 🛠 Use Cases
- Preprocessing documents for LLM ingestion  
- Semantic search and Q&A systems  
- Vector database indexing and retrieval  
- Maintaining versioned document chunks

