A flexible and modular vector search system for document processing, embedding generation, and similarity search.
- 📁 Multiple Source Providers
  - Local folder and file processing
  - Google Drive integration with Google Workspace support
  - Azure Blob Storage integration
  - Custom source provider support
- 📝 Flexible Text Chunking
  - Word-based chunking
  - Character-based chunking
  - Custom chunking strategies
  - Configurable chunk size and overlap
- 🔄 Text Augmentation
  - Multiple AI providers (Ollama, OpenAI, Azure OpenAI)
  - Semantic variation generation
  - Original content preservation
  - Customizable augmentation parameters
- 🏷️ AI-Powered Tag Generation
  - Free-form tag generation
  - Predefined tag selection
  - Multiple model support (Ollama, OpenAI, Azure)
  - Customizable tag parameters
- 🧠 Multiple Embedding Providers
  - Ollama (local processing)
  - OpenAI
  - Azure OpenAI
  - Custom embedding support
- 💾 Database Options
  - PostgreSQL with pgvector
  - Supabase integration
  - Vector similarity search
  - Metadata storage and filtering
- 🔍 Text Filtering
  - Custom preprocessing filters
  - Format-specific handling
  - Metadata preservation
- 🔐 Security
  - Environment-based configuration
  - API key management
  - Secure credential handling
- 🛠️ Extensibility
  - Custom component support
  - Modular architecture
  - Easy integration options
- Installation
- Environment Setup
- Basic Usage
- Components
- Advanced Features
- Custom Components
- Testing
- Contributing
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/vector-search.git
  cd vector-search
  ```

- Create and activate a virtual environment using uv:

  ```bash
  uv venv
  source .venv/bin/activate  # On Unix/macOS
  # or
  .venv\Scripts\activate     # On Windows
  ```

- Install dependencies:

  ```bash
  uv pip install -r requirements.txt
  ```

To use vector-search in your own projects:
- Create and activate a virtual environment for your project:

  ```bash
  mkdir your-project
  cd your-project
  uv venv
  source .venv/bin/activate  # On Unix/macOS
  # or
  .venv\Scripts\activate     # On Windows
  ```

- Install vector-search directly from the project path:

  ```bash
  uv pip install '/path/to/vector-search'
  ```

Now you can import and use vector-search in your code:

```python
from vector_search import VectorSearch
```

Create a `.env` file in your project root:

```
# Database Configuration
POSTGRES_DB=your_db_name
POSTGRES_USER=your_user
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
VECTOR_DIM=1536
# Chunking Configuration
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
# AI Provider Configuration
OPENAI_API_KEY=your_api_key
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_ENDPOINT=your_azure_endpoint
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
OLLAMA_BASE_URL=http://localhost:11434
# Storage Provider Configuration
GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_STORAGE_CONTAINER=your_container_name
SUPABASE_URL=your_project_url
SUPABASE_KEY=your_api_key
SUPABASE_TABLE=chunks
```
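If your application doesn't load `.env` files automatically, load one explicitly before initializing any components. A minimal sketch, assuming the `python-dotenv` package is installed:

```python
from dotenv import load_dotenv

# Read variables from .env in the current directory into the process environment
load_dotenv()
```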
```python
from vector_search import VectorSearch

# Initialize with default components
vector_search = VectorSearch(
source_type="folder", # Use folder source
chunker_type="word", # Use word-based chunking
embedding_type="ollama", # Use Ollama for embeddings
database_type="postgres", # Use PostgreSQL database
augment=False # No text augmentation
)
# Process documents from a folder
vector_search.process_source("path/to/documents")
# Search for similar content
results = vector_search.search(
query="What is machine learning?",
limit=5,
min_similarity=0.7
)
# Print results
for result in results:
    print(f"Text: {result['text']}")
    print(f"Similarity: {result['similarity']}")
    print(f"Source: {result['metadata']['source']}")
    print(f"Date: {result['date']}")
    print("---")
```
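Here `limit` caps the number of matches returned, and `min_similarity` filters out weaker ones; similarity is computed as 1 minus the cosine distance between embeddings (as in the Supabase `match_chunks` function shown later).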
```python
from vector_search import VectorSearch

# Using folder source
vs = VectorSearch(source_type="folder")
vs.process_source("path/to/folder")
# Using single file source
vs = VectorSearch(source_type="file")
vs.process_source("path/to/file.txt")
```

Using Google Drive:

```python
from vector_search import VectorSearch

vs = VectorSearch(source_type="google_drive")
vs.process_source("your_folder_id")  # From folder URL
```

Supports:

- Google Docs (→ Markdown)
- Google Sheets (→ Markdown)
- Google Slides (→ Markdown)
- PDFs, text files, etc.

Using Azure Blob Storage:

```python
from vector_search import VectorSearch

vs = VectorSearch(source_type="azure_blob")
vs.process_source("container/path")
```

Chunking is configured when constructing `VectorSearch`:

```python
from vector_search import VectorSearch

# Word-based chunking
vs = VectorSearch(
chunker_type="word",
chunk_size=1000,
chunk_overlap=200
)
# Character-based chunking
vs = VectorSearch(
chunker_type="character",
chunk_size=4000,
chunk_overlap=400
)
```
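For the word chunker, `chunk_size` and `chunk_overlap` are measured in words; for the character chunker, in characters. With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks share 200 units, which preserves context across chunk boundaries.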
```python
from vector_search import VectorSearch

# Using Ollama for augmentation
vs = VectorSearch(
augment=True,
augmenter_type="ollama",
augmenter_config={
"model_name": "llama3.1:8b"
}
)
# Using OpenAI
vs = VectorSearch(
augment=True,
augmenter_type="openai",
augmenter_config={
"model_name": "gpt-3.5-turbo"
}
)
```

Generating tags:

```python
from vector_search.tags import OllamaTagGenerator, OpenAITagGenerator

# Free-form tags
generator = OllamaTagGenerator(max_tags=3)
tagged_chunks = generator.generate_tags(chunks)
# Predefined tags
generator = OpenAITagGenerator(
max_tags=3,
predefined_tags={"ai", "machine learning", "python"}
)
tagged_chunks = generator.generate_tags(chunks)
```
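Tags land in each chunk's metadata (described in detail in the tag generation section below), so they can be inspected directly:

```python
for chunk in tagged_chunks:
    print(chunk["metadata"].get("tags", []))
```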
Using Ollama embeddings:

```python
from vector_search import VectorSearch

vs = VectorSearch(
embedding_type="ollama",
embedding_config={
"model_name": "bge-m3:latest"
}
)
```

Using OpenAI:

```python
vs = VectorSearch(
embedding_type="openai",
embedding_config={
"model_name": "text-embedding-3-small"
}
)
```

Using Azure OpenAI:

```python
vs = VectorSearch(
embedding_type="azure_openai",
embedding_config={
"deployment": "your-deployment"
}
)
```
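Whichever provider you use, its embedding dimension must match `VECTOR_DIM` and the `vector(1536)` column in the database schema below. OpenAI's `text-embedding-3-small` produces 1536-dimensional vectors; a model like `bge-m3` produces 1024-dimensional ones, so both settings would need adjusting.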
```python
from vector_search import VectorSearch

vs = VectorSearch(database_type="postgres")
```

Required schema (the pgvector extension must be enabled first, e.g. `CREATE EXTENSION IF NOT EXISTS vector;`):

```sql
CREATE TABLE chunks (
id SERIAL PRIMARY KEY,
embedding vector(1536),
text TEXT NOT NULL,
metadata JSONB,
date TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

```python
vs = VectorSearch(database_type="supabase")
```

Additional function required for Supabase:

```sql
create or replace function match_chunks (
query_embedding vector(1536),
match_threshold float,
match_count int
)
returns table (
id bigint,
text text,
metadata jsonb,
date timestamptz,
similarity float
)
language plpgsql
as $$
begin
return query
select
chunks.id,
chunks.text,
chunks.metadata,
chunks.date,
1 - (chunks.embedding <=> query_embedding) as similarity
from chunks
where 1 - (chunks.embedding <=> query_embedding) > match_threshold
order by chunks.embedding <=> query_embedding
limit match_count;
end;
$$;
```
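Once created, this function can be called from Python through the `rpc` helper in supabase-py. A minimal sketch, assuming the `supabase` client package is installed and a query embedding is already available:

```python
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

query_embedding = [0.0] * 1536  # stand-in; use a real query embedding here

response = supabase.rpc(
    "match_chunks",
    {
        "query_embedding": query_embedding,
        "match_threshold": 0.7,
        "match_count": 5,
    },
).execute()

for row in response.data:
    print(row["similarity"], row["text"])
```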
Custom text filters can be applied during processing via the `text_filter` parameter:

```python
def custom_filter(text: str) -> str:
    # Remove specific patterns
    text = text.replace("[REMOVE]", "")
    # Convert to lowercase
    text = text.lower()
    # Clean whitespace
    text = " ".join(text.split())
    return text

vs = VectorSearch(
source_type="folder",
text_filter=custom_filter
)
```

Custom source provider:

```python
from vector_search.sources import BaseSource

class CustomSource(BaseSource):
    def load_documents(self, source_path: str):
        # Your implementation: yield one dict per document
        with open(source_path, encoding="utf-8") as f:
            content = f.read()
        yield {
            "text": content,
            "metadata": {"source": source_path}
        }
```

Custom chunker:

```python
from vector_search.chunker import BaseChunker

class CustomChunker(BaseChunker):
    def chunk_text(self, text: str, metadata: dict = None):
        # Your implementation: split the text into chunk dicts
        chunks = [{"text": text, "metadata": metadata or {}, "chunk_index": 0}]
        return chunks
```

Custom embedding provider:

```python
from typing import List, Union

from vector_search.embeddings import BaseEmbedding

class CustomEmbedding(BaseEmbedding):
    def embed(self, texts: Union[str, List[str]]):
        # Your implementation: return one vector per input text
        if isinstance(texts, str):
            texts = [texts]
        embeddings = [[0.0] * 1536 for _ in texts]  # placeholder vectors
        return embeddings
```

Run specific tests:

```bash
python -m pytest tests/test_database.py
python -m pytest tests/test_embeddings.py
python -m pytest tests/test_sources.py
```

Run all tests:

```bash
python -m pytest tests/
```

- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
The source providers support custom text filtering through an optional `text_filter` parameter, letting you preprocess text content before it moves further down the pipeline.
```python
from vector_search.sources import FolderSource, FileSource

# Create a simple lowercase filter
def lowercase_filter(text: str) -> str:
    return text.lower()

# Initialize source with the filter
source = FolderSource(text_filter=lowercase_filter)
# All documents loaded will have lowercase text
for doc in source.load_documents("path/to/folder"):
    print(doc["text"])  # Text will be lowercase
```

- Remove Punctuation:

```python
def remove_punctuation_filter(text: str) -> str:
    return "".join(char for char in text if char.isalnum() or char.isspace())

source = FileSource(text_filter=remove_punctuation_filter)
```

- Clean Extra Whitespace:

```python
def clean_whitespace_filter(text: str) -> str:
    return " ".join(text.split())

source = FolderSource(text_filter=clean_whitespace_filter)
```

- Custom Text Processing:

```python
def custom_filter(text: str) -> str:
    # Remove specific patterns
    text = text.replace("[REMOVE]", "")
    # Convert to lowercase
    text = text.lower()
    # Clean whitespace
    text = " ".join(text.split())
    return text

source = FileSource(text_filter=custom_filter)
```

A few notes on filter behavior:

- The filter function receives the raw text content as a string
- It must return the processed text as a string
- The filter is applied before the text is added to the document dictionary
- Original metadata remains unchanged
- The filter is applied consistently across all supported file formats
- If no filter is provided, the text remains unchanged
Text filters can be used with any source provider:
- `FolderSource`: applied to all files in the folder
- `FileSource`: applied to the single file
- `GoogleDriveSource`: applied to Google Drive documents
- `AzureBlobSource`: applied to Azure Blob Storage files
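Because filters are plain functions, they compose naturally. A small sketch chaining the filters defined above (the `compose_filters` helper is hypothetical, not part of the library):

```python
def compose_filters(*filters):
    # Apply each filter in order, left to right
    def combined(text: str) -> str:
        for f in filters:
            text = f(text)
        return text
    return combined

source = FolderSource(
    text_filter=compose_filters(remove_punctuation_filter, clean_whitespace_filter)
)
```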
The system supports reading files directly from Google Drive, including native Google Workspace files (Docs, Sheets, Slides) which are automatically converted to Markdown format.
- Create a Google Cloud Project:
  - Go to the Google Cloud Console
  - Create a new project or select an existing one
  - Enable the Google Drive API: `APIs & Services > Library` > search for "Google Drive API" > Enable
- Create OAuth 2.0 Credentials:
  - Go to `APIs & Services > Credentials`
  - Click `Create Credentials > OAuth client ID`
  - Select `Desktop Application`
  - Download the credentials JSON file
- Generate Access Token:
```python
from google_auth_oauthlib.flow import InstalledAppFlow
import json

# If modifying scopes, delete token.json.
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

def generate_token():
    # Load client configuration from downloaded credentials
    flow = InstalledAppFlow.from_client_config(
        # Your downloaded credentials
        client_config={
            "installed": {
                "client_id": "YOUR_CLIENT_ID",
                "project_id": "YOUR_PROJECT_ID",
                "auth_uri": "https://accounts.google.com/o/oauth2/auth",
                "token_uri": "https://oauth2.googleapis.com/token",
                "client_secret": "YOUR_CLIENT_SECRET",
                "redirect_uris": ["http://localhost"]
            }
        },
        scopes=SCOPES
    )
    # Run local server for authentication
    creds = flow.run_local_server(port=0)
    # Save the credentials
    token_data = {
        'token': creds.token,
        'refresh_token': creds.refresh_token,
        'token_uri': creds.token_uri,
        'client_id': creds.client_id,
        'client_secret': flow.client_config['installed']['client_secret'],
        'scopes': creds.scopes
    }
    with open('token.json', 'w') as token_file:
        json.dump(token_data, token_file)

if __name__ == '__main__':
    generate_token()
```
- Set up environment variables:

```
GOOGLE_APPLICATION_CREDENTIALS=path/to/your/token.json
```

Then, in your code:

```python
from vector_search import VectorSearch

# Initialize with Google Drive source
vector_search = VectorSearch(
source_type="google_drive",
chunker_type="word",
embedding_type="ollama",
database_type="postgres"
)
# Process documents from a Google Drive folder
folder_id = "your_folder_id" # Get this from the folder's URL
vector_search.process_source(folder_id)
```

The Google Drive integration supports various file types:
- Google Workspace Files (automatically converted to Markdown):
  - Google Docs (`application/vnd.google-apps.document`)
  - Google Sheets (`application/vnd.google-apps.spreadsheet`)
  - Google Slides (`application/vnd.google-apps.presentation`)
- Regular Files:
  - Text files (`text/plain`)
  - Markdown files (`text/markdown`)
  - JSON files (`application/json`)
  - PDF files (`application/pdf`)
  - HTML files (`text/html`)
To get a folder's ID:

- Open the folder in Google Drive
- The URL will look like: `https://drive.google.com/drive/folders/1234...`
- The long string after `folders/` is your folder ID (see the helper below)
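A small helper (hypothetical, not part of the library) for pulling the ID out of a folder URL:

```python
def drive_folder_id(url: str) -> str:
    # ".../drive/folders/<ID>?usp=sharing" -> "<ID>"
    return url.rstrip("/").split("folders/")[1].split("?")[0]

folder_id = drive_folder_id("https://drive.google.com/drive/folders/1234abcd")
```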
Using the source directly:

```python
from vector_search.sources import GoogleDriveSource

# Initialize the source
source = GoogleDriveSource()
# Process files from a folder
for doc in source.load_documents("your_folder_id"):
    print(f"File: {doc['metadata']['source']}")
    print(f"Format: {doc['metadata']['format']}")
    print(f"Original MIME type: {doc['metadata']['original_mime_type']}")
    print(f"Content preview: {doc['text'][:100]}...")
```

Each document includes metadata:

```python
{
"source": "filename_without_extension",
"format": "md", # Format after conversion (e.g., 'md' for Google Docs)
"drive_id": "google_drive_file_id",
"mime_type": "text/markdown", # Export format
"original_mime_type": "application/vnd.google-apps.document" # Original format
}
```

The system handles errors gracefully:
- Invalid files are skipped with error messages
- Unsupported formats are ignored
- Authentication errors are reported clearly
If you encounter authentication errors:

- Delete the `token.json` file
- Run the token generation script again
- Make sure your Google account has access to the folder
The vector-search library supports text augmentation to enhance search quality by generating semantically similar variations of your text chunks. This can help improve recall and make the search more robust.
Uses Ollama's local LLMs to generate text variations. Ideal for when you want to keep data processing local or have privacy requirements.
```python
from vector_search.augmentation import OllamaAugmenter

augmenter = OllamaAugmenter(
model_name="llama3.1:8b", # Default model
base_url="http://localhost:11434" # Optional, defaults to OLLAMA_BASE_URL env var
)
```

Uses OpenAI's GPT models to generate high-quality text variations.

```python
from vector_search.augmentation import OpenAIAugmenter
augmenter = OpenAIAugmenter(
api_key="your_api_key", # Optional, defaults to OPENAI_API_KEY env var
model_name="gpt-3.5-turbo" # Default model
)
```

Uses Azure's OpenAI service for text augmentation, suitable for enterprise deployments.

```python
from vector_search.augmentation import AzureOpenAIAugmenter
augmenter = AzureOpenAIAugmenter(
api_key=None, # Optional, defaults to AZURE_OPENAI_API_KEY env var
api_version=None, # Optional, defaults to AZURE_OPENAI_API_VERSION env var
endpoint=None, # Optional, defaults to AZURE_OPENAI_ENDPOINT env var
deployment="text-davinci-003" # Default deployment
)
```

Using an augmenter:

```python
# Prepare your text chunks
chunks = [
    {
        "text": "Machine learning enables systems to learn from experience.",
        "metadata": {"source": "doc.txt"},
        "chunk_index": 0
    }
]
# Initialize an augmenter
augmenter = OllamaAugmenter()
# Generate augmented chunks
augmented_chunks = augmenter.augment(chunks)
# Each original chunk will be preserved and augmented versions will be added
for chunk in augmented_chunks:
    print(f"Text: {chunk['text']}")
    if chunk['metadata'].get('augmented'):
        print(f"Augmented version of chunk {chunk['metadata']['original_chunk_index']}")
```

Configure your augmenters using these environment variables in your `.env` file:

```
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
# OpenAI Configuration
OPENAI_API_KEY=your_openai_key
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_ENDPOINT=your_endpoint
```

Notes:

- Each augmenter preserves the original chunks and adds augmented versions
- Augmented chunks include metadata indicating:
  - `augmented: true` marks it as an augmented version
  - `original_chunk_index` references the source chunk
- The augmentation process maintains the core meaning while varying the text structure and wording
- All metadata from the original chunk is preserved in the augmented versions (see the sketch below)
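Given those metadata fields, separating originals from augmented variants is straightforward (continuing the usage example above):

```python
originals = [c for c in augmented_chunks if not c["metadata"].get("augmented")]
variants = [c for c in augmented_chunks if c["metadata"].get("augmented")]
print(f"{len(originals)} original chunks, {len(variants)} augmented variants")
```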
Best practices:

- Choose the appropriate augmenter based on your needs:
  - Ollama for local processing and privacy
  - OpenAI for high-quality results
  - Azure OpenAI for enterprise deployments
- Test augmented results to ensure they maintain semantic accuracy
- Consider the trade-off between augmentation quantity and processing time
- Use environment variables for API keys and configuration
The vector-search library includes AI-powered tag generation. You can either generate free-form tags or select from a predefined list, with support for multiple AI providers.
Generate tags freely based on the content:
```python
from vector_search.tags import OllamaTagGenerator

generator = OllamaTagGenerator(
model_name="llama3.1:8b", # Model to use
max_tags=3, # Maximum tags per chunk
temperature=0.3, # Model creativity (0.0 to 1.0)
base_url=None # Optional: Defaults to OLLAMA_BASE_URL env var
)
# Process chunks
chunks = [
    {
        "text": "Machine learning enables systems to learn from experience.",
        "metadata": {"source": "doc.txt"},
        "chunk_index": 0
    }
]
tagged_chunks = generator.generate_tags(chunks)
# Result: {"tags": ["machine learning", "artificial intelligence", "automation"]}
```

With OpenAI:

```python
from vector_search.tags import OpenAITagGenerator

generator = OpenAITagGenerator(
model_name="gpt-3.5-turbo", # OpenAI model to use
max_tags=3,
temperature=0.3,
api_key=None # Optional: Defaults to OPENAI_API_KEY env var
)
```

With Azure OpenAI:

```python
from vector_search.tags import AzureOpenAITagGenerator

generator = AzureOpenAITagGenerator(
deployment="your-deployment",
max_tags=3,
temperature=0.3,
api_key=None, # Optional: From AZURE_OPENAI_API_KEY
api_version=None, # Optional: From AZURE_OPENAI_API_VERSION
endpoint=None # Optional: From AZURE_OPENAI_ENDPOINT
)
```

Select tags from a predefined list:

```python
from vector_search.tags import OllamaPredefinedTagSelector
# Define allowed tags
predefined_tags = {
"machine learning",
"artificial intelligence",
"programming",
"python",
"data science",
"deep learning"
}
selector = OllamaPredefinedTagSelector(
predefined_tags=predefined_tags,
model_name="llama3.1:8b",
max_tags=3
)
# Process chunks
tagged_chunks = selector.generate_tags(chunks)
# Result: {"tags": ["machine learning", "artificial intelligence"]}
```

With OpenAI:

```python
from vector_search.tags import OpenAIPredefinedTagSelector

selector = OpenAIPredefinedTagSelector(
predefined_tags=predefined_tags,
model_name="gpt-3.5-turbo",
max_tags=3
)
```

With Azure OpenAI:

```python
from vector_search.tags import AzureOpenAIPredefinedTagSelector

selector = AzureOpenAIPredefinedTagSelector(
predefined_tags=predefined_tags,
deployment="your-deployment",
max_tags=3
)
```

Configure the tag generators using these environment variables in your `.env` file:

```
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
# OpenAI Configuration
OPENAI_API_KEY=your_openai_key
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_ENDPOINT=your_endpoint
```

The tag generation feature provides:

- Multiple AI Providers:
  - Ollama for local processing
  - OpenAI for cloud processing
  - Azure OpenAI for enterprise deployments
- Two Tag Generation Modes:
  - Free-form tag generation
  - Selection from predefined tags
- Customization Options:
  - Control the number of tags per chunk
  - Adjust model temperature
  - Choose specific models/deployments
- Error Handling:
  - Graceful handling of API errors
  - Preservation of original chunks on failure
  - Validation of predefined tags
- Metadata Integration:
  - Tags added to chunk metadata
  - Original metadata preserved
  - Chunk indices maintained
Each processed chunk includes:

```python
{
"text": "Original text content",
"metadata": {
"source": "original_source",
"format": "txt",
"tags": ["tag1", "tag2", "tag3"], # Generated/selected tags
# ... other original metadata ...
},
"chunk_index": 0
}
```

Best practices:

- Model Selection:
  - Use Ollama for privacy-sensitive data
  - Use OpenAI for high-quality results
  - Use Azure OpenAI for enterprise compliance
- Tag Generation:
  - Keep `max_tags` reasonable (2-5 recommended)
  - Use lower temperature (0.1-0.3) for consistency
  - Use higher temperature (0.5-0.7) for creativity
- Predefined Tags:
  - Keep the tag list focused and relevant
  - Use consistent formatting
  - Consider hierarchical relationships
- Error Handling:
  - Always check for empty tag lists (see the sketch below)
  - Validate predefined tags
  - Handle API rate limits
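A defensive pattern along those lines, reusing `generator` and `chunks` from the examples above (a sketch, not library code):

```python
tagged_chunks = generator.generate_tags(chunks)

for chunk in tagged_chunks:
    tags = chunk["metadata"].get("tags", [])
    if not tags:
        # Tagging failed for this chunk; the original chunk is preserved,
        # so it can be retried or skipped.
        print(f"No tags for chunk {chunk['chunk_index']}")
```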
Required packages:

```
langchain>=0.1.0
langchain-openai>=0.0.5
langchain-community>=0.0.15
```