Description
Disclaimer: this is a feature request. I am able to open a PR, but first I want to check whether this makes sense to other people as well.
Problem
Commonly, when using vector search, I want to know how relevant the results are. Not only do I want to get the most relevant results, but I also want to set a threshold and discard any result that does not surpass it.
Other frameworks such as LangChain support querying with scores (and, therefore, filtering by a threshold) by annotating the query with a distance parameter, then normalizing this value and filtering the results in the Python layer. This is useful because it does not change query complexity: the same indexes can be used, whether or not an ANN index exists at all.
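For illustration, a rough sketch of that pattern (assuming `store` is an already-initialized LangChain vector store; the exact method name can vary between versions and integrations):

# Query with normalized relevance scores, then apply the threshold in Python.
docs_with_scores = store.similarity_search_with_relevance_scores("some query", k=10)
relevant_docs = [doc for doc, score in docs_with_scores if score >= 0.7]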
My use cases involve querying through Django. While a Django-specific solution would help me, this support could be considered for other connectors as well.
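For reference, something close to this is already possible manually with pgvector's existing Django distance functions; a sketch assuming cosine distance and the `MyModel` / `query_embedding` names from the example below (a similarity threshold of 0.7 corresponds to a cosine distance below 0.3):

from pgvector.django import CosineDistance

# Cosine distance = 1 - cosine similarity, so "similarity >= 0.7" becomes
# "distance < 0.3"; today this conversion has to be done by hand.
matches = (
    MyModel.objects.annotate(distance=CosineDistance("embedding", query_embedding))
    .filter(distance__lt=0.3)
    .order_by("distance")
)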
Proposition
Add a model QuerySet that provides automatic annotation and filtering by distance and threshold. The interface could look like this:
from django.db import models

from pgvector.django import VectorField, PgVectorModelMixin


# Model definition
class MyModel(PgVectorModelMixin, models.Model):
    embedding = VectorField(dimensions=3)


# Database entries
good_match_embedding = MyModel(embedding=[1, 2, 3])
bad_match_embedding = MyModel(embedding=[-100, -100, -100])
good_match_embedding.save()
bad_match_embedding.save()

# An embedding similar to `good_match_embedding`
query_embedding = [1, 2, 4]

best_matches = MyModel.objects.similarity(embedding=query_embedding, threshold=0.7)
print(best_matches)  # A QuerySet containing only good_match_embedding
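A minimal implementation sketch, built on the existing distance annotations (the `PgVectorQuerySet` name, the `field` argument, and the cosine-based `1 - distance` normalization are assumptions on my part; `L2Distance` or `MaxInnerProduct` could be supported the same way):

from django.db import models
from django.db.models import ExpressionWrapper, FloatField

from pgvector.django import CosineDistance


class PgVectorQuerySet(models.QuerySet):
    def similarity(self, embedding, threshold=None, field="embedding"):
        distance = CosineDistance(field, embedding)
        # Annotate each row with a normalized similarity score
        # (cosine distance = 1 - cosine similarity), keeping the ORDER BY
        # on the raw distance expression, matching the documented
        # index-friendly ordering pattern.
        qs = self.annotate(
            similarity=ExpressionWrapper(1 - distance, output_field=FloatField())
        ).order_by(distance)
        if threshold is not None:
            qs = qs.filter(similarity__gte=threshold)
        return qs


class PgVectorModelMixin(models.Model):
    # Abstract base model that wires the custom QuerySet onto `objects`.
    objects = PgVectorQuerySet.as_manager()

    class Meta:
        abstract = True

Since `similarity()` returns an ordinary QuerySet, the result still composes with further `.filter()` calls and slicing.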