Vector Search Databases


Created: 05 Jan 2023, 09:40 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, tools


We have come close to involving machine learning on the fundamental level in the search engine experience: [encoding objects in a multidimensional multimodal space]{.mark}. This is different from a traditional keyword lookup (even if enhanced with synonyms / semantics) --- in so many interesting ways:

  • Collection-level similarity on object level. You can find [neighbors to your query using a similarity function (distance metric) instead of a sparse keyword lookup]{.mark}. In a BM25/TF-IDF approach with sharding, you would get document scores from incompatible shard-level collections (unless you set up a globally updated IDF cache).
  • Have a notion of [geometric similarity as a component in semantics]{.mark}, rather than only specific attributes of the raw object (in the case of text --- its keywords / terms).
  • [Multimodality]{.mark}: encode any object for which you have an encoder and a similarity measure (audio, video, image, text, genome, software virus, or a complex object such as code) and search seamlessly across such objects.

From <https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696>
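To make the sharding remark in the first bullet concrete, here is a minimal sketch of why shard-level scores are not directly comparable. It assumes a smoothed BM25-style IDF formula and a made-up six-document corpus split across two hypothetical shards; the names `idf`, `shard_a`, and `shard_b` are illustrative and not taken from the article.

```python
import math


def idf(term, docs):
    """Smoothed BM25-style IDF of `term` over a list of tokenized documents."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


# Hypothetical toy corpus, split across two shards (assumption for illustration).
shard_a = [["vector", "search"], ["vector", "index"], ["vector", "database"]]
shard_b = [["keyword", "search"], ["inverted", "index"], ["vector", "recall"]]

# "vector" appears in every document of shard A, so it carries little weight there.
idf_a = idf("vector", shard_a)
# "vector" is rare in shard B, so the same term carries much more weight there.
idf_b = idf("vector", shard_b)
# A globally maintained statistic gives one consistent weight for both shards.
idf_global = idf("vector", shard_a + shard_b)

print(f"shard A: {idf_a:.3f}  shard B: {idf_b:.3f}  global: {idf_global:.3f}")
```

The same term gets a different weight on each shard, so scores merged from both shards are on different scales unless the IDF statistics are shared. A distance between two vectors, by contrast, does not depend on which shard holds them.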

When machine learning comes into the picture, the database corresponds to a collection of vectors. Vectors can be seen as [high dimensional representations of the input data]{.mark} generated by machine learning algorithms. Similarity searching in this context means [searching for similar vectors for a given query vector based on some similarity or distance measure]{.mark}.
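As a rough illustration of what "similarity or distance measure" can mean, the sketch below implements two common choices, Euclidean distance and cosine similarity, in plain Python. The two-dimensional vectors are made up for the example; real embeddings typically have hundreds of dimensions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; larger means more similar."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumption for illustration).
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 1.0]

print(cosine_similarity(query, same_direction), euclidean_distance(query, same_direction))
print(cosine_similarity(query, orthogonal), euclidean_distance(query, orthogonal))
```

Note that the two measures can disagree: `same_direction` has a perfect cosine score with the query but is still a nonzero Euclidean distance away, so the choice of measure should match how the encoder was trained.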

A [naive way for searching based on similarity is to compare the query vector with every other vector]{.mark} in the database. But what if the database has [more than a million vectors]{.mark}? Enter [FAISS]{.mark}…

From <https://towardsdatascience.com/understanding-faiss-619bb6db2d1a>
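The naive approach described above can be sketched in a few lines. This is a plain-Python brute-force search over randomly generated vectors, not how Faiss works internally; `brute_force_knn` and the data sizes are assumptions chosen for illustration.

```python
import heapq
import math
import random


def brute_force_knn(query, vectors, k):
    """Return the indices of the k vectors closest to `query` (Euclidean distance).

    Every vector is compared against the query, so the cost grows linearly
    with the size of the database.
    """
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    # nsmallest keeps only k candidates at a time instead of sorting everything.
    return [i for _, i in heapq.nsmallest(k, scored)]


random.seed(0)
dim, n = 8, 1000  # toy sizes; real collections are far larger
database = [[random.random() for _ in range(dim)] for _ in range(n)]

query = database[42]  # query with a vector that is known to be in the database
neighbors = brute_force_knn(query, database, k=5)
print(neighbors)
```

Each query touches all `n` vectors and all `dim` components, which is why exhaustive search becomes impractical at the million-vector scale the excerpt mentions, and why libraries such as Faiss exist.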

This month (March 2017), we released [Facebook AI Similarity Search (Faiss)]{.mark}, a library that allows us to quickly search for multimedia documents that are similar to each other --- a challenge where traditional query search engines fall short. We’ve built [nearest-neighbor search implementations for billion-scale data sets that are some 8.5x faster than the previous reported state-of-the-art, along with the fastest k-selection algorithm on the GPU known in the literature.]{.mark} This lets us break some records, including the first k-nearest-neighbor graph constructed on 1 billion high-dimensional vectors.

From <https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/>

GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.

Welcome to Faiss Documentation --- Faiss documentation