Keyword, Semantic and Hybrid Search

Carbon enables both semantic and hybrid search across documents from connected data sources. However, users often wonder about the appropriate use cases and methods for each search technique. In this post, we break down each type of search and explain how to leverage them in Carbon.

To better understand hybrid search, we need to first delve into the search techniques it combines — semantic and keyword search.

Keyword Search

Keyword search retrieves documents by matching the specific words or phrases in a query. It relies on the assumption that pertinent documents will contain the queried keywords. The more closely a document matches the keywords, the higher its relevance.

Sparse vectors are what enable keyword search. These are high-dimensional vectors in which only a small number of values are non-zero. In keyword search, each sparse vector represents a document: the dimensions correspond to words from a dictionary, and the values indicate the significance of these words in the document.
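To make the idea concrete, here is a minimal sketch of sparse vectors built from raw word counts. It is an illustration only: the function names, the sample documents, and the use of plain counts as the "significance" values are assumptions, and this is not a description of how Carbon generates its sparse vectors.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct lowercase word its own dimension index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_sparse_vector(text, vocabulary):
    """Return {dimension: count}, storing only the non-zero entries."""
    counts = Counter(word for word in text.lower().split() if word in vocabulary)
    return {vocabulary[word]: count for word, count in counts.items()}


def sparse_dot(a, b):
    """Score two sparse vectors by the dimensions they share."""
    return sum(value * b[dim] for dim, value in a.items() if dim in b)


documents = [
    "quarterly revenue report for the sales team",
    "onboarding guide for new engineers",
]
vocabulary = build_vocabulary(documents)
doc_vectors = [to_sparse_vector(doc, vocabulary) for doc in documents]

# A query only scores against documents that contain its exact words.
query_vector = to_sparse_vector("revenue report", vocabulary)
scores = [sparse_dot(query_vector, vec) for vec in doc_vectors]
```

Each document vector has one dimension per vocabulary word, but only the handful of words that actually occur in the document are stored. Production systems typically replace raw counts with a weighting scheme such as TF-IDF or BM25.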

Although keyword search is widely used, it can be ineffective if the content doesn't include exact matches to the keywords in the search query.

Semantic Search

Semantic search takes into account the context and relationships between entities and concepts to deliver more accurate results, as opposed to merely relying on keyword matching. It can interpret user queries in a way that resembles human understanding, which enriches the search experience.

Semantic search delivers the most relevant results based on the high-level features of the queried text, rather than specific words. While it shares similarities with keyword search in providing relevant results based on distance metrics, semantic search differs in that it can return results without exact matches. This is made possible through dense vectors, generated by embedding models like SBERT, which numerically represent semantic meaning.
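The sketch below shows how ranking by a distance metric over dense vectors can work, using cosine similarity. The three-dimensional vectors are hand-written toy values chosen for illustration; in practice they would come from an embedding model, and the example texts are assumptions rather than real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, not produced by a real model).
toy_embeddings = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to recover account access": [0.8, 0.2, 0.1],
    "Quarterly revenue by region": [0.0, 0.1, 0.9],
}

# Assumed embedding for the query "forgot my login", which shares
# no words with "Steps to recover account access".
query_embedding = [0.85, 0.15, 0.05]

ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_embedding, toy_embeddings[text]),
    reverse=True,
)
```

Both account-related texts rank above the revenue text even though the query has no words in common with one of them, which is the behavior keyword matching alone cannot provide.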

While semantic search is powerful, it may not return the exact values you're seeking if you know precisely what you're searching for.

Hybrid Search

Hybrid search merges the strengths of both keyword and semantic search to provide a comprehensive and versatile search experience. It acknowledges the limitations of each method and aims to address them by combining their benefits.

In Carbon, hybrid search operates by first performing individual searches across both sparse and dense vectors simultaneously. It then uses a method known as Reciprocal Rank Fusion (RRF) to produce a consolidated result set, integrating the most pertinent outcomes from both keyword and semantic search.
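As a rough illustration of Reciprocal Rank Fusion, the sketch below scores each document by summing 1 / (k + rank) over every ranking in which it appears. The constant k = 60 is the value commonly used in the RRF literature; the constant Carbon uses, and the document IDs shown here, are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two individual searches.
keyword_ranking = ["doc_b", "doc_a", "doc_d"]
semantic_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
# fused == ["doc_a", "doc_b", "doc_c", "doc_d"]
```

Documents that rank well in both lists (doc_a and doc_b) rise to the top, while documents found by only one method are still retained lower in the fused list. Because RRF works on ranks rather than raw scores, it does not require the keyword and semantic scores to be on the same scale.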

Our testing has revealed that hybrid search is particularly effective for querying files and data sources with a lot of tabular data. This includes files such as CSV, XLSX, and PDF with tables, as well as data sources like Notion, Gitbook, and web pages.

The disadvantage of hybrid search is that it can increase latency compared to using only semantic search, because a sparse vector must also be generated for the search query. File processing may also take slightly longer, since sparse vectors have to be generated for the content after upload.

Implementing Search via Carbon

Hybrid search in Carbon can be activated at the user, file, and search levels.

To enable sparse vector generation for a specific user (customer-id), use the /modify_user_configuration endpoint. This action prompts the creation of a sparse vector index for that user.

When a user uploads content, either locally or from third-party connectors, sparse vector generation can be enabled. Only content with generated sparse vectors can be subjected to a hybrid search later.

To perform a hybrid search, set the hybrid_search parameter in the /embeddings request body to true. When hybrid search is active, it uses a mix of keyword search and semantic search to rank and select candidate embeddings during information retrieval. By default, these search methods have equal weight during ranking. To adjust the weight (or "importance") of each search method, use the hybrid_search_tuning_parameters property.
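A request body along these lines might be assembled as follows. Only hybrid_search and hybrid_search_tuning_parameters come from the description above; the query and k fields, the weight_a and weight_b keys, and the helper function itself are assumptions made for illustration and should be checked against Carbon's API reference before use.

```python
import json


def build_hybrid_search_payload(query, k=5, semantic_weight=0.5, keyword_weight=0.5):
    """Build a request body for /embeddings with hybrid search enabled.

    Hypothetical helper: field names other than hybrid_search and
    hybrid_search_tuning_parameters are assumptions.
    """
    for name, weight in (("semantic_weight", semantic_weight),
                         ("keyword_weight", keyword_weight)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {weight}")
    return {
        "query": query,  # assumed field name
        "k": k,          # assumed field name: number of results to return
        "hybrid_search": True,
        "hybrid_search_tuning_parameters": {
            "weight_a": semantic_weight,  # assumed key for the semantic weight
            "weight_b": keyword_weight,   # assumed key for the keyword weight
        },
    }


# Equal weights mirror the default behavior described above.
payload = build_hybrid_search_payload("Q3 revenue by region")
body = json.dumps(payload)
```

Serializing with json.dumps emits the boolean as lowercase true, which is the form a JSON API expects. Passing a larger keyword_weight would favor exact matches, which may suit the tabular data described earlier.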

You can find more specific instructions for enabling hybrid search here.

Start building with Carbon today.

COPYRIGHT @ 2024 JCDT DBA CARBON