RAG: Total win of vectors over keywords? Not yet!

Cyril Chirkunov
1 min read · Jul 4, 2024


In the era of embedding-based RAG demos, it’s easy to forget the decades of research in information retrieval. Embeddings are powerful, but they aren’t the ultimate solution. They excel at capturing high-level semantic similarity but can struggle with queries that hinge on exact keywords, such as names (e.g., Fujairah), acronyms (e.g., ADDC), or unique identifiers (e.g., Meta-Llama-3-70B-Instruct). This is where keyword-based ranking functions like BM25 still excel.
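To make the exact-match behaviour concrete, here is a minimal sketch of BM25 scoring in plain Python. The toy documents, the simple tokenizer, and the parameter values `k1=1.5` and `b=0.75` are illustrative assumptions, not taken from any particular engine; production systems such as Lucene add analyzers and index structures on top of the same idea.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "ADDC customer service opening hours",
    "Utility company contact information and support",
    "Fujairah weather forecast for the weekend",
]
scores = bm25_scores("ADDC opening hours", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the first document contains the acronym, so it is the only one with a non-zero score. The second document is topically close, which is where an embedding model could help, but a rare exact token like "ADDC" is what BM25 rewards most.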

After years of relying on keyword search, users expect a query for an exact term to return documents that contain that term. When it doesn’t, the experience is frustrating. Combining traditional keyword search with embedding-based retrieval helps meet that expectation and keeps retrieval accurate.

Keyword search is also easier to interpret, since we can see exactly which query terms matched a document. With embedding-based retrieval, it is harder to tell why a result was returned. In addition, keyword search, supported by mature systems like Lucene and OpenSearch, is usually more efficient.

⚡ A hybrid approach is often best: use keyword matching for clear hits and embeddings for synonyms, hypernyms, spelling errors, and multimodal content (e.g., images and text).
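One common way to combine the two result lists is reciprocal rank fusion (RRF). The post does not prescribe a fusion method, so treat the sketch below as one possible option under stated assumptions: the document IDs and both rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical outputs of a keyword retriever and an embedding retriever.
keyword_ranking = ["doc_a", "doc_c"]
embedding_ranking = ["doc_b", "doc_a", "doc_c"]

fused_ranking = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused_ranking)
```

RRF uses only rank positions, so it avoids having to put BM25 scores and cosine similarities on a common scale. In this toy run, the documents returned by both retrievers end up ahead of the one found by embeddings alone.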
