For an AI model to be useful in a specific context, it often needs access to background knowledge. For example, a customer support chatbot needs to understand the specific business it serves, while a legal analysis bot needs knowledge of a large body of historical cases.
Developers often use retrieval-augmented generation (RAG) to extend the knowledge of AI models. RAG retrieves relevant information from a knowledge base and appends it to the user's prompt, which significantly improves the model's responses. However, traditional RAG solutions strip away context when encoding information, which often prevents the system from retrieving the relevant information from the knowledge base.
To address this problem, Anthropic proposed a method that significantly improves the RAG retrieval step, called "Contextual Retrieval". It uses two sub-techniques: Contextual Embeddings and Contextual BM25.
Combining these two methods reduces the number of failed retrievals by 49%, and by 67% when a reranking step is added. These gains in retrieval accuracy translate directly into better performance on downstream tasks.
The experimental results show:
- Contextual embeddings reduce the retrieval failure rate by 35% (from 5.7% to 3.7%).
- Combining contextual embeddings and contextual BM25 reduces the retrieval failure rate by 49% (from 5.7% to 2.9%).
Technical Principles of Contextual Retrieval
The core idea of Contextual Retrieval is to improve retrieval accuracy by adding relevant contextual information to each document fragment (chunk) in the knowledge base, which matters especially for large knowledge bases and complex queries. It combines two main techniques, Contextual Embeddings and Contextual BM25, which handle semantic matching and exact matching respectively, ensuring that key contextual information is not lost during retrieval.
The following is the detailed technical principle of Contextual Retrieval:
1. Contextual Embeddings
Contextual Embeddings adds background information to each text fragment before generating its semantic embedding, so that during retrieval the model can more accurately understand the semantics and contextual relationships of these fragments.
1.1 Document segmentation and context supplementation
In traditional retrieval-augmented generation (RAG) systems, the knowledge base is split into many small text chunks, each usually containing only a few hundred words. When these fragments are embedded as vectors independently, they often lack sufficient background information, which leads to context loss. For example, a fragment may say only "revenue increased by 3%" without specifying which company or time period it refers to.
Contextual Embeddings solves this problem by generating a short contextual description for each text snippet. The specific steps are as follows:
- Add context to the fragment: Prepend the fragment's context (such as the document title, company name, time period, etc.) to form a more complete semantic unit. For example, "Revenue increased by 3%" becomes "This is ACME's second-quarter 2023 financial report: revenue increased by 3%."
- Generate semantic embeddings: Each contextualized fragment is then converted into a vector embedding that represents its semantic information. Because the embedding now reflects the fragment's complete meaning, the model can more accurately match it against queries at retrieval time (a minimal sketch follows this list).
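To make this concrete, here is a minimal sketch in Python. The `embed_texts` function is a placeholder for whatever embedding API you use (Voyage and Gemini are the ones named later in this post), and the context string is hardcoded for illustration; in practice it is generated by a model, as described in the next subsection.

```python
# Minimal sketch of Contextual Embeddings: prepend a chunk-specific
# context string to each chunk before embedding it.

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Placeholder: call your embedding provider (e.g. Voyage or Gemini) here."""
    raise NotImplementedError

chunks = ["Revenue increased by 3% compared to the previous quarter."]
# Hardcoded for illustration; in practice an LLM generates this from the document.
contexts = ["This chunk is from ACME's Q2 2023 financial report."]

contextualized = [f"{ctx} {chunk}" for ctx, chunk in zip(contexts, chunks)]
vectors = embed_texts(contextualized)  # index these vectors, not the bare chunks
```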
1.2 Context Generation
The context is usually generated automatically by a pretrained language model (such as Claude). Developers write a prompt instructing the model to produce a background description for each fragment, rather than adding context by hand. This makes processing large-scale knowledge bases far more efficient. A hedged sketch of how this might look with the Anthropic Python SDK follows.
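In the sketch below, the prompt follows the one Anthropic published alongside the technique, while the model name and token limit are assumptions to adjust for your own setup.

```python
# Sketch: ask Claude to generate a short situating context for each chunk.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """\
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall \
document for the purposes of improving search retrieval of the chunk. \
Answer only with the succinct context and nothing else."""

def situate_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumption: a small, cheap model suffices
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    return response.content[0].text
```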
2. Contextual BM25
BM25 is a classic text matching algorithm that handles exact matches of specific phrases or terms. In Contextual Retrieval, BM25 is used to address specific matching issues that may be overlooked by embedding models. For example, embedding models are usually good at capturing broad semantic similarities, but may miss exact matches in queries (such as technical terms, codes, error codes, etc.).
2.1 How BM25 works
BM25 is an improvement on term frequency-inverse document frequency (TF-IDF) and is used to calculate the match between query terms and document fragments. It achieves exact matching through the following steps:
- Term frequency: BM25 counts how often each query term occurs in a document as a measure of the term's importance within that document, with the count saturated so that repeating a term many times yields diminishing returns. The inverse-document-frequency component downweights common words (such as "the" and "a") and upweights rare technical terms and proper nouns.
- Document length normalization: BM25 also accounts for document length, preventing long documents from scoring too highly simply because they contain more words. Term frequencies are normalized against the average document length so that short and long documents are matched fairly. A self-contained toy implementation follows this list.
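The sketch below is a minimal Okapi BM25 implementation illustrating both ingredients; the k1 and b parameters use conventional default values, and a production system would normally rely on a library such as rank_bm25 or a search engine instead.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25: score each tokenized doc against a tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Inverse document frequency: rare terms count for more than common ones.
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Saturating term frequency plus document-length normalization.
            numer = tf[term] * (k1 + 1)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf[term] * numer / denom
        scores.append(score)
    return scores

docs = [["error", "code", "ts-999", "means", "timeout"],
        ["revenue", "increased", "by", "3%", "this", "quarter"]]
print(bm25_scores(["ts-999"], docs))  # the first doc wins on the exact term
```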
2.2 Combination of Contextual BM25
Contextual Retrieval combines BM25 with the embedding model to achieve the dual effects of exact matching and semantic matching:
- Semantic matching: The embedding model captures the semantic similarity between the query and the knowledge base fragment by generating semantic embeddings.
- Exact matching: BM25 captures direct matches between specific words or phrases in the query (such as the error code "TS-999") and the documents.
Contextual Retrieval fuses the results of these two methods (one common fusion approach is sketched below), ensuring that the model can reason semantically without ignoring exact keywords in the query.
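The post does not specify the exact fusion formula, so the sketch below uses reciprocal rank fusion (RRF), a common and robust choice: each chunk's fused score is the sum of 1/(k + rank) across the two rankings.

```python
# Sketch: fuse a semantic ranking and a BM25 ranking with reciprocal rank
# fusion (RRF). This is one common choice, not necessarily Anthropic's.

def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Each ranking is a list of chunk indices, best first."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_rank = [3, 1, 7]  # from the embedding index
lexical_rank = [7, 3, 9]   # from BM25
print(rrf_fuse([semantic_rank, lexical_rank]))  # chunks 3 and 7 float to the top
```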
3. Context-enhanced retrieval process
In Contextual Retrieval, the entire retrieval process includes the following steps:
- Chunking and context generation:
- Split the knowledge base into small chunks, each containing no more than a few hundred words.
- Use the model to generate contextual descriptions for each chunk, ensuring that the snippet contains sufficient background information when it is retrieved.
- Generate embeddings and BM25 indexes:
- Each snippet with context is converted into a semantic embedding vector through an embedding model such as Gemini or Voyage.
- At the same time, the fragments are indexed for BM25 to support subsequent exact matching.
- Search and sort:
- When a user enters a query, the system first performs semantic retrieval in the vector database through the embedding model to find the fragments closest to the query in meaning.
- At the same time, the BM25 system searches for the exact terms in the query and finds documents that match the terms.
- The semantic retrieval results are fused with the exact match results of BM25 and the results are re-ranked.
- Context Fusion and Generation:
- The most relevant snippets are picked and added to the input prompt of the generative model to produce the final response. A minimal end-to-end sketch of the whole pipeline follows.
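The sketch below ties the previous pieces together. It reuses the helper functions sketched earlier (`situate_chunk`, `embed_texts`, `bm25_scores`, `rrf_fuse`); all names are illustrative rather than a published API.

```python
import numpy as np

def build_index(document: str, chunks: list[str]):
    # 1. Chunking and context generation (chunks are assumed pre-split).
    ctx_chunks = [f"{situate_chunk(document, c)} {c}" for c in chunks]
    # 2. Embeddings and a BM25 index over the contextualized chunks.
    vectors = np.array(embed_texts(ctx_chunks))
    tokenized = [c.lower().split() for c in ctx_chunks]
    return ctx_chunks, vectors, tokenized

def retrieve(query: str, ctx_chunks, vectors, tokenized, top_k: int = 20):
    # 3a. Semantic retrieval: cosine similarity against the query embedding.
    q = np.array(embed_texts([query])[0])
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    semantic_rank = list(np.argsort(-sims))
    # 3b. Exact matching via BM25 over the same chunks.
    lexical = np.array(bm25_scores(query.lower().split(), tokenized))
    lexical_rank = list(np.argsort(-lexical))
    # 3c. Fuse the two rankings and keep the best candidates.
    fused = rrf_fuse([semantic_rank, lexical_rank])
    # 4. The caller adds these chunks to the generation prompt.
    return [ctx_chunks[i] for i in fused[:top_k]]
```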
4. Application of Reranking Technology
In order to further improve the accuracy of contextual retrieval, reranking technology is introduced into the retrieval process. Specifically, after obtaining multiple candidate fragments through preliminary retrieval, the reranking model will score them according to the relevance and importance of the fragments to ensure that the most relevant fragments are ranked first. This step can optimize the retrieval results and reduce the overhead of processing redundant information.
In the final step, contextual retrieval can be combined with another technique to further improve performance. In traditional RAG, the AI system searches its knowledge base to find potentially relevant chunks of information. For large knowledge bases, this initial search typically returns a large number of text chunks—sometimes hundreds—of varying relevance and importance.
Re-ranking is a common filtering technique to ensure that only the most relevant chunks of text are passed to the model. Re-ranking not only provides better responses, but also reduces costs and latency because the model has less information to process. The key steps are as follows:
- Perform a preliminary search to obtain the most likely relevant chunks (we used the top 150);
- Pass the top N chunks along with the user query to the reranking model;
- Use the reranking model to score each chunk by its relevance and importance to the prompt, then select the top K chunks (we used the top 20);
- Pass the top K chunks to the model as context to generate the final result. A sketch using one commercial reranking API follows this list.
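As a sketch, here is what the reranking step might look like with Cohere's rerank API (one commercial reranker among several; the model name is an assumption):

```python
# Sketch: rerank the ~150 fused candidates down to the top 20 with a
# dedicated reranking model before prompting the generator.
import cohere

co = cohere.Client()  # reads the API key from the environment

def rerank_chunks(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    result = co.rerank(
        model="rerank-english-v3.0",  # assumption: pick the model for your language
        query=query,
        documents=candidates,
        top_n=top_k,
    )
    # Results come back sorted by relevance score, highest first.
    return [candidates[r.index] for r in result.results]
```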
Cost and latency considerations
An important consideration when using reranking is the impact on latency and cost, especially when a large number of chunks need to be reranked. Since reranking adds an extra step at runtime, it inevitably adds a little latency, even though the reranking model scores all chunks in parallel. Reranking more chunks results in better performance, but also increases latency and cost. We recommend experimenting to find the right balance for your specific use case.
Using Prompt Caching to reduce the costs of Contextual Retrieval
In Contextual Retrieval, prompt caching can significantly reduce cost and latency, especially when dealing with large-scale knowledge bases. Prompt caching optimizes the retrieval process by eliminating the overhead of repeated operations. The following are specific implementation methods and their cost optimization effects:
What is Prompt Caching?
Prompt caching is an optimization technique that avoids reloading the same reference documents on every query. By caching commonly used prompts or documents, the system can reference the cached content directly in subsequent API calls instead of reprocessing the entire knowledge base each time. This greatly reduces repeated computation, lowering operating costs and latency. (You can read the prompt caching documentation to learn how it works.)
Role in Contextual Retrieval
In contextual retrieval, documents are divided into many small chunks and a context description is added to each chunk. Since the context of a document fragment may be reused across multiple retrievals, prompt caching allows the model to generate context and cache the results during the first retrieval; subsequent retrievals can then pull the context information directly from the cache without regenerating it.
How to use prompt caching?
- Loading a document for the first time: When the system encounters a document for the first time, it loads the document into the model and generates contextual embeddings for each document fragment as needed.
- Caching context information: This context information and fragment embeddings are cached to avoid repeated generation in subsequent retrievals. The model can quickly complete retrieval by simply referencing the previously cached context.
- Reduced computational cost: When a subsequent retrieval involves the same document fragment, the model can pull the already-generated context from the cache without reprocessing the document or regenerating embeddings, which greatly reduces computational overhead. A hedged sketch of the API usage follows this list.
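Here is a hedged sketch of the mechanics with the Anthropic API: the full document is placed in a prompt block marked with `cache_control`, so the many per-chunk context-generation calls for that document can reuse the cached document tokens instead of reprocessing them (the model name and prompt wording are assumptions):

```python
# Sketch: reuse cached document tokens across per-chunk context calls.
import anthropic

client = anthropic.Anthropic()

def situate_chunk_cached(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumption
        max_tokens=150,
        system=[{
            "type": "text",
            "text": f"<document>\n{document}\n</document>",
            "cache_control": {"type": "ephemeral"},  # cache the document tokens
        }],
        messages=[{
            "role": "user",
            "content": (
                f"Here is the chunk we want to situate:\n<chunk>\n{chunk}\n</chunk>\n"
                "Give a short succinct context situating this chunk within the "
                "overall document. Answer only with the succinct context."
            ),
        }],
    )
    return response.content[0].text
```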
Specific cost optimization effect
The Anthropic team's experimental results show that prompt caching can significantly reduce the cost and latency of contextual retrieval:
- Reduced cost: Prompt caching can cut the cost of generating contextualized chunks by up to 90%. For a knowledge base of 1 million document tokens, generating context for every chunk costs about $1.02 as a one-time expense, and with caching, subsequent retrievals incur almost no additional cost (a back-of-envelope check follows this list).
- Reduced latency: Prompt caching can cut retrieval latency by more than half (over a 2x reduction), because the system no longer needs to regenerate context or reload documents on every call.
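A quick back-of-envelope check of that cost figure (the knowledge-base size here is an assumption for illustration):

```python
# Rough cost estimate using the post's ~$1.02 per million document tokens
# (one-time, with prompt caching enabled).
doc_tokens = 10_000_000          # assumption: a 10M-token knowledge base
cost_per_million_tokens = 1.02   # figure quoted in the post
one_time_cost = doc_tokens / 1_000_000 * cost_per_million_tokens
print(f"one-time contextualization cost: ${one_time_cost:.2f}")  # $10.20
```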
How to implement prompt caching?
Implementing prompt caching is relatively simple. Here are the general steps provided by Anthropic:
- Cache loading: When first retrieved, the document and its context embedding are stored in the cache.
- Cache reference: In subsequent API calls, the system directly references the content in the cache without repeated processing.
- Update the cache regularly: For documents or dynamic knowledge bases that change frequently, you can set up a mechanism to refresh the cache regularly to ensure that the content in the cache is always kept up to date.
Applicable scenarios
Prompt caching is particularly useful in the following scenarios:
- Frequently reused knowledge bases: If certain documents or knowledge bases are accessed and retrieved frequently, prompt caching can significantly reduce the cost of repeatedly generating context.
- Large-scale knowledge base: When processing extremely large document collections, prompt caching can effectively improve retrieval efficiency and reduce the consumption of computing resources.
Technical Advantages of Contextual Retrieval
Compared with the traditional retrieval-augmented generation (RAG) system, Contextual Retrieval has significant technical advantages in the following key aspects:
1. Improved context completeness
Traditional RAG systems tend to lose contextual information when splitting documents into small chunks for processing. This results in the retrieved chunks lacking critical context, reducing the relevance and accuracy of the retrieval results. Contextual Retrieval automatically generates contextual explanations for each chunk, ensuring that sufficient context information is retained when chunking.
Technical advantages:
- By adding specific context to each fragment (such as company name, time, event, etc.), we ensure that the fragment can provide complete background information when searching and avoid information fragmentation.
- The semantic coherence of the fragments is enhanced, enabling the retrieval system to understand the relationship between the query and the document more accurately.
2. Combination of exact matching and semantic matching
Contextual Retrieval combines Contextual Embeddings with BM25, an exact word-matching algorithm, to achieve a balance between semantic matching and exact matching.
Technical advantages:
- Contextual Embeddings provide deep semantic understanding, can handle complex natural language queries, and capture the semantic relationship between queries and document fragments.
- BM25 provides matching based on exact words or phrases, and is particularly suitable for processing queries that contain technical terms, codes, error codes, etc. that require precise positioning.
- The combination of the two enables the system to handle a wide range of semantic similarities while ensuring accurate retrieval of keywords or phrases.
3. Higher retrieval accuracy
Experiments show that the system using Contextual Retrieval can significantly reduce the retrieval failure rate compared to the traditional RAG solution. Through contextual supplementation and multiple retrieval techniques, the system can more accurately find the most relevant fragments.
Technical advantages:
- Context embedding reduces false detection of irrelevant or partially matched fragments and improves the accuracy of system retrieval.
- Combined with contextual BM25 technology, it can perform better when processing technical queries, identifiers, and exact text matches, greatly improving retrieval accuracy.
- In complex query scenarios, Contextual Retrieval reduces the retrieval failure rate by 49%; combined with reranking, the reduction reaches 67%.
4. Scalability and efficiency improvement
Contextual Retrieval provides an efficient way to process large-scale knowledge bases, especially for those situations where it is not possible to embed all information directly into the prompt. Through the context-enhanced retrieval method, large-scale data can be effectively processed without significantly increasing the complexity of the system.
Technical advantages:
- Support for large-scale knowledge bases: Contextual Retrieval can efficiently process knowledge bases with millions of documents, ensuring that the system can accurately find relevant fragments when faced with huge amounts of data.
- Cost and efficiency via prompt caching: Through Claude's prompt caching technology, Contextual Retrieval can significantly reduce computing overhead and latency when the same knowledge base is retrieved multiple times, thereby improving system efficiency.
5. Reranking enhancement
In Contextual Retrieval, reranking technology further improves the relevance of retrieval results. By scoring and ranking the initially retrieved snippets, the system can ensure that the snippets output in the end are the most relevant to the user's query.
Technical advantages:
- More accurate search results: Re-ranking technology can filter out low-relevance fragments and prioritize the most relevant content, further improving the quality of answers generated by the model.
- Reduce redundant information: Reduce the number of irrelevant or less relevant fragments processed by the system, optimize the use of computing resources, and speed up the generation speed.
Reranked contextual embeddings plus contextual BM25 reduce the retrieval failure rate for the top 20 chunks by 67% (5.7% → 1.9%).
6. Easy to implement and adaptable
Contextual Retrieval enables developers to quickly deploy the technology by automatically generating context and easily integrating it with existing retrieval systems such as RAG.
Technical advantages:
- Automatic context generation: Instead of manually adding context information to each snippet, developers can generate context descriptions automatically through preset prompts, adapting to application requirements in different fields.
- Compatible with existing technologies: Contextual Retrieval can be flexibly combined with embedding models, BM25, and reranking models. Developers can adjust these modules to suit specific needs and obtain the best retrieval results.
Summary
The technical advantages of Contextual Retrieval are as follows:
- Contextual completeness: Ensure that the fragment contains sufficient background information when it is retrieved to reduce information loss.
- Combination of precise and semantic matching: By combining Contextual Embeddings and BM25, a balance between semantic reasoning and precise word matching is achieved.
- Improve retrieval accuracy: Significantly improve retrieval accuracy and reduce retrieval failure rate in complex query scenarios.
- Efficiently handle large-scale knowledge bases: Even in the face of huge knowledge bases, Contextual Retrieval can still operate efficiently and is suitable for a wide range of application scenarios.
- Enhanced re-ranking technology: Further optimize the relevance and accuracy of search results through re-ranking technology.
- Easy to implement and apply: Developers can quickly deploy contextual retrieval technology through automated context generation and modular retrieval combinations.
These advantages make Contextual Retrieval an ideal solution for improving the efficiency and accuracy of large-scale knowledge base retrieval, and is suitable for scenarios that require high-precision information retrieval and complex query processing.
Experimental Results of Contextual Retrieval
In the Contextual Retrieval experiment, the Anthropic team conducted an in-depth test of its performance, and the results showed that this technology significantly improved the accuracy and efficiency of retrieval. The following are the specific performance of the experimental results:
1. Experimental Results of Contextual Embeddings
By introducing context into each text segment, Contextual Embeddings yields significant gains. The main experimental results:
- The retrieval failure rate dropped by 35%: By adding contextual information to each text snippet, the retrieval model can more accurately understand the semantics of the query, resulting in a drop in the retrieval failure rate from 5.7% to 3.7%.
- Enhanced semantic retrieval performance: Contextual Embeddings performs well on semantic similarity tasks and can better capture the contextual relevance of text fragments to queries, especially in long documents or complex query scenarios.
2. Experimental Results of Contextual Embeddings + BM25
After adding BM25 (an exact word-matching algorithm), Contextual Retrieval further improved retrieval performance in the experiments:
- The retrieval failure rate dropped by 49%: By combining context embedding and BM25 technology, the system achieved a good balance between semantic matching and exact matching, and the retrieval failure rate dropped from 5.7% to 2.9%.
- More accurate exact matches: The introduction of BM25 enables the system to perform better when processing exact matches such as technical terms, code identifiers, or specific error codes, especially in scenarios where precise retrieval of specific terms is required.
3. Experimental results of reranking
The addition of re-ranking technology further improves the retrieval effect of the system. By re-ranking the initial retrieved fragments, the most relevant fragments are prioritized. The experimental results show:
- Retrieval failure rate dropped by 67%: After incorporating reranking, the retrieval failure rate dropped from 5.7% to 1.9%, meaning the system is far more accurate at finding the text snippets most relevant to the query.
- Further improve the relevance of query results: Reranking technology ensures that the information processed by the model is more accurate, reduces the interference of irrelevant information on the generated results, and improves the quality of answers generated by the system.
4. Cross-domain experimental performance
The experiments cover different domains, including codebases, fiction, ArXiv papers, scientific papers, etc. The results show that Contextual Retrieval significantly improves retrieval performance in all domains. (Appendix II contains example questions and answers for each domain.)
- Performance at Retrieval@20: In the evaluation, the system's recall@20 (its ability to surface relevant information within the top 20 retrieval results) improved significantly; after combining contextual retrieval with reranking, the average retrieval failure rate fell by 67%.
- Consistent performance across domains: Whether processing technical or non-technical documents, Contextual Retrieval achieves significant improvements in performance across all datasets, demonstrating its wide applicability.
5. Experimental Summary
- Combination of Contextual Embeddings and Contextual BM25: Contextual Retrieval, which combines contextual embeddings and exact matching, significantly reduces the retrieval failure rate and improves the system's retrieval ability in complex knowledge bases.
- Further improvement through re-ranking: By incorporating re-ranking technology, the relevance and accuracy of system retrieval are significantly enhanced, especially in large-scale knowledge bases.
- Applicable to multiple fields: Experimental results show that Contextual Retrieval significantly improves the retrieval effect in all scenarios, whether it is processing technical or non-technical documents.
This shows that Contextual Retrieval performs well in improving the retrieval task of large-scale knowledge bases, and is particularly suitable for complex query scenarios that require high-precision retrieval.
Conclusion
We conducted extensive testing comparing different combinations of all the above techniques (embedding models, use of BM25, use of contextual retrieval, use of reranking, and the total number of top K results retrieved) across a wide variety of dataset types. Here is a summary of our findings:
- Embedding+BM25 performs better than embedding alone;
- Voyage and Gemini are the best embedding models we have tested;
- Passing the top 20 chunks to the model is more effective than passing only the top 10 or top 5;
- Adding context to chunks greatly improves retrieval accuracy;
- Reranking beats no reranking;
- All of these improvements stack: to maximize performance, combine contextual embeddings (from Voyage or Gemini), contextual BM25, a reranking step, and pass the top 20 chunks to the prompt.
Developers using the knowledge base can experiment with these methods through the user manual to unlock new levels of performance.
Appendix I
Below is the breakdown of Retrievals@20 results by dataset, embedding provider, whether BM25 is used in addition to embeddings, whether contextual retrieval is used, and whether reranking is used.
See Appendix II for the corresponding breakdowns at Retrievals@10 and @5, as well as example questions and answers for each dataset.