
Beyond RAG: How cache-augmented generation reduces latency and complexity for smaller workloads




Retrieval-augmented generation (RAG) has become the default method for customizing large language models (LLMs) with bespoke information. However, RAG comes with upfront technical costs and can be slow. Now, thanks to advances in long-context LLMs, enterprises can bypass RAG by inserting all of their proprietary information directly into the prompt.

A new study by researchers at National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can create customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient alternative to RAG in enterprise settings where the knowledge corpus fits within the model's context window.

The limitations of RAG

RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents that are relevant to the request and adds them to the prompt as context so the LLM can generate more accurate responses.
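
As a rough illustration, a minimal RAG loop can be sketched as follows; the embedding model, chunking and prompt wording here are illustrative assumptions, not the setup used in the study.

```python
# Minimal RAG sketch: embed chunks, retrieve the closest ones, build the prompt.
# The sentence-transformers model and the prompt template are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec              # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]      # indices of the top-k chunks
    return [chunks[i] for i in best]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(retrieve(query, chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

The prompt returned by build_prompt is then sent to the LLM, so every request pays for an extra retrieval round trip before generation can start.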

However, RAG introduces several limitations for LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can harm the retrieval process.

More generally, RAG adds complexity to the LLM application, requiring the development, integration and maintenance of additional components. The added overhead slows down the development process.

Cache-augmented generation

RAG (top) vs CAG (bottom) (source: arXiv)

An alternative to the RAG pipeline is to insert the entire document corpus into the prompt and have the model pick out the parts that are relevant to the request. This approach removes the complexity of the RAG pipeline and the problems caused by retrieval errors.

However, front-loading all documents into the prompt poses three key challenges. First, a long prompt slows down the model and increases the cost of inference. Second, the length of the LLM's context window sets a limit on the number of documents that fit in the prompt. And finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its answers. Simply stuffing all of your documents into the prompt instead of selecting the most relevant ones can therefore end up hurting the model's performance.

The CAG approach leverages three key trends to address these challenges.

First, advanced caching techniques are making it faster and cheaper to process prompt templates. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens (the KV cache) in advance instead of doing so when requests arrive. This upfront computation reduces the time it takes to process user requests.
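
Conceptually, the precomputation step looks something like the sketch below with Hugging Face transformers; the model name, document file and generation settings are assumptions, and exact cache handling varies across library versions, so treat this as an outline rather than the authors' implementation.

```python
# Sketch of CAG-style KV-cache precomputation with Hugging Face transformers.
# Model name and document path are placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 1) Encode the knowledge documents once and precompute their attention (KV) cache.
knowledge = "Answer questions from the documents below.\n\n" + open("knowledge_base.txt").read()
knowledge_ids = tokenizer(knowledge, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    knowledge_cache = model(knowledge_ids, use_cache=True).past_key_values

# 2) At query time, reuse the cached prefix so only the new question is processed.
def answer(question: str) -> str:
    query_ids = tokenizer("\n\nQuestion: " + question, return_tensors="pt").input_ids.to(model.device)
    full_ids = torch.cat([knowledge_ids, query_ids], dim=-1)
    cache = copy.deepcopy(knowledge_cache)  # copy so the precomputed cache stays reusable
    out = model.generate(full_ids, past_key_values=cache, max_new_tokens=200)
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)
```

The key point is that the knowledge prefix is encoded only once; each incoming question adds just a handful of new tokens on top of the cached state.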

Leading LLM providers such as OpenAI, Anthropic and Google offer prompt-caching features for the repetitive parts of your prompt, which can include the knowledge documents and instructions that you place at the beginning of the prompt. With Anthropic, you can reduce costs by up to 90% and latency by 85% on the cached parts of your prompt. Equivalent caching features have also been developed for open-source LLM-hosting platforms.
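
On the managed-API side, provider-level prompt caching looks roughly like the following sketch with the Anthropic SDK; the model name, document file and question are placeholders, not part of the study.

```python
# Sketch of provider-side prompt caching with the Anthropic SDK.
# The cache_control marker asks the API to cache the prompt prefix up to that block,
# so repeated requests reuse the already-processed knowledge documents.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
knowledge_docs = open("knowledge_base.txt").read()  # placeholder corpus file

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer questions using only the documents below."},
        {
            "type": "text",
            "text": knowledge_docs,
            "cache_control": {"type": "ephemeral"},  # cache this prefix for subsequent calls
        },
    ],
    messages=[{"role": "user", "content": "What do the documents say about renewal terms?"}],
)
print(response.content[0].text)
```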

Second, long-context LLMs are making it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens and Gemini supports up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.

And finally, advanced training methods are enabling models to better retrieve, reason over and answer questions about very long sequences. In the past year, researchers have developed several benchmarks for long-sequence tasks, including BABILong, LongICLBench and RULER. These benchmarks test LLMs on hard problems such as retrieving multiple pieces of information and multi-hop question answering. There is still room for improvement in this area, but AI labs continue to make progress.

As newer generations of models continue to expand their context windows, they will be able to process larger knowledge collections. Moreover, we can expect models to continue improving in their ability to extract and use relevant information from very long contexts.

“These two features will greatly expand the applicability of our method, making it applicable to complex and diverse tasks,” the researchers write. “As a result, our approach is poised to be a powerful and flexible solution for the most demanding jobs, supporting the growth of the next generation of LLMs.”

RAG vs CAG

To compare RAG and CAG, the researchers ran experiments on two widely recognized question-answering benchmarks: SQuAD, which focuses on context-aware Q&A over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.

They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use in answering the question. Their experiments show that CAG outperformed both RAG systems in most situations.
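
For reference, the sparse baseline behaves roughly like the BM25 sketch below (using the rank_bm25 package); the whitespace tokenization and placeholder passages are simplifications, not the paper's exact configuration.

```python
# Sparse-retrieval baseline sketch: rank passages with BM25 and keep the top ones.
from rank_bm25 import BM25Okapi

passages = ["Passage one ...", "Passage two ...", "Passage three ..."]  # placeholder corpus
bm25 = BM25Okapi([p.lower().split() for p in passages])

def retrieve_bm25(question: str, top_k: int = 3) -> list[str]:
    # Score every passage against the question and return the best matches.
    return bm25.get_top_n(question.lower().split(), passages, n=top_k)

context = "\n\n".join(retrieve_bm25("Who founded the company?"))
# The retrieved context is then prepended to the question and sent to the LLM,
# whereas CAG skips this step entirely and preloads all passages.
```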

CAG outperforms both sparse RAG (BM25 retrieval) and dense RAG (OpenAI embeddings) (source: arXiv)

“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information,” the researchers write. “This advantage is particularly evident in scenarios where RAG systems might retrieve incomplete or irrelevant passages, leading to suboptimal answer generation.”

CAG also significantly reduces the time it takes to generate the answer, particularly as the length of the reference text grows.

CAG generation time is less than RAG (source: arXiv)

That said, CAG is not a silver bullet and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit within the model's context window. Enterprises should also be careful of cases where their documents contain conflicting facts depending on their context, which can confuse the model during inference.

The best way to determine whether CAG is a good fit for your application is to run a few experiments. Fortunately, implementing CAG is straightforward, and it should always be considered a first step before investing in more development-intensive RAG solutions.


