Hallucinations, or wrong answers, continue to plague large language models (LLMs). Models falter especially when they are given complex tasks and when users are looking for specific, highly detailed responses.
It’s a challenge data scientists have struggled to overcome, and now researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses grounded in long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers.
Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.
As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all ranked above 61.7% in accuracy.
The researchers say the leaderboard will be actively maintained and continually updated to include new models and their different iterations.
“We believe that this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases,” the researchers write in a technical paper published this week.
Ensuring factual accuracy in LLM responses is difficult because of both modeling factors (architecture, training and inference) and measurement factors (evaluation methodologies, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given previous tokens.
“Although this objective may teach models general world knowledge, it does not directly optimize the model for factuality across different scenarios, instead encouraging the model to generate generally plausible text,” the researchers write.
To address this, the FACTS dataset includes 1,719 examples (860 public and 859 private), each requiring a long-form response grounded in the content of a provided document. Each example comprises a system prompt, a user request and a long context document.
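For illustration only, the sketch below models one benchmark example as a small Python dataclass. The field names (system_instruction, user_request, context_document) and the prompt assembly are assumptions based on the description above, not the official schema of the Kaggle release.

```python
from dataclasses import dataclass

@dataclass
class FactsExample:
    """One FACTS Grounding example, as described in the article.

    Field names are illustrative assumptions, not the official
    column names of the Kaggle dataset.
    """
    system_instruction: str   # system prompt constraining the response
    user_request: str         # the task, e.g. a question or summarization request
    context_document: str     # long document the answer must be grounded in

# Hypothetical usage: assemble a prompt for the model under test.
example = FactsExample(
    system_instruction="Answer only using the provided document.",
    user_request="Summarize the main reasons the company's Q3 revenue fell.",
    context_document="(long financial report text...)",
)
prompt = f"{example.system_instruction}\n\n{example.context_document}\n\n{example.user_request}"
```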
To be labeled “accurate,” the model must process the long document and produce a subsequent long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document, or are not highly relevant or useful.
For example, a user might ask a model to summarize the main reasons why a company’s revenue fell in Q3, providing it with detailed material including the company’s financial report that covers quarterly earnings, expenses, planned investments and market analysis.
If the model then returned, say, “the company faced challenges in Q3 that affected its revenue,” the response would be deemed inaccurate.
“The response avoids specifying any reasons, such as market conditions, increased competition or operational setbacks, which would likely be in the document,” the researchers explain. “It also doesn’t demonstrate an attempt to engage with or extract relevant details.”
Conversely, if a user asked, “What are some tips for saving money?” and provided a document compiling money-saving tips for college students, a correct response would be highly detailed: “Take advantage of free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”
To allow for diverse inputs, the researchers incorporated documents of varying lengths, up to 32,000 tokens (roughly the equivalent of 20,000 words). These cover areas including finance, technology, business, medicine and law. User requests are also broad, spanning Q&A generation, requests for summarization and rewriting.
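As a rough illustration of that 32,000-token ceiling, the snippet below counts tokens with the open-source tiktoken library; the benchmark’s actual tokenizer is not stated in the article, so the cl100k_base encoding here is only an assumption.

```python
import tiktoken  # pip install tiktoken

# Assumption: the benchmark's exact tokenizer is unspecified; cl100k_base
# is used here purely to illustrate the ~32,000-token document ceiling.
MAX_CONTEXT_TOKENS = 32_000
_ENCODING = tiktoken.get_encoding("cl100k_base")

def fits_in_context(document: str) -> bool:
    """Return True if the document is within the assumed token budget."""
    return len(_ENCODING.encode(document)) <= MAX_CONTEXT_TOKENS

print(fits_in_context("A short financial report..."))  # True for short texts
```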
Each model is judged in two phases. First, responses are evaluated for eligibility: if they do not satisfy the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the provided document.
These factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet) that determine individual scores based on the share of model outputs judged accurate. The final factuality determination is then based on the average of the three judges’ scores.
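To make the two-phase judging concrete, here is a minimal sketch, assuming a toy stand-in for the judge calls: each of the three judges scores the fraction of responses that are both eligible and grounded, and the final factuality score is the mean across judges. The judge_response helper and its verdict format are hypothetical, not the benchmark’s actual evaluation code.

```python
from statistics import mean

# The three judge models named in the article.
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def judge_response(judge: str, request: str, document: str, response: str) -> dict:
    """Toy stand-in for a judge-model call.

    A real pipeline would prompt the named judge model and parse its verdict.
    Here, 'eligible' and 'grounded' are crude keyword checks used only so the
    sketch runs end to end; they are not the benchmark's actual criteria.
    """
    eligible = len(response.split()) > 3  # does it address the request at all?
    grounded = all(w.lower() in document.lower() for w in response.split())
    return {"eligible": eligible, "grounded": grounded}

def factuality_score(request: str, document: str, responses: list[str]) -> float:
    """Phase 1 disqualifies ineligible answers; phase 2 requires grounding.

    Each judge scores the fraction of responses that pass both phases,
    and the final score is the mean across the three judges.
    """
    per_judge = []
    for judge in JUDGES:
        verdicts = [judge_response(judge, request, document, r) for r in responses]
        passed = [v["eligible"] and v["grounded"] for v in verdicts]
        per_judge.append(sum(passed) / len(passed))
    return mean(per_judge)
```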
The researchers note that judge models are often biased toward other members of their own model family, by an increase of around 3.23%, so the combination of different judges was necessary to help ensure that responses were indeed factual.
Ultimately, the researchers emphasize that factuality and grounding are key to the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continuous research and development, will continue to improve AI systems,” they write.
However, they also concede: “We are aware that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark is just the beginning.”