DataGemma: Using Real-World Data to Tackle AI Hallucinations

Google's DataGemma models aim to reduce the "hallucination" problem of AI models by connecting them to Google's Data Commons, a large repository of real-world statistical data. The model grounds its answers in credible statistics rather than relying on training data alone, which improves accuracy.

Data Commons is a massive, ever-expanding public data platform filled with reliable information from trusted organizations around the world, such as the United Nations, the WHO, and national statistical offices. It brings together more than 240 billion data points in areas such as health, economics, demographics, and the environment. You can interact with it through an AI-driven natural language interface. For example, you can explore which countries in Africa have had the greatest increase in electricity access, look up the relationship between income and diabetes rates in U.S. counties, or investigate other topics of interest.

These data are organized so that anyone can query them in simple language and obtain reliable information.

DataGemma makes the output of Large Language Models (LLMs) more accurate by providing real statistics from trusted data sources.

How DataGemma Works

The DataGemma model makes AI-generated answers more accurate in two main ways:

RIG (Retrieval-Interleaved Generation): As the AI answers a question, it retrieves relevant real-world data from Data Commons and works it into the answer. For example, if you ask “Is the use of renewable energy increasing globally?”, DataGemma will fetch relevant data from Data Commons so that the answer is fact-based.

RAG (Retrieval-Augmented Generation): The AI looks for relevant contextual data in Data Commons before it starts answering the question. This gives the model more context to work with, lets it produce more detailed responses, and further reduces the chance of incorrect information.

1. RIG (Retrieval-Interleaved Generation)

The core of the RIG method is to check statistical claims against real data as the answer is generated. The specific process is as follows:

When users ask AI questions, DataGemma proactively checks whether the questions involve statistical data or specific factual information.

If the question contains these elements, DataGemma will first retrieve relevant and accurate data from Data Commons (Google's trusted data platform).

The AI then inserts this real data into the answer when generating it, ensuring that the answer is based on a trustworthy source.

Example: When a user asks “Is the use of renewable energy increasing globally?”, DataGemma finds the latest data on global renewable energy use in Data Commons and uses it to generate an accurate response.
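The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DataGemma's actual implementation: the `[DC: query | guess]` marker syntax, the `DUMMY_STORE` table (standing in for Data Commons), and the function names are all hypothetical, and the values are dummies rather than real statistics.

```python
import re

# Hypothetical stand-in for Data Commons: query string -> value.
# Dummy values for illustration only, not real statistics.
DUMMY_STORE = {
    "renewable share of electricity, 2010": "11.1%",
    "renewable share of electricity, 2020": "22.2%",
}

# Hypothetical marker the model is assumed to emit wherever it states
# a statistic: [DC: <query> | <model's own guess>]
MARKER = re.compile(
    r"\[DC:\s*(?P<query>[^|\]]+?)\s*\|\s*(?P<guess>[^\]]+?)\s*\]"
)


def lookup(query):
    """Return the stored value for a query, or None if nothing matches."""
    return DUMMY_STORE.get(query)


def interleave(draft):
    """Replace each marker with the retrieved value.

    If the lookup fails, keep the model's guess and flag it as unverified.
    """
    def substitute(match):
        value = lookup(match.group("query"))
        if value is None:
            return f"{match.group('guess')} (unverified)"
        return f"{value} (source: dummy store)"

    return MARKER.sub(substitute, draft)


draft = (
    "Renewables supplied "
    "[DC: renewable share of electricity, 2010 | 15%] of electricity in 2010 "
    "and [DC: renewable share of electricity, 2020 | 25%] in 2020; "
    "solar capacity grew by [DC: solar capacity growth | 300%]."
)
print(interleave(draft))
```

The first two markers are replaced by values from the store, while the third has no match and keeps the model's guess with an "unverified" flag, so the reader can tell grounded figures from ungrounded ones.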

2. RAG (Retrieval Augmented Generation)

The RAG method allows AI to not only retrieve data but also obtain more background information and context before generating answers, thereby generating more detailed and accurate answers. The workflow is as follows:

After the user asks a question, DataGemma will first retrieve background data related to the question from Data Commons to help AI understand the full picture of the question.

With a longer context window, AI can combine this background information to generate more detailed and complete answers, reducing the possibility of errors.

Example: For the same question, “Is the use of renewable energy increasing globally?”, the RAG approach not only provides the data but also generates a more comprehensive answer based on relevant context (such as energy use in different countries). The answer may include footnotes or explanations of the data source.
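The RAG workflow can be sketched in the same spirit. This is only an illustration under stated assumptions: the `RECORDS` list stands in for data retrieved from Data Commons (dummy values, not real statistics), the keyword-overlap matching in `retrieve` is a simplistic placeholder for real retrieval, and the function names and prompt wording are hypothetical.

```python
# Hypothetical records standing in for data retrieved from Data Commons.
# Dummy values for illustration only, not real statistics.
RECORDS = [
    {"topic": "renewable energy share", "place": "Country A",
     "year": 2020, "value": "22.2%"},
    {"topic": "renewable energy share", "place": "Country B",
     "year": 2020, "value": "33.3%"},
    {"topic": "diabetes rate", "place": "Country A",
     "year": 2020, "value": "4.4%"},
]


def retrieve(question, records, limit=5):
    """Keep records whose topic shares at least one word with the question."""
    words = set(question.lower().replace("?", "").split())
    hits = [r for r in records if words & set(r["topic"].split())]
    return hits[:limit]


def build_prompt(question, records):
    """Prepend retrieved records as numbered context so the answer can cite them."""
    lines = [
        f"[{i}] {r['topic']}, {r['place']}, {r['year']}: {r['value']}"
        for i, r in enumerate(records, start=1)
    ]
    context = "\n".join(lines) if lines else "(no data found)"
    return (
        "Answer using only the data below and cite entries by number.\n"
        f"Data:\n{context}\n"
        f"Question: {question}"
    )


question = "Is the use of renewable energy increasing globally?"
prompt = build_prompt(question, retrieve(question, RECORDS))
print(prompt)  # this prompt would then be passed to a long-context LLM
```

Numbering the context entries is one simple way to let the generated answer carry footnote-style references back to its data sources.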

3. How Data Commons supports these two approaches

Data Commons is the data platform behind DataGemma. It draws on globally trusted public data sources such as the United Nations, the World Health Organization (WHO), and the Centers for Disease Control and Prevention (CDC), and covers fields such as health, the economy, and the environment. Combined with DataGemma, it lets the AI query this trusted data whenever it needs to and generate more reliable answers.

In this way, DataGemma's answers no longer rely solely on training data but incorporate accurate external data, which reduces hallucinations (cases where the model generates incorrect answers). Both methods, RIG and RAG, help AI models be more accurate and reliable when answering questions that involve facts and data.

Initial results for the RIG and RAG methods are promising. Accuracy improves notably when the models handle numerical facts, which means users should encounter fewer hallucinations in applications such as research, decision making, or simply satisfying curiosity. The results are described in detail in Google's research paper.

Researchers and developers can also get started with DataGemma using the quick start notebooks, which cover both the RIG and RAG methods. To learn more about how Data Commons and Gemma work together, read the research article.

Original article: https://blog.google/technology/ai/google-datagemma-ai-llm
