With the development of large language models (LLMs) such as GPT-4 and Claude, LLMs have made significant progress on mathematical problems. However, their capabilities rely mainly on extensive training over public data, while high-quality, complex math problem datasets are becoming increasingly scarce.
Although existing models can understand and generate mathematical content to some extent, they often struggle to produce problems that are sufficiently challenging and diverse.
Researchers have therefore turned to LLMs to synthesize data and generate math problems, but the problems produced by current methods are either too simple or too similar to existing ones.
The paper "AI-Assisted Generation of Difficult Math Questions" presents a method for generating harder math questions, with humans in the loop to ensure correctness. It addresses the current lack of diversity and difficulty in math problem generation.
The paper proposes a framework that pairs LLMs with human experts to efficiently generate high-quality, harder mathematical problems, targeting the following core issues:
- Insufficient difficulty: The problems generated by LLM are usually too simple or too computationally intensive to truly test reasoning ability.
- Insufficient diversity: Existing datasets for problem generation often repeat similar problems and lack innovative and interdisciplinary combinations.
- Inefficiency of human experts: Relying entirely on human experts to write questions is slow and does not scale.
Specific Methods
The core method proposed in the paper is to combine two different mathematical skills to generate "out-of-distribution" problems, making these problems more difficult for both LLMs and humans.
- The paper proposes a new approach: using LLM's metacognition to extract "mathematical skills" from existing datasets and randomly combine skills to generate more difficult problems.
- The generated question set is named MATH², which is verified and improved through agent interaction with two cutting-edge LLMs and then validated and adjusted by human experts.
- MATH² contains 180 questions, 56% of which were used unmodified; the rest were fine-tuned or rewritten by human experts. Its difficulty is far higher than that of the MATH dataset.
- A model's success rate on MATH² is typically about the square of its success rate on MATH, hence the dataset's name (a quick numeric check follows this list).
- This approach is not only applicable to mathematics, but may also impact other fields that require structured reasoning, potentially changing the way educational content is produced in various fields.
- Future prospects: Automate more of the human verification process, expand the framework to generate high-quality data in other disciplines, and improve the learning experience for both AI and humans.
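As a quick sanity check of that squaring heuristic, here is a minimal computation using the GPT-4 figures quoted in the results section below; the relationship is approximate rather than exact.

```python
# If a model scores p on MATH, the squaring heuristic predicts roughly
# p^2 on MATH^2. GPT-4 figures are taken from the results section below.
p_math = 0.7721                # accuracy on MATH
predicted = p_math ** 2        # heuristic prediction for MATH^2
observed = 0.6685              # reported accuracy on MATH^2

print(f"predicted {predicted:.3f} vs observed {observed:.3f}")
# predicted 0.596 vs observed 0.669 -- same ballpark, not exact
```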
The main method is a framework that pairs an LLM with human experts to generate challenging math problems through multiple rounds of generation, verification, and refinement. Its key steps are skill extraction, question generation, model answering, question validation, solution generation, and re-validation. The detailed steps are as follows:
1. Skill Extraction
- Extract core math skills, such as "Algebra" and "Geometry", from existing datasets like the MATH dataset; these skills are the basis for generating new questions. The LLM uses recent metacognitive techniques to identify and classify the different math skills present in the data.
- The paper then combines two different skills to produce interdisciplinary problems, which effectively increases both the difficulty and the diversity of the questions.
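As a rough illustration of this step, here is a minimal Python sketch. The `llm` helper is a stand-in for whatever chat-completion client you use; it is an assumption, not an API from the paper.

```python
from collections import defaultdict

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call (wire up a real client)."""
    raise NotImplementedError

def extract_skills(problems: list[str]) -> dict[str, list[str]]:
    """Ask the model to name the core skill behind each problem,
    then index the problems under the labels it returns."""
    skill_index: dict[str, list[str]] = defaultdict(list)
    for problem in problems:
        label = llm(
            "In a few words, name the single core mathematical skill "
            f"needed to solve this problem:\n\n{problem}"
        ).strip().lower()
        skill_index[label].append(problem)
    return dict(skill_index)
```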
2. Skill Pair Validation
- The extracted skills are randomly paired, and the LLM first validates each pair to ensure the two skills are genuinely distinct rather than near-duplicates of each other.
- By comparing the reference skill descriptions with example questions, the model will flag inappropriate skill combinations.
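A hedged sketch of this pairing step, reusing the hypothetical `llm` helper from step 1; the YES/NO prompt is an illustrative simplification of the paper's validation.

```python
import itertools
import random

def sample_valid_pairs(skills: list[str], n_pairs: int) -> list[tuple[str, str]]:
    """Shuffle all skill pairs and keep those the model judges to be
    genuinely distinct rather than two names for the same technique."""
    candidates = list(itertools.combinations(skills, 2))
    random.shuffle(candidates)
    valid: list[tuple[str, str]] = []
    for a, b in candidates:
        verdict = llm(
            f"Are '{a}' and '{b}' genuinely different mathematical skills "
            "that could be meaningfully combined in one problem? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            valid.append((a, b))
        if len(valid) == n_pairs:
            break
    return valid
```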
3. Question Generation
- Using a validated skill pair, the LLM generates math problems that require applying both distinct skills.
- The LLM is given several interactive examples showing how humans and AI jointly refine questions, which helps it avoid producing unclear or logically incoherent problems.
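Continuing the sketch, generation might look like the following; `exemplars` stands for the interactive human-AI examples mentioned above, and the exact prompt wording is an assumption.

```python
def generate_question(skill_a: str, skill_b: str, exemplars: str) -> str:
    """Prompt the model with a validated skill pair plus worked
    interaction examples and ask for one new two-skill problem."""
    return llm(
        f"{exemplars}\n\n"
        "Write one self-contained math problem that can only be solved by "
        f"combining the skills '{skill_a}' and '{skill_b}'. It must have a "
        "single exact answer and be clearly stated, realistic, and "
        "logically consistent."
    )
```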
4. Solution Attempt
- Once a problem is generated, the LLM attempts to solve it, working in an "adversarial" way: it hunts for deficiencies or errors in the problem and completes the solution if possible.
- In this step the model is not told which skill names were used, ensuring a fair and unbiased answering attempt.
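A minimal sketch of the adversarial attempt; note that the prompt deliberately omits the skill names, mirroring the blindness described above.

```python
def attempt_solution(question: str) -> str:
    """Ask the model to solve the problem while actively looking for
    flaws; the skill names used to build it are intentionally withheld."""
    return llm(
        "Solve the following problem step by step. Before answering, check "
        "whether it is ambiguous, under-specified, or unsolvable, and say "
        f"so explicitly if it is:\n\n{question}"
    )
```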
5. Question Validation
- LLM verifies the generated questions and their answers to ensure that the questions meet the expected quality standards. Specific verification criteria include:
- Single answer: Questions should require only one clear answer.
- Exact Answers: Answers must be exact, unless the question allows approximate answers.
- Skill requirements: The questions must require two different skills to answer and must be of equal or greater difficulty than the reference questions.
- Computational feasibility: The problem should not require overly complex calculations and should be solvable in a reasonable amount of time.
- Realism and logic: Problem scenarios must be realistic and logically consistent.
- A majority voting (Maj@4) mechanism is used to ensure that the generated questions have been verified multiple times and that both the questions and answers meet the standards.
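The Maj@4 vote itself is straightforward; here is a minimal sketch, with `final_answer` as a hypothetical parser for the last line of a worked solution.

```python
from collections import Counter

def final_answer(solution: str) -> str:
    """Hypothetical parser: treat the last line of a worked solution as
    the final answer (a real pipeline would parse a boxed answer)."""
    return solution.strip().splitlines()[-1]

def majority_answer(question: str, k: int = 4) -> str | None:
    """Maj@k: sample k independent solutions and accept the most common
    answer only if it wins an outright majority."""
    answers = [final_answer(attempt_solution(question)) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count > k // 2 else None
```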
6. Final Solution and Re-validation
- For questions that pass validation, the LLM answers them again to produce the final answer; a majority-voting mechanism ensures the answer's accuracy and consistency.
- If the attempts produce different answers, the question is likely ambiguous or logically unclear, and it is discarded.
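In code, the discard rule reduces to a single check on the fresh vote (a sketch, continuing the helpers above):

```python
def finalize(question: str) -> tuple[str, str] | None:
    """Re-answer a validated question; if the fresh majority vote fails,
    treat the question as ambiguous and drop it."""
    answer = majority_answer(question)
    return (question, answer) if answer is not None else None
```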
7. Human-AI Collaboration
- Human experts review the generated questions and answers to verify the validity of the questions and optimize the difficulty and clarity of the questions. Human experts often modify the questions generated by LLM to make them more interesting or challenging.
- The intervention of human experts not only improves the quality of generated questions, but also effectively increases the diversity of questions.
8. Dataset Creation and Evaluation
- Using the above method, the paper builds a new mathematics dataset, MATH², containing questions more challenging than those in the MATH dataset.
- In comparative evaluations, MATH² significantly reduces the performance of all models, confirming the difficulty and effectiveness of the questions.
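Putting the steps together, the whole pipeline can be sketched as below. This is an illustrative skeleton over the hypothetical helpers from the earlier sketches, not the paper's actual code; the expert-review step (7) is marked as a comment where a human pass would slot in.

```python
def build_dataset(seed_problems: list[str], n_pairs: int) -> list[tuple[str, str]]:
    """End-to-end skeleton of the eight steps above."""
    skills = list(extract_skills(seed_problems))           # step 1: skill extraction
    pairs = sample_valid_pairs(skills, n_pairs)            # step 2: pair validation
    dataset: list[tuple[str, str]] = []
    for a, b in pairs:
        question = generate_question(a, b, exemplars="")   # step 3: generation
        item = finalize(question)                          # steps 4-6: solve, validate, re-vote
        if item is not None:
            # step 7: human experts would review and edit `item` here
            dataset.append(item)                           # step 8: collect the dataset
    return dataset
```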
Example
In simple terms, this method can be understood as a process of "human-computer collaboration" to generate more difficult math problems, which is divided into four simple steps:
1. Extract skills from existing math question banks
- First, we use large language models (LLMs) to find and label some “skills” from the existing math question bank. These skills are similar to small knowledge points in mathematics, such as algebra and geometry.
2. Random combination of skills generation problem
- Next, the model randomly combines these skills. For example, it may pair "algebra" with "geometry" to generate a new question. Because such combinations are rare in the model's training data, the resulting questions are particularly challenging and may even bridge different fields, much as deep mathematical theorems do.
3. The two models check each other's questions
- The two state-of-the-art models then check each other’s generated questions and answers to ensure that the questions are correct and the answers are accurate.
4. Human experts give final review
- Finally, human experts look through these generated questions, select the best ones, check the answers, and sometimes slightly modify the questions or answers to ensure they are of higher quality.
Let us explain the methodology and flow of the paper through a simple example.
Generate a math problem that requires "Algebra" and "Geometry"
- Extracting math skills
AI extracts the two skills of "algebra" and "geometry" from the existing math question bank. For example, algebra skills may be about solving equations, and geometry skills may be about calculating the area of a triangle.
- Skill Combination
AI will combine these two skills and prepare to generate a question that requires both skills. This step ensures that the question requires multiple knowledge and is not too simple.
- Generate a question
Based on these two skills, the AI generates a question like this: "A right triangle has legs of length 3 and 4; let c be the length of its hypotenuse. Solve the equation x^2 - cx + 6 = 0."
This problem combines geometry (calculating the hypotenuse) and algebra (solving an equation): the first part uses the Pythagorean theorem to find the hypotenuse, and the second part solves the resulting equation, tying the two skills together.
- AI tries to solve the problem
First, the AI uses the Pythagorean theorem to calculate that the hypotenuse has length 5. Then it solves the equation x^2 - 5x + 6 = 0 and gets two solutions: x = 2 and x = 3 (a short symbolic check of this appears after the list).
- Verify the question
AI checks the question to ensure that the logic is clear, the problem difficulty is appropriate, and each step has a unique solution. If the problem is found to be wrong, the AI will mark it and regenerate it.
- Final Answer and Revalidation
The AI will answer the question again and confirm the correctness of the answer through multiple attempts. For example, the AI will confirm that the most common answer is correct through a majority vote mechanism.
- Human experts help
Human experts review the AI-generated problem. The expert may feel that the algebra part of the problem is a bit too easy and decide to optimize it, such as changing the equation to a more complex form or adding more geometry to increase the challenge.
- Generating a new data set
Finally, such questions were added to a new math question bank. The questions in this question bank are more complex and challenging than the original data set, and are specifically designed to test AI's multi-skill reasoning ability.
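To make the toy example concrete, here is a short sympy check of the two-skill solution described above (assuming the reconstructed 3-4-5 triangle):

```python
from sympy import symbols, solve, sqrt

c = sqrt(3**2 + 4**2)                # geometry: Pythagorean theorem gives c = 5
x = symbols("x")
roots = solve(x**2 - c * x + 6, x)   # algebra: solve x^2 - 5x + 6 = 0
print(c, roots)                      # prints: 5 [2, 3]
```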
Experimental Results
The results show that LLMs perform poorly on the newly generated MATH² dataset, while the efficiency and quality of question generation improve significantly with the assistance of human experts. The framework can be applied to other fields that require structured reasoning and is expected to serve as a scalable component of AI supervision.
The paper verified the effectiveness of this method through experiments, and the results showed that the generated mathematical problems significantly increased the difficulty and performed well in improving the reasoning ability of AI models. The following is a summary of the specific effects:
1. Increased difficulty of the MATH² dataset
- The paper generates a new math problem dataset, MATH², which is more difficult than the original MATH dataset.
- In the experiments, all models (including advanced LLMs such as GPT-4 and Claude) performed worse on MATH² than on MATH, indicating that the new problems are indeed more challenging.
2. Model performance deteriorates
- Experiments show that model accuracy on MATH² drops across the board. For example, GPT-4's performance fell from 77.21% on MATH to 66.85% on MATH², because MATH² problems require combining multiple skills and are more complex to solve.
- The performance of smaller models (such as MetaMath and MAmmoTH) drops even more sharply on MATH², showing that MATH² challenges models of all sizes.
3. Success of multi-skill combinations
- MATH² problems combine different mathematical skills (such as geometry and algebra), giving them an "out-of-distribution" character: they are unlike the problems models have seen in training and are therefore harder for AI models.
- Models showed clear difficulty on these multi-skill problems, confirming that this generation method probes reasoning ability rather than mere computation.
4. Gains from human-expert collaboration
- The collaboration between human experts and AI further improves the quality and diversity of questions. Through expert optimization of the questions, the generated questions are more interesting and challenging, especially avoiding repetitive or overly simple computational problems.
- During the experiments, human experts modified and verified some of the questions, which improved the quality of the generated questions and ensured that they were equally challenging for both AI models and human testers.
5. The effect of questions as exemplars
- MATH² problems are useful not only for testing but also as "in-context exemplars". Experiments show that when MATH² problems are used as in-context examples for the MATH dataset test, models perform better than when MATH problems themselves are used as examples.
6. Performance of open-source models
- Open-source models (such as MetaMath, Llama, etc.) perform relatively poorly on the MATH² dataset, further underscoring the value of such problems for improving reasoning ability: training on harder problems can push models toward deeper reasoning.
- Author: KCGOD
- URL: https://kcgod.com/Generate-Complex-Math-Problems-with-AI
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source!