The Magic team announced the development of an ultra-long-context AI model, LTM, capable of processing and utilizing up to 100 million tokens of context during inference, far beyond the capabilities of traditional AI models.
The LTM model overcomes the limitations of traditional models in terms of context length and fuzzy memory by expanding the context processing capabilities of AI models, enabling AI to more accurately and comprehensively utilize a large amount of relevant information when processing complex tasks. This not only improves the practicality of AI models, but also brings significant performance improvements to fields such as software development.
The Magic team also announced that they recently received $465 million in financing, including $320 million in investment from new investors such as Eric Schmidt, Jane Street, Sequoia and Atlassian. These funds will be used to accelerate their research and development in the field of ultra-long context models.
Problems that LTM Solves
Limits on context length
- Limitations of traditional models: Most traditional AI models can only process relatively short contexts during inference. When the model needs to understand or generate complex content, it can use only limited contextual information and tends to ignore early content, producing inaccurate or incomplete results.
- LTM breakthrough: The LTM model solves this problem by being able to handle very long contexts. It can incorporate a large amount of related information (such as the entire code base, documents, historical conversations, etc.) into the context, thereby providing more accurate results when generating and understanding complex content.
Fuzzy memory dependencies
- Problems with traditional models: Traditional AI models often depend on fuzzy memory: at inference time they draw on approximate patterns learned during training rather than precise information, which performs poorly on complex and diverse tasks.
- Improvements in LTM: The LTM model is designed to effectively store and retrieve large-scale contextual information during reasoning, which reduces reliance on fuzzy memory and thus improves the accuracy and reliability of reasoning.
Complex reasoning tasks
- Shortcomings of traditional evaluation methods: Many long-context evaluation methods cannot effectively test the reasoning ability of the model in practical applications, because they may provide some implicit hints to the model, reducing the difficulty of testing.
- Innovative evaluation method for LTM: To solve this problem, the Magic team designed the HashHop evaluation method, which avoids semantic cues and allows the LTM model's reasoning and information retrieval capabilities in real scenarios to be tested more rigorously. This also enables the LTM model to perform better in dealing with practical tasks such as code synthesis and complex decision-making.
LTM-2-mini model
For the first time, the Magic team trained a model that can handle 100 million token contexts - LTM-2-mini. For each decoded token, this model's sequence-dimension algorithm is about 1,000 times cheaper than the attention mechanism of Llama 3.1 405B, and its memory requirements are also dramatically reduced: compared with Llama 3.1 405B, the LTM model needs only a small fraction of the memory to store a context of the same size.
The LTM-2-mini model demonstrated excellent performance in handling multi-step reasoning tasks through chain of thought training. For example, the model was able to achieve up to 100% accuracy in complex hash chain reasoning tasks, indicating that it can build more complex reasoning circuits than single-step induction heads when handling more complex reasoning tasks.
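The hash chain task described above can be illustrated with a toy sketch: given pairs like `h1 → h2`, `h2 → h3`, the model must follow the chain from `h1` to the end. The Python below is an illustrative reconstruction, not Magic's actual setup; chain length and hash truncation are arbitrary choices:

```python
import hashlib
import random

def make_hash_chain(length, seed=0):
    """Build a chain h1 -> h2 -> ... by repeatedly hashing, starting
    from a random seed value. Returns the ordered list of digests."""
    rng = random.Random(seed)
    h = hashlib.sha256(rng.randbytes(16)).hexdigest()[:8]
    chain = [h]
    for _ in range(length - 1):
        h = hashlib.sha256(h.encode()).hexdigest()[:8]
        chain.append(h)
    return chain

def follow_chain(pairs, start, hops):
    """Resolve `hops` steps from `start` via a lookup over hash pairs,
    mimicking the intermediate hops a chain-of-thought answer writes out."""
    lookup = dict(pairs)
    steps = [start]
    for _ in range(hops):
        steps.append(lookup[steps[-1]])
    return steps  # every intermediate hop is made explicit

chain = make_hash_chain(6, seed=42)
pairs = list(zip(chain, chain[1:]))
random.Random(1).shuffle(pairs)  # shuffled order carries no positional cue
print(follow_chain(pairs, chain[0], hops=5)[-1] == chain[-1])  # prints True
```

Because the hashes are random, each hop forces an exact lookup rather than pattern matching, which is what makes 100% accuracy on such chains a meaningful signal.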
1. Ultra-long context processing capability
- 100 million token context: LTM-2-mini can process up to 100 million tokens of context information, equivalent to about 10 million lines of code or the contents of 750 novels. This is far beyond the processing capacity of traditional models and allows the model to fully utilize a large amount of relevant information in complex tasks.
2. Efficient computing and storage mechanism
- Lower computational cost: LTM-2-mini’s sequence-dimension algorithm is about 1,000 times more computationally efficient than Llama 3.1 405B’s attention mechanism when processing a 100 million token context. This means it can process large-scale context data more quickly during inference.
- Extremely low memory requirements: Compared to traditional models, LTM-2-mini has dramatically lower memory requirements. Whereas Llama 3.1 405B would require 638 H100 GPUs just to store a 100-million-token context, LTM-2-mini needs only a small fraction of a single H100 GPU’s high-bandwidth memory (HBM) for the same task.
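The GPU count quoted above can be sanity-checked with back-of-the-envelope KV-cache arithmetic. This is a rough sketch, assuming Llama 3.1 405B's published architecture (126 layers, 8 grouped-query KV heads of dimension 128) and an fp16 cache on 80 GB H100s; all of these numbers are assumptions, and the exact figure depends on them:

```python
# Rough KV-cache sizing for a 100M-token context on Llama 3.1 405B.
# Model config values below are assumptions from the published architecture.
layers = 126          # transformer layers
kv_heads = 8          # grouped-query attention KV heads
head_dim = 128        # dimension per head
bytes_per_val = 2     # fp16

# Factor of 2 covers both the K and the V cache.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
tokens = 100_000_000
total_tb = bytes_per_token * tokens / 1e12

hbm_per_h100_gb = 80
gpus_needed = bytes_per_token * tokens / (hbm_per_h100_gb * 1e9)

print(f"{bytes_per_token} bytes/token, ~{total_tb:.1f} TB, ~{gpus_needed:.0f} H100s")
# prints: 516096 bytes/token, ~51.6 TB, ~645 H100s
```

This lands in the same ballpark as the 638 H100s cited; the small gap plausibly comes from rounding conventions or per-GPU overhead in the original estimate.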
3. Improved reasoning and storage capabilities
- Chain of Thought Training: LTM-2-mini is trained using the Chain of Thought method, which allows it to perform well in multi-step reasoning tasks. The model is able to achieve up to 100% accuracy in complex hash chain reasoning tasks, indicating that it can build more complex reasoning circuits than single-step reasoning.
- Strong information retrieval capability: In the HashHop evaluation method, LTM-2-mini is able to accurately store and retrieve key information in large-scale contexts, which makes it perform well in tasks that require long-term memory and multi-step reasoning.
4. Flexibility and application potential
- Adaptability to complex tasks: LTM-2-mini demonstrates its potential in processing complex code bases and documents, and is able to complete code synthesis and editing tasks without human intervention. This flexibility gives it broad applicability in software development and other fields that require large amounts of contextual information.
HashHop Evaluation Method of LTM
In order to more accurately evaluate the performance of the model in ultra-long contexts, the Magic team designed the HashHop evaluation method. Unlike traditional evaluation methods, HashHop eliminates semantic cues and tests the storage and retrieval capabilities of the model through random and incompressible hash pairs. This method ensures that the performance of the model when handling real tasks is evaluated more rigorously and objectively.
In traditional long-context evaluation methods, such as "Needle in a Haystack", the model needs to retrieve specific information from a long context. These evaluation methods have some significant flaws:
- Dependence on semantic cues: Existing methods often provide significant semantic cues, allowing the model to complete the task by identifying abnormal information ("needles") without actually processing the entire context. This reduces the difficulty of evaluation and cannot accurately reflect the performance of the model in actual tasks.
- Weakened storage and retrieval capabilities: Due to the existence of semantic cues, the storage and retrieval capabilities of the model are greatly reduced, and the evaluation results may not truly reflect the model's ability to handle complex contexts.
Design of HashHop Method of LTM
To overcome the above problems, the Magic team designed the HashHop evaluation method, the core idea of which is to eliminate semantic hints by using randomly generated hash pairs, thereby conducting a more rigorous test of the actual storage and retrieval capabilities of the model.
- Use of hash pairs
- Randomness and incompressibility: HashHop uses completely random and incompressible hash pairs as evaluation data. This means that when the model is faced with hash pairs, it cannot rely on any pre-learned patterns or semantic cues and must rely on its actual storage and retrieval capabilities.
- Chain reasoning tasks: The model not only needs to complete the storage and retrieval tasks of a single hash pair, but also needs to handle a series of hash pair chain reasoning tasks. For example, given multiple consecutive hash pairs, the model is required to infer the final result in a multi-step chain. This simulates the situation in actual tasks that require multi-step reasoning and complex logical associations.
- Multi-hop reasoning ability test
- Spanning multiple context points: In HashHop evaluation, the model needs to be able to span multiple context points, reasoning and information jumping across the entire context space. This is similar to variable assignment or library import in code logic, requiring the model to maintain coherent logical reasoning capabilities between different points.
- Increase the difficulty of evaluation: By requiring the model to complete multi-step reasoning tasks, such as reasoning directly from the starting point of the chain to the end point (jumping multiple steps), HashHop tests the model's higher-order reasoning ability and complex logic processing ability.
- Avoid semantic hints
- Eliminate implicit hints: HashHop completely eliminates semantic hints in the evaluation process by using random hash pairs, so that the model must rely on its actual storage and retrieval capabilities to complete the task. This makes the evaluation results more realistically reflect the model's ability to handle complex tasks.
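The design above can be sketched as a toy evaluation harness. This is an illustrative reconstruction, not Magic's actual HashHop code; the pair counts, hash length, and prompt format are invented for the example:

```python
import hashlib
import random

def make_hashhop_prompt(num_chains, hops, seed=0):
    """Build chains of random hash pairs, shuffle every pair into one
    'context', and ask for a direct jump from each chain's start to its end."""
    rng = random.Random(seed)

    def rand_hash():
        # Incompressible token: hash of random bytes, so no semantic cue survives.
        return hashlib.sha256(rng.randbytes(16)).hexdigest()[:12]

    pairs, queries = [], []
    for _ in range(num_chains):
        chain = [rand_hash() for _ in range(hops + 1)]
        pairs += list(zip(chain, chain[1:]))
        # The query skips the intermediate hops: start -> final, in one jump.
        queries.append((chain[0], chain[-1]))
    rng.shuffle(pairs)  # shuffled order removes positional cues as well
    context = "\n".join(f"{a} = {b}" for a, b in pairs)
    return context, queries

def oracle_answer(context, start, hops):
    """Ground-truth resolver: what perfect storage and retrieval requires --
    hopping through the shuffled pairs one assignment at a time."""
    lookup = dict(line.split(" = ") for line in context.splitlines())
    h = start
    for _ in range(hops):
        h = lookup[h]
    return h

context, queries = make_hashhop_prompt(num_chains=3, hops=4, seed=7)
for start, expected in queries:
    assert oracle_answer(context, start, hops=4) == expected
```

A model is then scored on whether it can reproduce `expected` from `start` given only the shuffled context, exactly the variable-chasing pattern (assignments, imports) that the text compares to code logic.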
Significance of HashHop Evaluation Method
- Strict model testing: By eliminating semantic cues and introducing multi-step chain reasoning tasks, HashHop provides more rigorous testing standards, making the evaluation results more representative and accurate.
- Real-world task simulation: The HashHop method simulates complex reasoning tasks in the real world, such as code analysis and logical reasoning, so that the evaluation results of the model can better reflect its performance in actual applications.
- Promoting model innovation: Through more stringent evaluation standards, HashHop encourages developers to pursue higher storage and reasoning capabilities in model design and training, thereby promoting technological progress of AI models.
Showcase of LTM
GUI frame creation in context:
- Model's real-time learning ability: Magic's LTM-2 model demonstrated a powerful ability to learn in real time. By creating a calculator with a custom GUI framework supplied in context, the model showed it could complete complex tasks without relying on a well-known framework such as React. The task was made more challenging because the model relied only on the existing code base and chat prompts, without access to files, edit history, or other clues.
Simple UI changes:
- Autonomous editing of complex code bases: The model was also able to implement a password strength meter for the open-source project Documenso without human intervention, a feature commonly found in web applications. Although the problem description was relatively specific, the fact that LTM-2, a model much smaller than today's state-of-the-art models, completed this complex task autonomously shows strong adaptability and strong code understanding and editing capabilities.
Ongoing Training Work of LTM
- Training a larger LTM-2 model: The Magic team mentioned that they are using their new supercomputers to train a larger LTM-2 model, which shows their commitment to further improving the model's capabilities to handle more complex and larger-scale tasks.
- Building new supercomputers: Magic is working with Google Cloud and NVIDIA to build two new supercomputers on Google Cloud, Magic-G4 and Magic-G5.
- Magic-G4: Powered by NVIDIA H100 Tensor Core GPUs.
- Magic-G5: Powered by NVIDIA GB200 NVL72, it has scalability and can be expanded to tens of thousands of Blackwell GPUs in the future.
- New round of funding: Magic has raised a total of $465 million in new funding, with $320 million coming from new investors including Eric Schmidt (former Google CEO), Jane Street, Sequoia Capital, and Atlassian, as well as existing investors such as Nat Friedman, Daniel Gross, Elad Gil, and CapitalG.
Information source: https://magic.dev/blog/100m-token-context-windows
- Author: KCGOD
- URL: https://kcgod.com/ltm-100-million-token-of-context
- Copyright: Unless otherwise stated, all articles on this blog are licensed under a BY-NC-SA agreement. Please credit the source!