Cosine has released Genie, an AI model designed specifically for software engineering. Genie achieved record-high scores on benchmarks such as SWE-Bench and SWE-Lite.
Unlike other AI tools that simply prompt off-the-shelf base models, Genie is trained on proprietary datasets that capture the human reasoning process: the complete spectrum of information, incremental knowledge discovery, and step-by-step decision-making in the actual work of software engineers.
Because these datasets simulate the cognitive processes of human engineers, Genie learns to reason like one, enabling it to solve complex, context-rich problems with high accuracy.
Genie can autonomously resolve bugs, build features, refactor code, and perform a wide range of software engineering tasks. It integrates seamlessly with the GitHub issue tracker: it can import issues, fully understand the task, and generate detailed work specifications. Genie's training focuses on encoding human reasoning, allowing it to approach problems like a person rather than generating code blindly.
- Genie can autonomously handle a variety of programming tasks, including fixing bugs, creating features, and refactoring code.
- The model works similarly to human software engineers, leveraging datasets based on real developer activities and decisions.
- Unlike other models that generate code blindly, Genie uses a process that simulates human reasoning.
Genie achieved a score of 30.08% on SWE-Bench, the industry-standard benchmark for evaluating the software engineering skills of AI models. This is a 57% improvement over the previous best score of 19%, held by Amazon's Q and Factory's Code Droid (for comparison, OpenAI's GPT-4 scored 1.31%). It is the highest score achieved by any company to date and the largest single SOTA improvement in the history of the benchmark.
Main Features of Genie
Efficient problem solving
Genie can decompose problems, find the relevant code, debug it, and implement solutions just like a human engineer, demonstrating strong logical reasoning ability.
Genie also has powerful refactoring capabilities: it can optimize existing code to improve its efficiency and maintainability.
Precise File Identification: Genie’s ability to pinpoint the files needed for any task is a real game changer. It scans your project files with incredible accuracy, identifying the most relevant information to provide context for the issue at hand. By quickly narrowing down the necessary files and documents, Genie simplifies the initial stages of problem solving. This precise file identification provides a solid foundation for developing effective and efficient solutions.
Seamless integration with GitHub
Genie is tightly integrated with the GitHub issue tracker: it can automatically import issues from GitHub, fully understand the task requirements, and generate detailed work specifications. This reduces manual input and makes task handling more efficient.
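To make the integration concrete, here is a minimal sketch of how an agent could import an issue and turn it into a task specification using GitHub's public REST API. This is our own illustration, not Cosine's integration code; the owner, repo, and issue number are placeholders.

```python
import requests

def fetch_issue(owner: str, repo: str, number: int, token: str) -> dict:
    """Fetch a GitHub issue via the REST API (GET /repos/{owner}/{repo}/issues/{number})."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{number}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def issue_to_spec(issue: dict) -> dict:
    """Reduce an issue to a minimal work specification an agent could plan against."""
    return {
        "title": issue["title"],
        "body": issue.get("body") or "",
        "labels": [label["name"] for label in issue.get("labels", [])],
        "url": issue["html_url"],
    }
```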
Self-learning and improvement
Genie’s training dataset is not just based on simple model prompts, but deeply encodes human reasoning capabilities. This allows Genie to think more like a human when solving problems, rather than randomly generating code.
Genie can autonomously resolve errors (bugs) in the software and is able to fix problems completely independently without human intervention.
Through training and self-generated synthetic data, Genie continuously improves its problem-solving abilities. It learns from its mistakes and makes fewer of them in subsequent versions.
Handling complex tasks
Genie is able to handle complex, never-before-seen tasks and iterate and test in a similar way to human engineers, ensuring the accuracy and usefulness of the output.
Multi-language support
Genie supports multiple programming languages, including JavaScript, Python, TypeScript, etc., enabling it to adapt to different development environments and needs.
Scalability and Customization
Cosine plans to fine-tune Genie so that it can be tailored to specific code bases and even handle older or less commonly used programming languages. This capability will allow Genie to deeply understand large legacy code bases and provide efficient support for them.
Architecture Design of Genie
Context Window Model
Genie was initially trained on models with short context windows (in the 16-32k token range). While these models showed promise in early exploration, they were limited in how much information they could represent at once and could not fully demonstrate Genie's capabilities.
To overcome this limitation, the Cosine team ultimately turned to training the model with a larger context window, which enabled Genie to handle more complex and larger-scale data, thereby improving the overall performance of the model.
Data compression and chunking
In the early architecture design, the team tried various methods of data compression and chunking to represent more information in a short context window. However, these methods were eventually replaced by the use of larger context models, which laid the foundation for Genie's performance improvement.
Modular Design
Genie's reasoning process is divided into four main parts: planning, information retrieval, code writing, and code running. Although these steps also exist in other tools, what makes Genie different is that it can complete each step like a human, thereby achieving higher performance.
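The loop itself can be pictured with a short sketch. The helpers passed in below (planner, retriever, writer, runner) are hypothetical stand-ins; Cosine has not published Genie's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RunResult:
    passed: bool
    errors: str = ""

def agent_loop(
    task: str,
    plan: Callable[[str], str],             # planning: decompose the task into steps
    retrieve: Callable[[str], str],         # retrieval: find the relevant code for the plan
    write_code: Callable[[str, str], str],  # code writing: propose a patch from plan + context
    run: Callable[[str], RunResult],        # code running: apply the patch and run the tests
    max_iterations: int = 5,
) -> Optional[str]:
    """Iterate plan -> retrieve -> write -> run, reacting to failures like an engineer."""
    current_plan = plan(task)
    for _ in range(max_iterations):
        context = retrieve(current_plan)
        patch = write_code(current_plan, context)
        result = run(patch)
        if result.passed:
            return patch
        # Feed the observed failure back into planning rather than retrying blindly.
        current_plan = plan(task + "\nPrevious attempt failed with:\n" + result.errors)
    return None
```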
Training Methods of Genie
Proprietary dataset training
Cosine has created a dataset based on real developer activity. It includes not only developers' work artifacts (commits, PRs, issues, and so on) but also reconstructs the implicit reasoning and decision-making paths developers follow while solving problems, using techniques such as static analysis, AI models, and self-verification.
Genie's training datasets are specially designed to simulate the cognitive processes, logic, and workflows of human engineers. They capture the complete spectrum of information, incremental knowledge discovery, and step-by-step decision-making, ensuring that Genie can handle complex programming tasks. By analyzing and labeling these reasoning processes, Cosine enables Genie to simulate the way human engineers think, rather than just generating code.
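As a purely illustrative picture of what one such trajectory-style training example might contain (Cosine has not published its data format), consider:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningStep:
    """One reconstructed step of the engineer's reasoning."""
    observation: str  # what was visible at this point (incremental knowledge discovery)
    thought: str      # the inferred reasoning behind the next action
    action: str       # e.g. "open file X", "run the test suite", "edit function Y"

@dataclass
class TrainingExample:
    """A whole task trajectory, from problem statement to the final verified change."""
    issue: str                                                # e.g. an imported GitHub issue
    steps: List[ReasoningStep] = field(default_factory=list)  # step-by-step decision-making
    final_diff: str = ""                                      # the eventual working change
```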
Self-improvement mechanism
In the initial version of Genie, it was mainly exposed to data in a "perfect" state, that is, most of the time the code was already in a releasable state. As a result, Genie initially performed poorly when it came to dealing with errors. To address this, the team generated synthetic data through Genie and injected it into the training set of the next version.
Each version of Genie improves upon the previous one. By providing feedback and corrections to model outputs, Cosine enables Genie to learn and avoid past mistakes, continually improving its performance and reliability.
This approach allows Genie to gradually learn how to recover when it finds itself in an error state. As the process is repeated, Genie's initial candidate solutions become stronger and more robust, making fewer errors and ultimately improving overall performance.
Large-scale data training
In the latest training run, Genie was trained on billions of tokens of data, with the mixture of datasets chosen so the model could handle the programming languages current users care about most. This large-scale training enables Genie to perform well across a wide range of languages and task types.
Cosine believes that data quality is the key to successful training. During training, they conducted a large number of experiments on language, task type, task length, and other aspects to ensure that the final data set can provide stable and high-quality training data.
The Philosophy of an AI Colleague
Genie is designed to be an “agent” model that can make the most logical response to a situation it sees. To achieve this, the team developed training data that represents this logic, allowing Genie to discover and exploit implicit information in code, just like a human developer, rather than relying on simple prompts.
Performance of Genie
Genie demonstrates its superior performance in multiple key benchmarks, especially in the field of software engineering.
SWE-Bench Test
Genie achieved a score of 30.08% on the SWE-Bench test, the best result in the industry to date and a 57% improvement over the previous best. For comparison, Amazon's Q and Factory's Code Droid scored 19%, while OpenAI's GPT-4 scored only 1.31% on the same test. This is the highest score achieved by any company and the largest single SOTA improvement in the history of the benchmark.
SWE-Lite Test
In the SWE-Lite test, Genie also performed well, achieving a score of 50.67%, further demonstrating its ability to handle complex programming tasks.
Information retrieval ability
Genie can effectively retrieve the lines of code needed to solve a problem. In testing, it successfully retrieved 91,475 of the 142,338 lines of code it needed to find, a score of 64.27%. Although there is still room for improvement here, Genie performs well in problem decomposition and code debugging.
Comparison with other models
Genie outperforms other AI models, especially in dealing with complex problems and reasoning tasks. Unlike other models, Genie achieves superior results by not simply prompting the underlying model, but by specifically training it to mimic the logical thinking and decision-making process of human engineers.
Technical Report of Genie
Key Takeaways
Genie is the world's most advanced software engineering model, achieving the highest score of 30.08% in the SWE-Bench evaluation and 50.67% in SWE-Lite.
Genie is trained on proprietary data that captures the human reasoning process: the complete spectrum of information, incremental knowledge discovery, and step-by-step decision-making in the real work of software engineers. As a result, Genie learns to reason logically like a human engineer, which allows it to outperform approaches that simply prompt or oversample off-the-shelf large language models.
By actually training Genie, rather than simply giving prompts to a base model (which is what other AI tools do), we found that Genie was able to respond to diverse, highly contextual, and never-before-seen questions in the same way that humans do.
Introducing Genie
Genie is the world's most powerful software engineering model, as evaluated by SWE-Bench, and is Cosine's latest innovation in the field of AI-driven development. It is designed to simulate the cognitive processes of human engineers, enabling it to solve complex problems with unprecedented accuracy and efficiency.
Genie is the world's first AI software engineering colleague, trained on data that perfectly simulates the cognitive processes, logic, and workflow of human engineers. Our proprietary technology generates data that covers a complete spectrum of information, incremental knowledge discovery, and step-by-step decision-making. This allows Genie to break through the limitations of other AI software tools that are just basic models with some additional tools (such as web browsers or code interpreters). Genie is able to solve problems never seen before and iterate and test its outputs in the logical way of human engineers.
Genie is the world's strongest software engineering AI: it achieved 30.08% on SWE-Bench, the industry standard for evaluating the software engineering skills of AI models. This represents a 57% improvement over the previous best score of 19%, held by Amazon's Q and Factory's Code Droid (for comparison, OpenAI's GPT-4 scored 1.31%). This is the highest score achieved by any company to date, and the largest single SOTA improvement in the history of the benchmark. As part of the latest release, we have observed that Genie's enhanced reasoning and planning capabilities generalize well beyond the software engineering domain, and we are committed to rigorous and careful red-team testing.
Evaluation
During development, we used two core benchmarks to evaluate the model: SWE-Bench and HumanEval. The former is the best test of a model's ability to solve software engineering problems, covering problem decomposition, finding the relevant code, evaluating it, and implementing a working solution. The latter focuses more narrowly on writing code: it involves no retrieval and places less emphasis on problem understanding.
We also benchmarked the model's information-retrieval capabilities, specifically its ability to retrieve the correct portions of the code files that needed to be changed. This is one of the core components of an AI engineer: if the model can't reliably find the code that needs to be edited, its code-editing capabilities will be impaired. We measure this very simply, by counting how many of the lines the model needed to find in order to complete the task it actually found. Genie successfully retrieved 91,475 of the 142,338 lines, for a score of 64.27%. There is clearly room for improvement here; it is an aspect of our problem-solving pipeline we have focused on less, on the assumption that the code that is found is indeed correct.
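The metric described here is effectively line-level recall. A minimal way to compute it (our own illustration, not Cosine's evaluation harness):

```python
def retrieval_recall(found_lines: set, needed_lines: set) -> float:
    """Fraction of the lines that had to change which the model actually located."""
    if not needed_lines:
        return 1.0
    return len(found_lines & needed_lines) / len(needed_lines)

# Aggregated over the benchmark: 91,475 of the 142,338 needed lines were found.
print(f"{91_475 / 142_338:.2%}")  # 64.27%
```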
Architecture
When we started building Genie, we were only able to fine-tune models with relatively short context windows, in the 16-32k range. We did a lot of early exploration on these models, training them on large datasets of 100M+ tokens, and quickly realized that the architecture we came up with had its advantages but placed fundamental limits on how much information the model could represent at any one time. After trying multiple compression and chunking methods, we concluded that the only solution was to use larger context models, even though none were available for training at the time. Fortunately, we soon gained the ability to train long-context OpenAI models, which gave us the opportunity to really understand Genie's potential.
In its most recent training run, Genie was trained on billions of tokens of data, with a mixture designed to make the model as competent as possible in the languages current users care about most. This data mixture is one of the biggest areas we will be expanding: ideally we want it to be as close as possible to the real distribution of programming languages on the Internet, rather than a purely subjective choice.
| Language | Data mixture percentage |
| --- | --- |
| JavaScript | 21% |
| Python | 21% |
| TypeScript | 14% |
| TSX | 14% |
| Java | 3% |
| C# | 3% |
| C++ | 3% |
| C | 3% |
| Rust | 3% |
| Scala | 3% |
| Kotlin | 3% |
| Swift | 3% |
| Golang | 3% |
| PHP | 3% |
| Ruby | 3% |
| Example Type | Data mixture percentage |
| --- | --- |
| Feature Development | 25% |
| Bug Fixes | 20% |
| Refactoring | 15% |
| Minor changes and chores | 15% |
| Test Writing | 15% |
| Documentation writing and updating | 10% |
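As a sketch of how a weighted mixture like the two tables above could be sampled during dataset construction (the weights come from the tables; the sampling code itself is our illustration):

```python
import random

LANGUAGE_MIX = {
    "JavaScript": 21, "Python": 21, "TypeScript": 14, "TSX": 14,
    "Java": 3, "C#": 3, "C++": 3, "C": 3, "Rust": 3, "Scala": 3,
    "Kotlin": 3, "Swift": 3, "Golang": 3, "PHP": 3, "Ruby": 3,
}
TASK_MIX = {
    "Feature Development": 25, "Bug Fixes": 20, "Refactoring": 15,
    "Minor changes and chores": 15, "Test Writing": 15,
    "Documentation writing and updating": 10,
}

def sample_bucket(mix: dict) -> str:
    """Draw a language or task type in proportion to its mixture weight."""
    return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

# e.g. pick the (language, task type) bucket for the next training example
print(sample_bucket(LANGUAGE_MIX), "/", sample_bucket(TASK_MIX))
```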
Genie was designed from the ground up to be an "agent," although that term wasn't really established in the industry when we first proposed the idea in late 2022. Fundamentally, we want Genie to react to what it sees and process it in the most logical way possible, and we need a dataset that represents this. One of the biggest challenges to overcome is determining the prior information needed to perform a task in an unknown codebase: it is rare that you can modify a single file without understanding how the project works, so for each task we train on, we first have to show the model the process of finding this prior information, so that it doesn't generate code out of thin air and instead produces solutions that fit the way the codebase is organized and operates. This is just the tip of the iceberg of the work we're doing to make the implicit information in a developer's mind as explicit as possible, but we're very focused on it and have already taken steps toward building the next version of this pipeline after evaluating the current Genie model.
In terms of Genie's reasoning, we wanted to keep things as simple as possible, with the agent loop consisting of four main processes: planning, retrieval, code writing, and code running. These are not new; most such tools use some or all of them. However, because Genie is trained to perform each task like a human rather than like a base large language model, we are able to extract more performance from the model.
One of the most notable performance gains came from our use of self-improvement in model training. Much of the data we trained on was in a “perfect” state, since the vast majority of code released by humans is in a working state before release. This meant that initially Genie had never actually seen an error, and was poor at detecting its own mistakes. Fortunately, after training the first version of Genie, we were able to use it to generate synthetic data that we injected into the dataset for the next version of the model — since we had the final state of the task from the training dataset, we could use the previous version of Genie to propose a solution, and then if it was wrong, we could use the final state to show how to get from the wrong state to the correct state. Each time we repeated this process, Genie’s initial candidate solutions became stronger, correct in many cases, and in the cases where they were wrong, the amount of corrections in the dataset that needed to be shown to the model was greatly reduced.
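In pseudocode, one round of this self-improvement process might look like the sketch below. The `propose` method and task fields are hypothetical; only the overall loop (attempt, compare with the known-good final state, keep the correction) follows the description above.

```python
import difflib

def compute_diff(wrong: str, correct: str) -> str:
    """Unified diff showing how to get from the wrong state to the correct one."""
    return "".join(difflib.unified_diff(
        wrong.splitlines(keepends=True),
        correct.splitlines(keepends=True),
        fromfile="wrong", tofile="correct",
    ))

def self_improvement_round(model, tasks) -> list:
    """Generate synthetic 'recover from an error state' examples for the next version."""
    synthetic = []
    for task in tasks:
        proposal = model.propose(task.issue, task.codebase)  # current Genie's attempt
        if proposal == task.final_state:
            continue  # already correct: nothing new to learn here
        synthetic.append({
            "issue": task.issue,
            "wrong_state": proposal,
            # the known final state shows how to move from wrong to correct
            "correction": compute_diff(proposal, task.final_state),
        })
    return synthetic  # injected into the next version's training set
```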
Future Outlook
We are continuing our efforts to revolutionize the technology team with Genie. Our main focus is to balance the delivery of real products that solve user problems with cutting-edge research that drives our progress. Although Genie already performs very well, we know there is untapped potential, and we are committed to improving the dataset to enhance Genie's capabilities. By broadening the data and introducing new features, Genie will become more proficient, covering more programming languages and the latest frameworks, directly meeting the needs of developers at work.
We are expanding our model portfolio to include smaller models for simple tasks and larger models for complex challenges, leveraging our unique dataset. This allows us to convert any state-of-the-art base model into a Genie model. Our plans include contextual extensions of open source models and pre-training of base models on our massive dataset, aiming to improve generalization and tuning of specialized data. One of the most exciting developments is fine-tuning Genie on a specific codebase, an enterprise feature that enables Genie to achieve perfect understanding even on large legacy codebases written in less popular or proprietary languages. As we continue to improve Genie, we will continuously release updates to customers, optimize interactions with this artificial colleague and collect valuable feedback. Our journey to encode human reasoning processes into any job begins with software engineering, and we can’t wait to show our progress.
- Author: KCGOD
- URL: https://kcgod.com/genie
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please cite the source!