San Francisco AI startup Cosine has released its latest AI model, Genie, which is designed for software developers and scores well above its competitors in benchmark tests. Genie is built on GPT-4o variants trained in partnership with OpenAI and, through its ability to encode human reasoning, can complete a variety of programming tasks autonomously or collaboratively, including fixing bugs, developing new features, and refactoring code. Genie's success also rests on Cosine's data-curation method and its use of a self-improvement mechanism, which ultimately produced a leading 30% score on the SWE-Bench test.
San Francisco-based AI startup Cosine has launched a new AI model called Genie designed to assist software developers. According to the company, Genie far outperformed competing models in benchmark tests.
Cosine partnered with OpenAI to train a GPT-4o variant on high-quality data, achieving impressive benchmark results. The company says the key to Genie's success is its ability to "encode human reasoning," an approach it believes may extend beyond software development.
Genie takes the lead on SWE-Bench
Cosine co-founder and CEO Alistair Pullen revealed that Genie achieved a score of 30% on the SWE-Bench test, the highest score reported so far for an AI model in the field. This surpasses other coding-focused language models, such as Amazon's model (19%) and Cognition's Devin (13.8% on a subset of SWE-Bench tests).
Genie's architecture is designed to simulate the cognitive processes of human developers, enabling it to fix bugs, develop new features, refactor code, and perform a variety of programming tasks autonomously or collaboratively.
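Cosine has not published Genie's internal architecture, but autonomous coding agents of this kind are commonly structured as a plan-act-verify loop: propose a change, apply it, run the tests, and feed any failures back into the next attempt. The sketch below is purely illustrative; the helpers `propose_patch`, `apply_patch`, and `run_tests` are hypothetical placeholders, not Cosine's API.

```python
# Illustrative plan-act-verify loop for an autonomous coding agent.
# All helpers are hypothetical placeholders; this is NOT Cosine's actual design.
from dataclasses import dataclass

@dataclass
class TestReport:
    passed: bool
    log: str

def propose_patch(issue: str, repo_context: str, feedback: str) -> str:
    """Ask the underlying model for a candidate diff (placeholder)."""
    raise NotImplementedError

def apply_patch(patch: str) -> None:
    """Apply the candidate diff to a working copy (placeholder)."""
    raise NotImplementedError

def run_tests() -> TestReport:
    """Run the project's test suite and collect the output (placeholder)."""
    raise NotImplementedError

def fix_issue(issue: str, repo_context: str, max_attempts: int = 5) -> str | None:
    """Iteratively propose, apply, and verify patches until the tests pass."""
    feedback = ""
    for _ in range(max_attempts):
        patch = propose_patch(issue, repo_context, feedback)
        apply_patch(patch)
        report = run_tests()
        if report.passed:
            return patch          # a verified fix
        feedback = report.log     # feed the failure back into the next attempt
    return None                   # give up after max_attempts
```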
Self-improvement through synthetic data
Genie was developed using a proprietary process that trained and fine-tuned non-public GPT-4o variants on billions of tokens of high-quality data. Cosine spent nearly a year collating this data with the help of experienced developers. The dataset consists of 21% JavaScript and Python, 14% TypeScript and TSX, and 3% other languages (including Java, C++, and Ruby).
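To give a sense of how such a language breakdown might be measured, the sketch below tallies source files in a corpus by extension and weights them by size. The extension-to-language mapping and the directory path are assumptions for illustration; Cosine's actual curation tooling is not public.

```python
# Illustrative only: estimate a corpus's language mix by file extension.
# The mapping and paths are assumptions; Cosine's curation pipeline is not public.
from collections import Counter
from pathlib import Path

EXT_TO_LANG = {
    ".js": "JavaScript", ".py": "Python",
    ".ts": "TypeScript", ".tsx": "TSX",
    ".java": "Java", ".cpp": "C++", ".rb": "Ruby",
}

def language_share(corpus_root: str) -> dict[str, float]:
    counts = Counter()
    for path in Path(corpus_root).rglob("*"):
        lang = EXT_TO_LANG.get(path.suffix)
        if lang:
            counts[lang] += path.stat().st_size  # weight by bytes of source
    total = sum(counts.values()) or 1
    return {lang: size / total for lang, size in counts.items()}

if __name__ == "__main__":
    shares = language_share("./corpus")  # hypothetical corpus directory
    for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{lang:>10}: {share:.1%}")
```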
Genie's superior performance is due in part to its self-improvement training. Initially, the model learned mostly from perfect, working code, so it struggled to handle its own mistakes. Cosine addressed this with synthetic data: when Genie's initial solution was incorrect, the model was shown how to improve it using the correct result. With each iteration, Genie's solutions gradually improved and the number of revisions required decreased.
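Cosine has not disclosed the details of this procedure. The sketch below only illustrates the general idea of turning a failed attempt plus its known-correct counterpart into a new synthetic training example; the function names and the example format are hypothetical.

```python
# Illustrative sketch of self-improvement via synthetic data: when the model's
# first attempt is wrong, pair that attempt with the known-correct fix and add
# the pair back into the training set. Names are hypothetical placeholders.

def generate_fix(model, bug_report: str) -> str:
    """Ask the current model for a candidate fix (placeholder)."""
    raise NotImplementedError

def is_correct(candidate: str, reference: str) -> bool:
    """Check the candidate against the known-good solution, e.g. via tests (placeholder)."""
    raise NotImplementedError

def build_synthetic_examples(model, labelled_bugs):
    """labelled_bugs: iterable of (bug_report, reference_fix) pairs."""
    synthetic = []
    for bug_report, reference_fix in labelled_bugs:
        candidate = generate_fix(model, bug_report)
        if not is_correct(candidate, reference_fix):
            # Show the model its own flawed attempt next to the correct result,
            # so the next training round learns how to recover from mistakes.
            synthetic.append({
                "prompt": bug_report,
                "rejected": candidate,
                "chosen": reference_fix,
            })
    return synthetic
```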
Overcoming technical limitations
Pullen saw the potential of large language models to support human software development as early as 2022. However, the technology at the time was not yet at the level needed to realize Genie's vision. Context windows were typically limited to about 4,000 tokens, which was a major bottleneck. Today, models such as Gemini 1.5 Pro can handle up to 2 million tokens in a single prompt. Although Cosine has not disclosed Genie's specific token capacity, this technological advance undoubtedly provides a solid foundation for Genie's success.
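As a rough illustration of why the jump from roughly 4,000-token to multimillion-token context windows matters for repository-scale tasks, the sketch below estimates whether a codebase fits in a given window using a crude characters-per-token heuristic. The heuristic, file suffixes, and window sizes are assumptions for illustration; real tokenizers, and Genie's undisclosed limits, will differ.

```python
# Rough illustration: estimate whether a repository fits in a model's context window.
# Uses a crude ~4 characters-per-token heuristic, which is an assumption;
# real tokenizers (and Genie's undisclosed limits) will differ.
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text and code

def estimate_tokens(repo_root: str, suffixes=(".py", ".js", ".ts")) -> int:
    chars = 0
    for path in Path(repo_root).rglob("*"):
        if path.suffix in suffixes:
            chars += len(path.read_text(errors="ignore"))
    return chars // CHARS_PER_TOKEN

if __name__ == "__main__":
    tokens = estimate_tokens("./my_project")  # hypothetical project directory
    for window in (4_000, 128_000, 2_000_000):
        verdict = "fits" if tokens <= window else "does not fit"
        print(f"~{tokens:,} tokens: {verdict} in a {window:,}-token window")
```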
The emergence of Genie marks a major step forward in AI-assisted software development. Its coding capabilities and self-learning mechanism open new possibilities for future software development, and Cosine's approach offers fresh ideas for improving development efficiency and reducing costs, making it worth the industry's attention and further study.