The editor of Downcodes reports: OpenAI’s newly released chatbot systems topped the rankings in recent evaluations. They performed well in overall performance, safety, and technical capabilities, especially on STEM tasks. However, the number of ratings collected in this evaluation was relatively low, which may affect the final results, so they should be interpreted with caution.
OpenAI’s new systems achieved excellent results in recent evaluations, taking the top spot in the chatbot rankings. However, the low number of ratings may skew the assessment results.
According to the published overview, the new systems performed well in all assessment categories, including overall performance, safety, and technical capabilities. One of the systems, dedicated to STEM tasks, briefly shared second place with the GPT-4o version released in early September and took the lead in the technical categories.
Chatbot Arena, a platform for comparing different systems, evaluated the new systems using over 6,000 community ratings. The results showed that the new systems performed well on mathematical tasks, complex prompts, and programming.
However, the new systems have received far fewer ratings than established systems such as GPT-4o or Anthropic's Claude 3.5, with fewer than 3,000 reviews each. Such a small sample size may skew the assessment and limit the significance of the results.
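To illustrate why a small number of ratings limits what the rankings can tell us, here is a minimal sketch in Python. It assumes a simplified picture in which each rating is an independent head-to-head vote and uses the standard normal approximation for a proportion. The 55% win rate and the 30,000-vote comparison figure are hypothetical values chosen for illustration, not numbers from the evaluation.

```python
import math


def win_rate_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a win rate.

    p: observed win rate (hypothetical here), n: number of votes,
    z: normal quantile (1.96 corresponds to roughly 95% coverage).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Compare roughly 3,000 votes with a hypothetical, more mature 30,000 votes.
for n in (3_000, 30_000):
    margin = win_rate_margin(0.55, n)
    print(f"n={n}: win rate 0.55 +/- {margin:.4f}")
```

Under these assumptions, ten times as many votes narrows the interval by a factor of about 3.2 (the square root of 10), which is why rankings built on fewer than 3,000 reviews can still shift noticeably as more ratings arrive.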
OpenAI's new systems excel at math and coding, which were the main goals of their design. By "thinking" longer before answering, these systems aim to set new standards for AI reasoning. However, they do not outperform others in all areas. Many tasks do not require complex logical reasoning, and sometimes a quick response from another system is enough.
LMSYS's chart of model strength on mathematical tasks shows that the new systems scored over 1360, well above the other systems.
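To give a rough sense of what such a score gap means, the sketch below converts a rating difference into an expected head-to-head win probability. It assumes the scores sit on a conventional Elo-style scale (base 10, divisor 400); the article does not state the exact scale, and the rival score of 1300 is a hypothetical value used only for illustration.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo formula.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1360 is the score mentioned in the text; 1300 is a hypothetical competitor.
probability = expected_win_probability(1360, 1300)
print(f"Expected win probability: {probability:.3f}")
```

On that assumed scale, a 60-point lead corresponds to winning somewhat under 60% of direct comparisons, so a clear lead on the chart still leaves many individual matchups going the other way.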
Despite the limited sample size, the excellent performance of OpenAI's new system is still worthy of attention. Its breakthroughs in the fields of mathematics and coding provide a new direction for the development of AI reasoning technology. In the future, with the accumulation of more data and the continuous improvement of models, OpenAI's new system is expected to demonstrate its powerful capabilities in more fields. The editor of Downcodes will continue to pay attention to its development.