The latest version of OpenAI's o3 model has achieved a breakthrough that has stunned the AI research community. o3 scored an unprecedented 75.7% on the demanding ARC-AGI benchmark under standard compute conditions, with a high-compute configuration reaching 87.5%.
While impressive, this achievement on ARC-AGI does not mean that the code to artificial general intelligence (AGI) has been cracked.
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system's ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles after very few demonstrations, current AI systems struggle with them. ARC has long been regarded as one of the most challenging benchmarks in AI.
ARC is designed so that it cannot be cheated by training models on tens of thousands of examples in hopes of covering all possible combinations of puzzles.
The benchmark comprises a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles meant to test the generalization of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each, which are not shared with the public. These are used to evaluate candidate AI systems without risking leaking the data and contaminating future systems with prior knowledge. In addition, the competition places limits on the amount of compute participants can use, to ensure that the puzzles are not solved through brute-force methods.
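ARC tasks are distributed as JSON objects with a handful of "train" input/output grid pairs and one or more "test" inputs; each grid is a list of rows of integers 0–9 denoting colors. Here is a minimal sketch of the format with a toy task invented for illustration (not from the real dataset), whose hidden rule is "swap colors 1 and 2":

```python
# A toy ARC-style task: grids are lists of rows, cells are color codes 0-9.
# This task is invented for illustration; real tasks ship as JSON files
# with the same "train"/"test" structure.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 1]], "output": [[1, 1], [2, 2]]},
    ],
    "test": [{"input": [[0, 1], [2, 0]]}],
}

def swap_1_and_2(grid):
    """Hand-written rule for this toy task: swap colors 1 and 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(c, c) for c in row] for row in grid]

# Verify the candidate rule against the demonstration pairs, then apply it.
for pair in task["train"]:
    assert swap_1_and_2(pair["input"]) == pair["output"]

print(swap_1_and_2(task["test"][0]["input"]))  # [[0, 2], [1, 0]]
```

The solver must infer the rule from only the demonstration pairs, which is what makes the benchmark hard to game with pre-training alone.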
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another approach, developed by researcher Jeremy Berman, used a hybrid method combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to reach 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, described o3's performance as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models."
It is important to note that throwing more compute at previous generations of models could not achieve these results. For context, it took four years for models to progress from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. While we don't know much about o3's architecture, we can be confident that it is not orders of magnitude larger than its predecessors.
"This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs," Chollet wrote. "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain."
It is worth noting that o3's performance on ARC-AGI comes at a steep cost. On the low-compute configuration, it costs the model $17 to $20 and 33 million tokens to solve each puzzle, while on the high-compute budget, the model uses around 172X more compute and billions of tokens per problem. However, as inference costs continue to fall, we can expect these figures to become more reasonable.
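Assuming cost scales roughly linearly with compute (an assumption for illustration; OpenAI has not published high-compute pricing), the reported figures imply a rough upper bound for the high-compute setting:

```python
LOW_COST_USD = 20         # reported upper bound per puzzle, low-compute config
LOW_TOKENS = 33_000_000   # reported tokens per puzzle, low-compute config
COMPUTE_MULTIPLIER = 172  # reported high- vs. low-compute ratio

# Linear scaling is an assumption, not a reported figure.
high_cost = LOW_COST_USD * COMPUTE_MULTIPLIER
high_tokens = LOW_TOKENS * COMPUTE_MULTIPLIER
print(f"~${high_cost:,} and ~{high_tokens / 1e9:.1f}B tokens per puzzle")
# ~$3,440 and ~5.7B tokens per puzzle
```

Under that assumption, a single high-compute puzzle runs to thousands of dollars, consistent with the "billions of tokens" figure above.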
The key to solving novel problems is what Chollet and other scientists call "program synthesis." A reasoning system should be able to create small programs for solving specific problems, then combine those programs to tackle more complex problems. Classic language models have absorbed a lot of knowledge and contain a rich set of internal programs. But they lack compositionality, which prevents them from figuring out puzzles that lie outside their training distribution.
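A minimal sketch of the idea (illustrative only; this is not how o3 is implemented): keep a library of small primitive programs and search over their compositions until one reproduces all of a task's demonstration pairs.

```python
from itertools import product

# A tiny library of primitive grid programs (hypothetical examples).
def flip_h(grid):  # mirror each row left-to-right
    return [row[::-1] for row in grid]

def flip_v(grid):  # mirror the grid top-to-bottom
    return grid[::-1]

def increment(grid):  # bump every non-zero color by one (mod 10)
    return [[(c + 1) % 10 if c else 0 for c in row] for row in grid]

PRIMITIVES = [flip_h, flip_v, increment]

def synthesize(pairs, max_depth=3):
    """Brute-force search for a composition of primitives that maps
    every demonstration input to its output."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def program(grid, combo=combo):
                for f in combo:
                    grid = f(grid)
                return grid
            if all(program(i) == o for i, o in pairs):
                return program
    return None  # no composition found within the depth budget

# Demonstrations of the rule "flip horizontally, then increment colors".
pairs = [([[1, 0], [0, 2]], [[0, 2], [3, 0]]),
         ([[3, 3], [0, 1]], [[4, 4], [2, 0]])]
prog = synthesize(pairs)
print(prog([[5, 0], [0, 6]]))  # [[0, 6], [7, 0]]
```

Real program-synthesis systems replace the brute-force loop with learned guidance, but the compositional structure — small programs combined into bigger ones — is the point Chollet is making.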
Unfortunately, there is little information about how o3 works under the hood, and here scientists' opinions diverge. Chollet speculates that o3 uses a type of program synthesis in which chain-of-thought (CoT) reasoning is combined with a search mechanism and a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring in the past few months.
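A highly simplified sketch of reward-model-guided search (an illustration of the general pattern, not o3's actual mechanism): sample several candidate reasoning chains, score each with an evaluator, and keep the best. Both components below are hypothetical stand-ins.

```python
# Stand-ins for a real LLM sampler and a real reward model; everything
# here is hypothetical and only illustrates the best-of-n search pattern.
def generate_candidate(prompt, i):
    """Pretend sampler: the i-th candidate chain and its final answer."""
    answer = (i * 3) % 10  # varies deterministically with the sample index
    return {"chain": f"... therefore the answer is {answer}", "answer": answer}

def reward_model(prompt, candidate):
    """Pretend verifier: scores a candidate by closeness to 7
    (pretending 7 is the correct answer to the prompt)."""
    return -abs(candidate["answer"] - 7)

def best_of_n(prompt, n=16):
    """Best-of-n search: sample n reasoning chains, keep the top-scoring."""
    candidates = [generate_candidate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

best = best_of_n("What is 3 + 4?")
print(best["answer"])  # 7
```

More elaborate variants score partial chains and steer generation step by step rather than only ranking finished answers, which is closer to what Chollet describes.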
Some scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that "o1 and o3 may just be forward passes from one language model." On the day o3 was announced, Nat McAleese, a researcher at OpenAI, wrote on X that o1 was "just an LLM trained with RL. o3 is powered by further scaling up RL beyond o1."
On the same day, Denny Zhou of Google DeepMind's reasoning team called the combination of search and current reinforcement learning approaches a "dead end."
"The most beautiful thing about LLM reasoning is that the thought process is generated in an autoregressive way, rather than relying on search (e.g., mcts) over the generation space, whether by a well-finetuned model or a carefully designed prompt," he wrote on X.
While the details of how o3 reasons may seem trivial next to its breakthrough on ARC-AGI, they could well define the next paradigm shift in training LLMs. There is an ongoing debate over whether the laws of scaling LLMs through training data and compute have hit a wall. Whether test-time scaling depends on better training data or different inference architectures may determine the next step forward.
The name ARC-AGI is misleading, and some have equated it with solving AGI. However, Chollet stresses that "ARC-AGI is not an acid test for AGI."
"Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet," he writes. "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
Moreover, he notes that o3 cannot learn these skills on its own and relies on external verifiers during inference and on human-labeled reasoning chains during training.
Other scientists have pointed to flaws in OpenAI's reported results. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art results. "The solver should not need much specific 'training,' either on the domain itself or on each specific task," writes scientist Melanie Mitchell.
To verify whether these models possess the kind of abstraction and reasoning the ARC benchmark was designed to measure, Mitchell proposes "seeing if these systems can adapt to variants of specific tasks or to reasoning tasks using the same concepts, but in domains other than ARC."
Chollet and his team are currently working on a new benchmark that is challenging for o3, potentially reducing its score to under 30% even at a high-compute budget. Meanwhile, humans would be able to solve 95% of the puzzles without any training.
"You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible," Chollet writes.