Last month, AI founders and investors told TechCrunch that we're now in the "second era of scaling laws," noting how established methods of improving AI models were showing diminishing returns. One promising new method they suggested could keep gains coming was "test-time scaling," which seems to be what's behind the performance of OpenAI's o3 model, but it comes with drawbacks of its own.
Much of the AI world took the announcement of OpenAI's o3 model as proof that AI scaling progress has not "hit a wall." The o3 model does well on benchmarks, outperforming all other models on a test of general ability called ARC-AGI, and scoring 25% on a difficult math test on which no other AI model scored higher than 2%.
Of course, we at TechCrunch are taking all of this with a grain of salt until we can test o3 for ourselves (very few have tried it so far). But even before o3's release, the AI world was already convinced that something big had shifted.
The co-creator of OpenAI's o-series models, Noam Brown, noted on Friday that the startup is announcing o3's impressive gains just three months after it announced o1, a relatively short window for such a jump in performance.
"We have every reason to believe this trajectory will continue," said Brown in a tweet.
Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI "progress will be faster in 2025 than in 2024." (Keep in mind that it benefits Anthropic, and especially its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)
In the coming year, Clark expects the AI world to combine test-time scaling with traditional pre-training scaling methods to get even more gains out of AI models. Perhaps he's suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.
Test-time scaling means OpenAI is using more compute during ChatGPT's inference phase, the period after you press enter on a prompt. It's not clear exactly what's happening behind the scenes: OpenAI is either using more computer chips to answer a user's question, running more powerful inference chips, or running those chips for longer periods of time (10 to 15 minutes in some cases) before the AI produces an answer. We don't know all the details of how o3 was made, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
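OpenAI hasn't disclosed how o3 spends its extra inference compute, but one well-known way to trade compute for accuracy at test time is to sample a model several times and keep the most common answer (often called self-consistency voting). Here's a minimal sketch of that idea, using a made-up `answer_once` function as a stand-in for a real model call; it is an illustration of the general technique, not OpenAI's actual method:

```python
import random
from collections import Counter

def answer_once(question: str) -> str:
    """Stand-in for one model call: correct 60% of the time."""
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def answer_with_test_time_compute(question: str, samples: int) -> str:
    """Spend more inference compute by sampling `samples` times
    and returning the majority-vote answer."""
    votes = Counter(answer_once(question) for _ in range(samples))
    return votes.most_common(1)[0][0]

random.seed(0)
# Estimate accuracy of one sample vs. 25 samples with voting.
single = sum(answer_once("q") == "42" for _ in range(1000)) / 1000
voted = sum(answer_with_test_time_compute("q", 25) == "42" for _ in range(1000)) / 1000
print(single, voted)  # voting spends 25x the compute and is markedly more accurate
```

The point of the sketch is the trade-off, not the numbers: each extra sample costs another full model call, which is exactly why answers from a test-time-scaled model get more expensive.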
While o3 may give some renewed belief in the progress of AI scaling laws, OpenAI's newest model also uses an unprecedented amount of compute, which means a higher price per answer.
"Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time; the ability to utilize test-time compute means on some problems you can turn compute into a better answer," Clark writes in his blog. "This is interesting because it has made the costs of running AI systems somewhat less predictable; previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output."
Clark, and others, pointed to o3's performance on the ARC-AGI benchmark (a difficult test used to assess breakthroughs toward AGI) as an indicator of its progress. It's worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI, but rather it's one way to measure progress toward that nebulous goal. That said, the o3 model blew past the scores of all previous AI models that took the test, scoring 88% in one of its attempts. OpenAI's next best AI model, o1, scored just 32%.
But the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.
The creator of the ARC-AGI benchmark, François Chollet, writes in a blog that OpenAI used roughly 170x more compute to generate that 88% score, compared to a high-efficiency version of o3 that scored just 12% lower. The high-scoring version of o3 used more than $10,000 of resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
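Taking the per-task figures above at face value, a quick back-of-the-envelope comparison shows how fast these costs compound. The 100-task benchmark size below is an assumption for illustration only, not a figure from OpenAI or Chollet:

```python
# Per-task compute cost estimates drawn from the figures reported above.
# "A few cents" for o1-mini is approximated as $0.05.
COST_PER_TASK = {
    "o1-mini": 0.05,
    "o1": 5.00,
    "o3 (high-compute)": 1000.00,  # "more than $1,000" per task
}
TASKS = 100  # hypothetical benchmark size, for illustration

for model, cost in COST_PER_TASK.items():
    print(f"{model}: ${cost * TASKS:,.2f} for {TASKS} tasks")

ratio = COST_PER_TASK["o3 (high-compute)"] / COST_PER_TASK["o1"]
print(f"o3 high-compute costs {ratio:.0f}x more per task than o1")
```

Even at these rough numbers, a single high-compute o3 benchmark run lands in the six-figure range, which is why Chollet notes it can't qualify for the ARC Prize.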
Nevertheless, Chollet says o3 was still a breakthrough for AI models.
"o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain," said Chollet in the blog post. "Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: You could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did it), while consuming mere cents in energy."
It's premature to harp on the exact pricing of all this: we've seen prices for AI models plummet over the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these figures give a sense of just how much compute is required to break, even slightly, past the performance barriers set by today's leading AI models.
This raises some questions. What is o3 actually for? And how much more compute will it take to make further gains around inference with o4, o5, or whatever else OpenAI names its next reasoning models?
It doesn't seem like o3, or its successors, would be anyone's "daily driver" the way GPT-4o or Google Search might be. These models simply use too much compute to answer small questions throughout your day, such as, "How can the Cleveland Browns still make the 2024 playoffs?"
Instead, it seems like AI models with scaled test-time compute may only be good for big-picture prompts such as, "How can the Cleveland Browns become a Super Bowl franchise in 2027?" Even then, it's probably only worth the high compute costs if you're the general manager of the Cleveland Browns, and you're using these tools to make big decisions.
Institutions with deep pockets may be the only ones that can afford o3, at least to start, as Wharton professor Ethan Mollick notes in a tweet.
We've already seen OpenAI release a $200 tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.
But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails on some very easy tasks that a human would do quite easily.
This isn't necessarily surprising, as large language models still have a massive hallucination problem, which o3 and test-time compute don't seem to have solved. That's why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust answers at face value. Presumably AGI, if it's ever reached, would not need such a disclaimer.
One way to unlock more gains in test-time scaling could be better AI inference chips. There's no shortage of startups tackling just this thing, such as Groq or Cerebras, while other startups are designing more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch he expects these startups to play a bigger role in test-time scaling moving forward.
While o3 is a notable improvement in the performance of AI models, it raises several new questions around usage and costs. That said, o3's performance does add credence to the claim that test-time compute is the tech industry's next best way to scale AI models.