OpenAI saved its biggest announcement for the last day of its 12-day “shipmas” event.
On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. o3 is a model family, to be more precise, as was the case with o1. There is o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.
OpenAI makes the remarkable claim that o3, at least under certain conditions, approaches AGI, with significant caveats. More on that below.
o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK
— Greg Brockman (@gdb) December 20, 2024
Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman confirmed this during a livestream this morning. Strange world we live in, isn’t it?
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime later; OpenAI didn’t specify when. Altman said the plan is to launch o3-mini toward the end of January and follow with o3.
That somewhat contradicts his recent statements. In an interview this week, Altman said that, before OpenAI releases new reasoning models, he would prefer a federal testing framework to guide the monitoring and mitigation of the risks such models pose.
And there are risks. AI safety testers have found that o1’s reasoning abilities lead it to try to deceive human users at a higher rate than conventional, “non-reasoning” models (or, for that matter, leading AI models from Meta, Anthropic, and Google). It is possible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll find out once OpenAI’s red-team partners release their testing results.
For what it’s worth, OpenAI says it used a new technique, “deliberative alignment,” to align models like o3 with its safety principles. (o1 was aligned the same way.) The company has detailed its work in a new study.
Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.
This fact-checking process incurs latency. o3, like o1 before it, takes longer (usually seconds to minutes longer) to arrive at answers compared to a conventional non-reasoning model. The upside? It tends to be more reliable in domains like physics, science, and math.
o3 was trained via reinforcement learning to “think” before responding, through what OpenAI describes as a “private chain of thought.” The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution.
We announced @OpenAI o1 just three months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. pic.twitter.com/Ia0b63RXIk

— Noam Brown (@polynoamial) December 20, 2024
In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.
o1 was the first large reasoning model; as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very impressive. (2/n)

— Nat McAleese (@__nmca__) December 20, 2024
What’s new about o3 versus o1 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e., thinking time). The higher the compute, the better o3 performs on a task.
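To make that low/medium/high setting concrete, here is a minimal sketch of what selecting the compute level could look like from an API, assuming OpenAI exposes it as a `reasoning_effort` parameter on a standard chat-completions call; the parameter name, the model ID, and its availability are assumptions on our part, since the announcement did not include developer details.

```python
# Hypothetical sketch of adjusting a reasoning model's "thinking time."
# Assumes the OpenAI Python SDK accepts a `reasoning_effort` parameter
# ("low" | "medium" | "high") on a chat-completions call; the parameter
# name and the "o3-mini" model ID are assumptions, not confirmed details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",          # assumed model ID for the smaller o3 variant
    reasoning_effort="high",  # more compute: longer "thinking," better answers
    messages=[
        {"role": "user", "content": "How many primes are there below 100?"}
    ],
)
print(response.choices[0].message.content)
```

Framed this way, the tradeoff described above becomes explicit: raising the setting buys reliability on hard problems at the cost of latency and compute spend.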
No matter how much compute they are given, though, reasoning models like o3 are not flawless. The reasoning component can reduce hallucinations and errors, but it doesn’t eliminate them. o1 trips up on games of tic-tac-toe, for instance.
One of the big questions leading up to today was whether OpenAI might claim that its newest models are approaching AGI.
AGI, short for “artificial general intelligence,” broadly refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold declaration, and it carries contractual weight for OpenAI as well. According to the terms of its deal with close partner and investor Microsoft, once OpenAI achieves AGI, it is no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s AGI definition, that is).
Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 scored 87.5% on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.
Granted, the high compute setting was exceedingly expensive, on the order of thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.
Today OpenAI announced o3, its next-gen reasoning model. We’ve worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.

It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA

— François Chollet (@fchollet) December 20, 2024
Chollet has also said that o3 fails on “very easy tasks” in ARC-AGI, indicating, in his opinion, that the model exhibits “fundamental differences” from human intelligence. He has previously noted the evaluation’s limitations and cautioned against using it as a measure of AI intelligence.
“[E]arly data points suggest that the upcoming (successor to ARC-AGI) benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training),” Chollet continued. “You’ll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
Incidentally, OpenAI says it will partner with the foundation behind ARC-AGI to help develop the next generation of its AI benchmark, ARC-AGI 2.
On some tests, o3 beats the competition.
The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on software engineering tasks, and achieves a Codeforces rating (another measure of coding ability) of 2727. (A rating of 2400 places an engineer in the 99.2nd percentile.) o3 scored 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieved 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 set a new record on EpochAI’s Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.
We trained o3-mini: both more capable than o1-mini, and around 4x faster end-to-end when accounting for reasoning tokens

with @ren_hongyu @shengjia_zhao and others pic.twitter.com/3Cujxy6yCU

— Kevin Lu (@_kevinlu) December 20, 2024
These claims have to be taken with a grain of salt, of course. They’re based on OpenAI’s internal evaluations. We’ll have to wait and see how the model holds up to benchmarking from outside customers and organizations in the future.
Following the release of OpenAI’s first reasoning models, there has been an explosion of reasoning models from rival AI companies, including Google. In early November, DeepSeek, an AI research company funded by quantitative traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba’s Qwen team unveiled what it claimed was the first “open” challenger to o1 (in the sense that it could be downloaded, fine-tuned, and run locally).
What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As TechCrunch recently reported, “brute force” techniques to scale up models are no longer yielding the improvements they once did.
Not everyone is convinced that reasoning models are the best path forward. They tend to be expensive, for one, because of the large amount of computing power required to run them. And while they’ve performed well on benchmarks so far, it’s not clear whether reasoning models can maintain this rate of progress.
Interestingly, the release of o3 comes as one of OpenAI’s most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI’s “GPT series” of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he’s leaving to pursue independent research.
TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.