OpenAI’s New Model Matches Humans on a Test of ‘General Intelligence.’ What does that mean?


A new Artificial Intelligence (AI) model has reached human-level results on a test designed to measure “general intelligence”.

On December 20, OpenAI’s o3 system scored 85% on the ARC-AGI benchmark, well above the previous AI best score of 55% and on par with the average human score. It also scored well on a very difficult mathematics test.

Developing artificial general intelligence, or AGI, is the stated goal of all the major AI laboratories. At first glance, OpenAI appears to have taken a significant step towards this goal.

Although skepticism remains, many AI researchers and developers feel that something has shifted. For many, the prospect of AGI now seems more real, more urgent and closer than they had anticipated. Are they right?

Generalization is intelligence

To understand what the o3 result means, you need to understand what the ARC-AGI test measures. In technical terms, it is a test of an AI system’s “sample efficiency” in adapting to something new – how many examples of a novel situation the system needs to see before it figures out how it works.

An AI system like ChatGPT (GPT-4) is not very sample efficient. It was “trained” on vast quantities of human text, constructing probabilistic “rules” about which combinations of words are most likely.

The result is that it is quite good at common tasks. It is bad at uncommon tasks, because it has less data (fewer samples) about those tasks.

Until AI systems can learn from small numbers of examples and adapt with greater sample efficiency, they will only be used for very repetitive jobs, and ones where the occasional failure can be tolerated.

The ability to accurately solve previously unknown or novel problems from limited samples of data is known as the capacity to generalize. It is widely considered a necessary, even fundamental, element of intelligence.

Grids and patterns

The ARC-AGI benchmark tests for sample-efficient adaptation using small grid problems like the one below. The AI needs to figure out the pattern that turns the grid on the left into the grid on the right.

[Figure: Colored shapes on a black background – a sample task from the ARC-AGI benchmark. Credit: ARC Prize]

Each puzzle provides three examples to learn from. The AI system then has to figure out rules that “generalize” from the three examples to a fourth.
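The idea of generalizing from a few example grids can be sketched in code. The following toy Python example is an illustrative invention, not the real ARC-AGI task format or anything resembling o3’s method: grids are lists of color codes, the candidate “rules” are a handful of simple geometric transforms, and the solver keeps only the rules consistent with every training pair before applying one to a new input.

```python
# Toy sketch of generalizing from three example grids to a fourth.
# The grids, rule names and transforms below are illustrative, not
# the actual ARC-AGI format or solution method.

def flip_h(grid):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def flip_v(grid):
    """Mirror the grid top-to-bottom."""
    return grid[::-1]

def rotate(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

CANDIDATE_RULES = {"flip_h": flip_h, "flip_v": flip_v, "rotate": rotate}

def consistent_rules(train_pairs):
    """Return the names of rules that explain every training example."""
    return [name for name, fn in CANDIDATE_RULES.items()
            if all(fn(x) == y for x, y in train_pairs)]

# Three training examples; all are explained by a horizontal flip.
train = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0], [0, 4]], [[0, 3], [4, 0]]),
    ([[5, 6], [0, 0]], [[6, 5], [0, 0]]),
]

rules = consistent_rules(train)                     # ["flip_h"]
answer = CANDIDATE_RULES[rules[0]]([[7, 0], [0, 8]])  # apply to test input
```

The real benchmark is vastly harder: its rules are not drawn from a small fixed menu, which is precisely why solving it demands genuine generalization rather than enumeration.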

These are a lot like the IQ tests you may remember from school.

Weak rules and adaptation

We don’t know exactly how OpenAI has done it, but the results suggest the o3 model is highly adaptable. From just a few examples, it finds rules that generalize.

To figure out a pattern, we should not make any unnecessary assumptions, or be more specific than we really have to be. In theory, if you can identify the “weakest” rules that do what you want, then you have maximized your ability to adapt to new situations.

What do we mean by the weakest rules? The technical definition is complicated, but weaker rules are usually ones that can be described in simpler statements.

In the example above, a plain-English expression of the rule might be something like: “Any shape with a protruding line will move to the end of that line and ‘cover up’ any other shape it overlaps with.”

Searching chains of thought?

While we don’t yet know how OpenAI achieved this result, it seems unlikely they deliberately optimized the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them.

We do know that OpenAI started with a general-purpose version of the o3 model (which differs from most other models because it can spend more time “thinking” about difficult questions) and then trained it specifically for the ARC-AGI test.

French AI researcher Francois Chollet, who designed the benchmark, believes o3 searches through different “chains of thought” describing steps to solve the task. It would then choose the “best” according to some loosely defined rule, or “heuristic”.

This would be “not dissimilar” to how Google’s AlphaGo system searched through different possible sequences of moves to beat the world Go champion.

You can think of these chains of thought as programs that fit the examples. Of course, if it is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.

There could be thousands of different, seemingly equally valid programs generated. That heuristic could be “choose the weakest” or “choose the simplest”.
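A “choose the simplest” heuristic can be illustrated with a minimal sketch. Everything here is an invented stand-in: the candidate programs are plain-English rule descriptions, and word count serves as a crude proxy for description length. A real system would score actual programs, perhaps by their length in a formal language, but the selection principle is the same.

```python
# Illustrative sketch of a "choose the simplest" heuristic.
# The candidate "programs" and the word-count measure are invented
# stand-ins, not how o3 or any real solver represents programs.

candidate_programs = [
    "move shape to end of line; cover overlapped shapes",
    "if shape colour is blue and row index is even, move shape to end of "
    "line; otherwise leave it; then cover overlapped shapes of same colour",
]

def description_length(program: str) -> int:
    """Crude complexity proxy: number of words in the description."""
    return len(program.split())

# Among programs that all fit the examples, prefer the shortest one.
simplest = min(candidate_programs, key=description_length)
```

Preferring shorter descriptions is a rough stand-in for preferring “weaker” rules: a rule with fewer conditions makes fewer unnecessary assumptions, and so is more likely to carry over to unseen inputs.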

However, if it is like AlphaGo, then they simply had an AI create the heuristic. This was AlphaGo’s process: Google trained a model to rate different sequences of moves as better or worse than others.

What we don’t know yet

The question then is: is this really closer to AGI? If that is how o3 works, then the underlying model might not be much better than previous models.

The concepts the model learns from language might not be any more suitable for generalization than before. Instead, we may just be seeing familiar “concepts” surfaced through the extra steps of specialized training on this test. The proof, as always, will be in the pudding.

Almost everything about o3 remains unknown. OpenAI has limited its disclosure to a few media presentations and early testing by a handful of researchers, laboratories and AI safety institutions.

Truly understanding the potential of o3 will require extensive work, including evaluations, an understanding of the distribution of its capacities, how often it fails and how often it succeeds.

When o3 is finally released, we will have a much better idea of whether it is approximately as adaptable as an average human.

If so, it could have a huge, transformative economic impact, ushering in a new era of accelerated, self-improving innovation. We will need new benchmarks for AGI itself, and serious consideration of how it ought to be governed.

If not, then this will still be an impressive result. However, everyday life will remain much the same.

Michael Timothy Bennett, PhD Student, School of Computing, Australian National University, and Elija Perrier, Research Fellow, Stanford Center for Responsible Quantum Technology, Stanford University

This article is republished from The Conversation under a Creative Commons license. Read the original article.


