Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024


When a company releases a new AI video generator, it doesn’t take long for someone to use it to create a video of actor Will Smith eating spaghetti.

It’s become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Will Smith and pasta is one of several strange “unofficial” benchmarks that took the AI community by storm in 2024. A 16-year-old developer created an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British developer created a platform where AIs play games like Pictionary and Connect 4 against each other.

It’s not like there’s a shortage of more formal AI tests. So why did the weird ones blow up?

Photo of LLM
Image credit: Paul Calcraft

For one thing, most AI benchmarks don’t tell the average person very much. Companies often tout their AI’s ability to answer questions on Math Olympiad exams, or to work out plausible solutions to Ph.D.-level problems. Yet most people – yours truly included – use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures are not necessarily better or more informative.

Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow closely. Chatbot Arena lets anyone on the internet rate how well AI performs on specific tasks, such as creating a web app or generating an image. But raters tend not to be representative – many come from the AI and technology sectors – and cast votes based on personal preferences that are hard to pin down.

Features of the Chatbot Arena. Image credit: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many of the industry’s AI benchmarks: they don’t compare a system’s performance to that of an average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Strange AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are certainly not empirical – or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it will render, say, a burger well.

Mcbench
Note the typo; there is no such model as Claude 3.6 Sonnet. Image credit: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI rather than its capabilities in narrow domains. That’s sensible. But I have a feeling the strange benchmarks won’t go away anytime soon. Not only are they fun – who doesn’t love watching an AI build Minecraft houses? – but they are easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to struggle with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which new benchmarks will go viral in 2025?