Stony Brook University

11/21/2024 | News release | Distributed by Public on 11/21/2024 11:00

Study Finds NYT Connections Game Defeats Best AI Models

A study conducted by Tuhin Chakrabarty, assistant professor in the Department of Computer Science at Stony Brook, along with a team of researchers at Columbia University, found that the New York Times word game 'Connections' can serve as a challenging benchmark for evaluating the abstract reasoning abilities of Large Language Models (LLMs).

While AI and machine learning regularly beat the world's greatest chess players, the study found that when it comes to 'Connections,' even the best-performing LLM, Claude 3.5 Sonnet, can fully solve only 18% of the games. The study examined AI's performance on over 400 Connections games and found that both novice and expert players do better than AI at solving the puzzle.

In the game, players are presented with a 4×4 grid containing 16 words. The task is to group these words into four clusters of four words each, based on their shared characteristics. For example, the words 'Followers,' 'Sheep,' 'Puppets,' and 'Lemmings' form one group, because they can all be categorized as 'Conformists.'
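The grid-and-clusters structure described above can be sketched in a few lines of Python. In this illustration, the 'Conformists' group comes from the article's example; the other three category labels and their words are hypothetical placeholders, not taken from any actual puzzle:

```python
# Minimal sketch of a Connections-style game: an answer key mapping four
# category labels to four-word groups, plus a scorer for a proposed solution.
# Only the "Conformists" group is from the article; the rest are stand-ins.

ANSWER_KEY = {
    "Conformists": {"Followers", "Sheep", "Puppets", "Lemmings"},
    "Hypothetical A": {"A1", "A2", "A3", "A4"},
    "Hypothetical B": {"B1", "B2", "B3", "B4"},
    "Hypothetical C": {"C1", "C2", "C3", "C4"},
}

def board(key):
    """All 16 words that would appear on the 4x4 grid."""
    return set().union(*key.values())

def solved_groups(key, guess):
    """Count how many guessed clusters exactly match an answer group."""
    return sum(1 for cluster in guess if set(cluster) in key.values())

# A guess that nails "Conformists" and "Hypothetical C" but swaps one
# word between the A and B groups solves 2 of the 4 groups.
guess = [
    ["Followers", "Sheep", "Puppets", "Lemmings"],
    ["A1", "A2", "A3", "B4"],
    ["B1", "B2", "B3", "A4"],
    ["C1", "C2", "C3", "C4"],
]
print(len(board(ANSWER_KEY)))        # 16 words on the grid
print(solved_groups(ANSWER_KEY, guess))  # 2
```

A solver, human or LLM, must find the one partition of the 16 words into four groups that matches the answer key exactly, which is what makes near-miss groupings so costly.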

To group words across proper categories, a player must be able to reason with various forms of knowledge - from semantic knowledge (about 'conformists') to encyclopedic knowledge.


"While the task might seem easy to some, many of these words can be easily grouped into several other categories," Chakrabarty said. "For example, 'Likes,' 'Followers,' 'Shares,' and 'Insult' might be categorized as 'Social Media Interactions' at first glance." Such plausible-but-wrong groupings act as red herrings; the game is designed around them, and that is what makes it interesting.

The research also found that LLMs are relatively better at reasoning that involves semantic relations ('happy,' 'glad,' 'joyful') but they struggle with other types of knowledge, such as multiword expressions ('to kick the bucket' means 'to die') and combined knowledge about word form and word meaning (adding the prefix 'un-' to the verb 'do' creates the word 'undo' with the opposite meaning).

The study tested five LLMs - Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4 Omni, Meta's Llama 3.1 405B, and Mistral Large 2 - on 438 NYT Connections games, and compared the results with human performance on a subset of these games. Results showed that while all LLMs could partially solve some games, "their performance was far from ideal."

Read the full story at the AI Innovation Institute website.