The New York Times 'Connections': A Challenging Benchmark for LLM Training

A study by Tuhin Chakrabarty, an assistant professor in the Department of Computer Science at Stony Brook University, along with researchers from Columbia University, finds that the New York Times word game 'Connections' is a tough yardstick for training Large Language Models (LLMs) in abstract reasoning. The result challenges the common perception that AI and machine learning always outshine human capabilities.

AI's Performance in 'Connections'

While AI often dominates in games like chess, the study reveals a different story in 'Connections'. Even the top-performing LLM, Claude 3.5 Sonnet, manages to solve only 18% of the games. The research analyzed AI's responses to over 400 'Connections' games and found that both novice and expert human players outperform AI in solving this puzzle.

In the game, players are presented with a 4×4 grid of 16 words and tasked with grouping them into four clusters of four words each based on shared characteristics. For instance, 'Followers,' 'Sheep,' 'Puppets,' and 'Lemmings' form a group as they are 'Conformists.' To group words accurately, one needs to reason with various forms of knowledge, including semantic and encyclopedic knowledge.

Tuhin Chakrabarty emphasizes that although the task may seem straightforward to some, many words can be grouped into multiple categories, creating red herrings. This is precisely what makes the game more engaging.
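To make the grouping rule concrete, here is a minimal Python sketch of how a proposed answer to one puzzle might be checked. It is an illustration only, not the scoring code used in the study. The function names are hypothetical, and apart from the 'Conformists' group quoted above, the example words and categories are invented placeholders.

```python
GROUP_SIZE = 4
NUM_GROUPS = 4


def normalize(words):
    """Lower-case and trim words so comparison ignores case and spacing."""
    return frozenset(word.strip().lower() for word in words)


def is_valid_grouping(proposed, grid_words):
    """Check that a guess splits the 16 grid words into four groups of four."""
    groups = [normalize(group) for group in proposed]
    used = set().union(*groups) if groups else set()
    return (
        len(groups) == NUM_GROUPS
        and all(len(group) == GROUP_SIZE for group in groups)
        and used == normalize(grid_words)
        and len(used) == GROUP_SIZE * NUM_GROUPS
    )


def count_solved_groups(proposed, solution):
    """Count proposed groups that exactly match a group in the solution."""
    solution_sets = {normalize(group) for group in solution}
    return sum(normalize(group) in solution_sets for group in proposed)


# Only the first group comes from the article; the rest are invented.
solution = [
    ["Followers", "Sheep", "Puppets", "Lemmings"],  # 'Conformists'
    ["Happy", "Glad", "Joyful", "Cheerful"],        # hypothetical
    ["Red", "Blue", "Green", "Yellow"],             # hypothetical
    ["Oak", "Pine", "Elm", "Maple"],                # hypothetical
]
grid_words = [word for group in solution for word in group]

# A guess that swaps 'Sheep' and 'Oak', spoiling two of the four groups.
guess = [
    ["Followers", "Oak", "Puppets", "Lemmings"],
    ["Happy", "Glad", "Joyful", "Cheerful"],
    ["Red", "Blue", "Green", "Yellow"],
    ["Sheep", "Pine", "Elm", "Maple"],
]

solved = count_solved_groups(guess, solution)
fully_solved = solved == NUM_GROUPS
```

In this sketch a game counts as solved only when all four groups match exactly, so one misplaced word spoils two groups at once. Partial credit could instead be reported as the number of matching groups.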

LLMs' Strengths and Weaknesses

The research also highlights that LLMs are relatively better at reasoning about semantic relations, such as the one linking 'happy,' 'glad,' and 'joyful.' However, they struggle with other types of knowledge, such as multiword expressions like 'to kick the bucket' (meaning 'to die') and combined knowledge about word form and meaning (adding 'un-' to 'do' creates 'undo' with the opposite meaning).

When tested on 438 NYT Connections games with five LLMs (Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4 Omni, Meta's Llama 3.1 405B, and Mistral Large 2), the results showed that while all LLMs could partially solve some games, their performance was far from perfect.

Read the full story at the AI Innovation Institute website.
