Cornell University

09/09/2024 | Press release | Distributed by Public on 09/09/2024 08:41

Brevity is money when using AI for data analysis

It pays to be brief when asking artificial intelligence tools to mine massive datasets for insights, according to Cornell researcher Immanuel Trummer.

That's why Trummer, associate professor of computer science in the Cornell Ann S. Bowers College of Computing and Information Science, has developed a new computational system called Schemonic. It cuts the cost of using large language models (LLMs) such as ChatGPT and Google Bard by combing large datasets and generating what amounts to "CliffsNotes" versions of the data that the models can understand. Schemonic can cut the cost of using LLMs by as much as tenfold, Trummer said.

"The monetary fees associated with using large language models are non-negligible," said Trummer, the author of "Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models," which was presented at the 50th International Conference on Very Large Data Bases (VLDB), held Aug. 26-30 in Guangzhou, China. "I think it's a problem everyone who is using these models has."

LLMs are the powerful algorithms behind generative AI, and they've advanced to the point where they can process large datasets and show - by way of the computer code they generate - where to find patterns and insights in data. Even those without technical backgrounds can leverage these tools, Trummer said.

But getting LLMs to understand and process large datasets is tricky and potentially costly, since companies behind the models charge processing fees based on the number of individual "tokens" - words and numbers - within a dataset. A large dataset can have billions of tokens or more, and charges rack up each time users query the LLM, Trummer said.
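The arithmetic behind these charges can be sketched roughly as follows, assuming a flat price per 1,000 input tokens and a data description that is resent with every question. The function name, token counts and price are hypothetical, chosen only for illustration; they are not taken from any provider's price list or from Trummer's paper.

```python
def prompt_cost(description_tokens, question_tokens, requests, price_per_1k_tokens):
    """Total input cost when the data description is resent with every request.

    All parameters are illustrative assumptions, not real provider pricing.
    """
    tokens_per_request = description_tokens + question_tokens
    return requests * tokens_per_request * price_per_1k_tokens / 1000


# Hypothetical workload: 100,000 questions of about 50 tokens each.
full_cost = prompt_cost(description_tokens=5000, question_tokens=50,
                        requests=100_000, price_per_1k_tokens=0.01)
short_cost = prompt_cost(description_tokens=500, question_tokens=50,
                         requests=100_000, price_per_1k_tokens=0.01)

print(f"full description:  ${full_cost:,.2f}")
print(f"short description: ${short_cost:,.2f}")
```

Because the description is paid for on every request, shrinking it reduces the bill in proportion to the number of questions asked.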

"If you have hundreds of thousands of users who all submit many questions about your dataset, you pay the cost of reading the data description for each request over and over again," said Trummer, whose research explores how to make data analysis more efficient and user friendly. "The costs can quickly ramp up."

The key is providing the LLM some concise direction, in as few tokens as possible, on what the dataset contains and how it's organized, he said.

That's where Schemonic comes in. Its abbreviated descriptions of the database structure are enough for LLMs to work their magic at a fraction of the cost, he said.

"Schemonic basically detects a pattern about the data structure that could be summarized concisely," he said. "This approach compresses structured data optimally in order to minimize the amount of money you'd have to pay."

Compressing information often comes with a quality tradeoff, but Schemonic's generated descriptions are guaranteed to be semantically correct, Trummer said. Further, state-of-the-art LLMs like OpenAI's GPT-4 model can understand abbreviated descriptions from Schemonic without any negative impact on their output quality, he said.
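To give a feel for what a shorter description of a data structure might look like, here is a toy sketch. It is not Schemonic's method, which is described in the VLDB paper. It only groups columns that share a type so the type name is written once per group. The schema, the function names and the whitespace-based token count are all assumptions made for illustration; real LLM tokenizers count differently.

```python
from collections import defaultdict

# Hypothetical schema: table name -> {column name: column type}.
SCHEMA = {
    "orders": {"id": "int", "customer_id": "int", "total": "float", "created": "date"},
    "customers": {"id": "int", "name": "text", "city": "text", "joined": "date"},
}


def verbose_description(schema):
    """One full sentence per column."""
    parts = []
    for table, columns in schema.items():
        for column, col_type in columns.items():
            parts.append(f"Table {table} has column {column} of type {col_type}.")
    return " ".join(parts)


def compact_description(schema):
    """Group columns by type so each type is named once per table."""
    parts = []
    for table, columns in schema.items():
        by_type = defaultdict(list)
        for column, col_type in columns.items():
            by_type[col_type].append(column)
        groups = "; ".join(f"{t}: {', '.join(cols)}" for t, cols in by_type.items())
        parts.append(f"{table}({groups})")
    return " ".join(parts)


def rough_token_count(text):
    """Crude proxy: whitespace-separated pieces, not a real tokenizer."""
    return len(text.split())


long_text = verbose_description(SCHEMA)
short_text = compact_description(SCHEMA)
print(short_text)
print(rough_token_count(long_text), "vs", rough_token_count(short_text))
```

Both descriptions name every table, column and type, so nothing about the structure is lost; only the repetition is removed.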

"There are many use cases for LLMs in data analysis, ranging from translating questions about the data into formal queries, extracting tabular data from text, to finding semantic relationships between different datasets," Trummer said. "All require you to describe the data structure to the LLM, so Schemonic helps you save money in all of these use cases."

Louis DiPietro is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.