
Examining synthetic data: The promise, risks and realities

As artificial intelligence reshapes industries worldwide, developers are grappling with an unexpected challenge: a shortage of high-quality, real-world data to train their increasingly sophisticated models. Now, a potential solution is emerging from an unlikely source: data that doesn't exist in reality at all.

Synthetic data, artificially generated information designed to mimic real-world scenarios, is rapidly gaining traction in AI development. It promises to overcome data bottlenecks, address privacy concerns, and reduce costs. However, as the field evolves, questions about its limitations and real-world impact are coming to the fore.

The rise of synthetic data

Tech giants are betting big on synthetic data. NVIDIA recently announced Nemotron-4 340B, a family of open models designed to generate synthetic data for training large language models (LLMs) across various industries. This move addresses a critical challenge in AI development: the prohibitively high cost and difficulty of accessing robust datasets.

"High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM," NVIDIA wrote on its blog. The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline for generating and refining synthetic data, potentially accelerating the development of powerful, domain-specific LLMs.

IBM researcher Akash Srivastava explains that in the context of large language models, synthetic data is often generated by one AI model to train or customize another. "Researchers and developers in the industry are using these models to generate data for particular target tasks," Srivastava notes.
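As a rough illustration of that pattern, the sketch below uses a small, freely available generative model (GPT-2 via the Hugging Face transformers library, chosen only for convenience, not because it reflects IBM's tooling) to draft examples for a hypothetical support task and writes them to a JSONL file that a second model could later be fine-tuned on.

```python
# A minimal sketch of one model generating training data for another, assuming the
# Hugging Face transformers library is installed. GPT-2 is used purely as a small,
# freely available placeholder for a "teacher" model; the task is hypothetical.

import json
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

seed_prompts = [
    "Customer question: How do I reset my password? Support answer:",
    "Customer question: Why was my payment declined? Support answer:",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in seed_prompts:
        outputs = generator(prompt, max_new_tokens=40, do_sample=True, num_return_sequences=2)
        for candidate in outputs:
            # Keep only the generated continuation as the target completion.
            record = {"prompt": prompt, "completion": candidate["generated_text"][len(prompt):]}
            f.write(json.dumps(record) + "\n")
```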

Investigators from MIT-IBM Watson AI Lab and IBM Research recently introduced a new approach to improving LLMs using synthetic data. The method, called LAB (Large-scale Alignment for chatBots), aims to reduce reliance on human annotations and proprietary AI models like GPT-4.


LAB employs a taxonomy-guided synthetic data generation process and a multi-phase training framework. The researchers report, "LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data."

To demonstrate LAB's effectiveness, the team created two models, LABRADORITE-13B and MERLINITE-7B, which reportedly outperformed other fine-tuned versions of the same base models on several key metrics. The researchers used the open-source Mixtral model to generate synthetic training data, potentially offering a more cost-effective approach to enhancing LLMs.
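The LAB paper specifies its own taxonomy and prompting in detail; the toy sketch below only illustrates the general idea of walking a skills taxonomy to build generation prompts for a teacher model. The taxonomy, seed tasks and prompt template here are invented for illustration.

```python
# A toy illustration of taxonomy-guided prompt construction, loosely inspired by the
# LAB description above. The taxonomy and prompt template are invented, not IBM's.

taxonomy = {
    "writing": {
        "summarization": ["Summarize this meeting transcript in three bullet points."],
        "email": ["Draft a polite follow-up email about a late invoice."],
    },
    "reasoning": {
        "arithmetic": ["A train leaves at 3 pm and travels 120 km at 60 km/h. When does it arrive?"],
    },
}

def leaf_paths(node, path=()):
    """Yield (path, seed_examples) for every leaf in the taxonomy."""
    for key, value in node.items():
        if isinstance(value, dict):
            yield from leaf_paths(value, path + (key,))
        else:
            yield path + (key,), value

def build_generation_prompts(taxonomy):
    """Turn each leaf's seed examples into prompts for a teacher model."""
    prompts = []
    for path, seeds in leaf_paths(taxonomy):
        skill = " / ".join(path)
        for seed in seeds:
            prompts.append(
                f"You are generating training data for the skill '{skill}'.\n"
                f"Here is one example task: {seed}\n"
                f"Write 3 new, diverse tasks of the same kind with high-quality answers."
            )
    return prompts

for p in build_generation_prompts(taxonomy):
    print(p, end="\n\n")
```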

The quality of synthetic data is crucial for its effectiveness. Raul Salles de Padua, Director of Engineering, AI and Quantum at Multiverse Computing, explains, "The fidelity of synthetic data is calculated by comparing it to real-world data through statistical and analytical tests. This includes an assessment of how well the synthetic data preserves key statistical properties, such as means, variances and correlations between variables."
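In practice, a fidelity check of that kind can be as simple as comparing summary statistics and distributions column by column. The sketch below, using NumPy, pandas and SciPy on randomly generated stand-in data, illustrates one minimal version of such a comparison.

```python
# A minimal sketch of a statistical fidelity check: compare means, variances,
# per-feature distributions and correlation structure between a real and a synthetic
# table. The data here is randomly generated purely for illustration.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(45, 12, 1000), "income": rng.normal(60_000, 15_000, 1000)})
synthetic = pd.DataFrame({"age": rng.normal(46, 11, 1000), "income": rng.normal(58_000, 16_000, 1000)})

def fidelity_report(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in real.columns:
        ks_stat, ks_p = stats.ks_2samp(real[col], synthetic[col])  # distribution similarity
        rows.append({
            "feature": col,
            "mean_diff": synthetic[col].mean() - real[col].mean(),
            "var_ratio": synthetic[col].var() / real[col].var(),
            "ks_statistic": ks_stat,
            "ks_p_value": ks_p,
        })
    return pd.DataFrame(rows)

print(fidelity_report(real, synthetic))
# Correlation structure: compare the full correlation matrices as well.
print("max correlation gap:", (real.corr() - synthetic.corr()).abs().to_numpy().max())
```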

Despite its promise, synthetic data isn't without challenges. De Padua points out, "The challenge with synthetic data is in creating data that is both useful and privacy-preserving. Without putting these safeguards in place, synthetic data could reveal personal details, potentially leading to identity theft, discrimination or other privacy violations."
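One concrete, if basic, safeguard is to verify that no generated record is a copy, or near copy, of a real one. The sketch below illustrates that idea on stand-in numeric records; it is only an illustrative check, not a formal privacy guarantee such as differential privacy.

```python
# A simple leakage check in the spirit of the concern above: flag synthetic records
# that exactly (or nearly) duplicate a real record. Illustrative only.

import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 4))        # stand-in for real, numeric, normalized records
synthetic = rng.normal(size=(200, 4))   # stand-in for generated records
synthetic[0] = real[10]                 # plant one exact copy to show the check firing

def nearest_real_distance(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic record to its closest real record."""
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

distances = nearest_real_distance(synthetic, real)
too_close = distances < 1e-6
print(f"{too_close.sum()} synthetic record(s) duplicate a real record")
```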

Recent research has uncovered potential pitfalls in relying too heavily on synthetic data. A study published in Nature documented a phenomenon called "model collapse": when AI models are repeatedly trained on AI-generated text, their outputs can become increasingly nonsensical. The finding raises concerns about the long-term viability of synthetic data, especially as AI-generated content becomes more prevalent online.
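The dynamic can be illustrated with a toy, one-dimensional version of the problem: repeatedly fit a simple distribution to samples drawn from the previous generation's fit, and watch its spread shrink. The simulation below is only an illustration of the effect, not the experiment reported in the Nature study.

```python
# A toy illustration of "model collapse" in one dimension: repeatedly fit a Gaussian to
# samples drawn from the previous generation's fitted Gaussian. Over many generations the
# estimated spread drifts toward zero, i.e., the distribution's tails disappear.

import numpy as np

rng = np.random.default_rng(42)
mean, std = 0.0, 1.0          # the "real data" distribution
samples_per_generation = 50

for generation in range(1, 501):
    data = rng.normal(mean, std, samples_per_generation)  # "train" on the previous model's output
    mean, std = data.mean(), data.std()                   # refit the model on that output
    if generation % 100 == 0:
        print(f"generation {generation:4d}: mean={mean:+.3f}, std={std:.4f}")
# Typical runs show std collapsing far below the original 1.0 as generations accumulate.
```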

Ethical considerations also loom large. De Padua warns of the "risk of the synthetic data not accurately representing the diversity of the real-world population, producing potential bias in models that fail to perform equitably across different demographic groups."
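A first-pass check for that kind of skew is to compare how often each group appears in the real and synthetic data. The sketch below uses invented group counts purely to show the shape of such a report.

```python
# A small representativeness check: compare group shares in real versus synthetic data.
# Groups and counts are invented for illustration.

import pandas as pd

real = pd.Series(["A"] * 500 + ["B"] * 300 + ["C"] * 200, name="group")
synthetic = pd.Series(["A"] * 700 + ["B"] * 250 + ["C"] * 50, name="group")

report = pd.DataFrame({
    "real_share": real.value_counts(normalize=True),
    "synthetic_share": synthetic.value_counts(normalize=True),
})
report["gap"] = report["synthetic_share"] - report["real_share"]
print(report.sort_values("gap"))  # large negative gaps flag under-represented groups
```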

The future of AI training

In critical applications like healthcare and autonomous vehicles, synthetic data can play a vital role. De Padua notes, "In healthcare, synthetic data can supplement real datasets, providing a wider range of scenarios for training models, leading to better diagnostic and predictive capabilities." For autonomous vehicles, he adds, "By using synthetic data for augmentation, models can be exposed to a wider range of conditions and edge cases that might not be present in the original dataset."
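As a loose illustration of that augmentation idea, the sketch below applies a few simple transforms (darkening, sensor noise, occlusion) to a stand-in image array. Production autonomous-driving systems rely on far richer simulators; these transforms only convey the concept of widening the range of conditions a model sees.

```python
# A minimal sketch of synthetic augmentation for edge cases, assuming images are plain
# NumPy arrays with values in [0, 1]. Illustrative only.

import numpy as np

rng = np.random.default_rng(7)

def darken(image: np.ndarray, gamma: float = 2.5) -> np.ndarray:
    """Simulate night-time or underexposed conditions."""
    return np.clip(image ** gamma, 0.0, 1.0)

def add_sensor_noise(image: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Simulate a noisy camera sensor."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def occlude(image: np.ndarray, size: int = 16) -> np.ndarray:
    """Simulate a partially blocked view (dirt, glare, obstruction)."""
    out = image.copy()
    y = rng.integers(0, image.shape[0] - size)
    x = rng.integers(0, image.shape[1] - size)
    out[y:y + size, x:x + size] = 0.0
    return out

image = rng.random((64, 64, 3))  # stand-in for a real driving frame
augmented = [darken(image), add_sensor_noise(image), occlude(image)]
print(len(augmented), "synthetic variants generated from one real frame")
```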

Looking to the future, de Padua believes synthetic data will likely supplement rather than replace real-world data in AI training. "The accuracy and representativeness of synthetic data are crucial. Technological advances in data generation algorithms will play a significant role in increasing the reliability of synthetic data," he explains.

As AI increasingly integrates into our daily lives, from healthcare diagnostics to self-driving cars, the balance between synthetic and real-world data in AI training will be crucial. The challenge for AI developers moving forward will be to harness the benefits of synthetic data while mitigating its risks.

"We're at a critical juncture in AI development," says Srivastava. "Getting the balance right between synthetic and real-world data will determine the future of AI-its capabilities, limitations and, ultimately, its impact on society."

Tech Reporter, IBM