07/01/2024 | News release | Distributed by Public on 07/02/2024 02:37
The concept of "garbage in, garbage out" (GIGO) has never been more relevant in our increasingly AI-driven world. GIGO applies when poor-quality input results in poor-quality output, and some of the early experiences with GenAI are a perfect example of this. Regardless of its type, an AI solution will always require complete, accurate, and timely data to deliver trusted outcomes.
As organizations across industries increasingly invest in AI, using the right data at the right time and in the right (and responsible) way is becoming more challenging. In this blog, I share a few of these challenges and propose a way forward: the use of synthetic (or artificially produced) data for AI.
AI and machine learning engines require vast amounts of data to be trained so they can perform their intended tasks. While data volume typically is not an issue, data usage is another story. Three problems associated with using organic data for AI are worth discussing.
Privacy. First, data usage is subject to significant regulation focused on protecting individual privacy. A clear example is the European Union's General Data Protection Regulation (GDPR). GDPR aims to ensure that personal information is handled in a responsible and secure manner, while also giving individuals more control over their data. It limits how data can be collected and used, as well as how long it can be stored. Because of regulations like GDPR, customer and employee data cannot be freely used to train AI engines. To legally use individual-based data, extensive anonymization is often required, which is both complicated and expensive. Even then, data anonymization does not guarantee security.
Copyright. Second, some data is copyrighted. There is much debate about the use of copyrighted data for AI. Given regulatory discussions happening across various government entities, we anticipate new guidelines to be released soon that will require organizations to clearly indicate which public data has been used to train a specific AI function or module.
Quality. Third, data can include a range of errors and biases, which may or may not be easy to correct, even if the data is organically produced. Further, identifying high-quality data among the volumes of available data can be burdensome and costly.
An alternative to organically produced data is synthetic data. Unlike data collected from real events, synthetic data is artificially generated. However, when generated well, it is designed to preserve the statistical properties of organic data and can therefore support the same statistical conclusions. This makes it very useful for AI solutions.
Synthetic data can be generated programmatically using a variety of techniques. With machine learning, for example, it's possible to produce synthetic data that mirrors the statistical properties of real-world data. Data also can be collected from real-life people, events, or objects via computer simulations or algorithms and converted to synthetic data. Data scientists take the real-world data, extract desired information, and convert it into synthetic datasets.
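To make the idea of "mirroring statistical properties" concrete, here is a minimal sketch in Python. It is not a production generator and not a description of any specific tool. It assumes a toy dataset with two numeric columns (age and monthly spend, both invented for illustration) and assumes a simple Gaussian model is an acceptable fit. The function names `fit_two_columns` and `sample_synthetic` are hypothetical. The sketch estimates the means, standard deviations, and correlation of the source data, then draws brand-new rows from a model with those same properties.

```python
import math
import random
import statistics


def fit_two_columns(rows):
    """Estimate the mean, standard deviation, and correlation of two numeric columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": cov / (std_x * std_y)}


def sample_synthetic(params, n, seed=0):
    """Draw n new rows from a bivariate Gaussian with the fitted properties."""
    rng = random.Random(seed)
    mean_x, mean_y = params["mean"]
    std_x, std_y = params["std"]
    rho = params["corr"]
    # Share of the second column's variation not explained by the first.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + std_x * z1, mean_y + std_y * (rho * z1 + residual * z2)))
    return rows


# Placeholder "real" data, invented for this sketch: (age, monthly spend).
source_rng = random.Random(42)
real_rows = []
for _ in range(2000):
    age = source_rng.gauss(40, 10)
    real_rows.append((age, 20 + 1.5 * age + source_rng.gauss(0, 8)))

real_params = fit_two_columns(real_rows)
synthetic_rows = sample_synthetic(real_params, n=5000, seed=1)
synthetic_params = fit_two_columns(synthetic_rows)

print("real:     ", real_params)
print("synthetic:", synthetic_params)
```

None of the synthetic rows corresponds to a row in the source data, yet the fitted statistics of the two datasets should be close. Machine-learning-based generators pursue the same goal with far more flexible models than the single Gaussian assumed here.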
Because there are no limitations on the type or size of synthetic data that can be generated (either from real-world data, including images, or from scratch), potential use cases abound. Synthetic data can be generated, for example, in healthcare to support research and development without compromising real-life patient data. It can be used in industries like retail and transportation to statistically mirror customer behavior and drive product and service innovation.
The use of synthetic data, however, comes with challenges. It may not be as precise as real-world data or perfectly reflect real-world scenarios. For example, outliers and low probability events, common in real-world datasets, are difficult to reproduce in synthetic data.
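The difficulty with rare events can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the "real" data is an invented stream of transaction amounts in which about 1% are unusually large, and the generator is a deliberately naive single Gaussian fitted to the overall mean and standard deviation. Better generators narrow the gap, but the same tension tends to remain.

```python
import random
import statistics

rng = random.Random(7)


def draw_real_amount():
    """Invented 'real' process: mostly ordinary amounts, with rare large ones."""
    if rng.random() < 0.01:
        return rng.gauss(1000, 100)  # rare, low-probability event
    return rng.gauss(100, 15)        # typical event


real = [draw_real_amount() for _ in range(10_000)]

# Naive synthetic generator: one Gaussian matching the overall mean and spread.
mu, sigma = statistics.fmean(real), statistics.stdev(real)
synthetic = [rng.gauss(mu, sigma) for _ in range(10_000)]


def tail_rate(values, threshold=500):
    """Fraction of values above the threshold."""
    return sum(v > threshold for v in values) / len(values)


real_tail = tail_rate(real)
synthetic_tail = tail_rate(synthetic)

print(f"mean: real={statistics.fmean(real):.1f}, synthetic={statistics.fmean(synthetic):.1f}")
print(f"share above 500: real={real_tail:.4f}, synthetic={synthetic_tail:.4f}")
```

The two datasets agree on summary statistics such as the mean, but the synthetic set contains far fewer large values than the source. A model trained only on that synthetic set would rarely, if ever, see the events that matter most in tasks such as fraud or fault detection.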
Synthetic data also can pose a security risk when used to support AI models. Malicious use of synthetic datasets, for example, can lead to AI models that are more vulnerable to security attacks.
Synthetic data is an exciting area of AI that resolves some of the biggest challenges in data management, such as privacy, data availability, and quality. Synthetic data can open new opportunities for exploring AI innovations, while maintaining a high level of data protection and regulatory compliance.
For organizations evaluating the use of synthetic data, we recommend the following:
The quality of synthetic data is an area in which I'm particularly interested. CGI is working on a new project related to this that is quite exciting. Feel free to contact me to discuss synthetic data, data usage, or AI in general. You also can explore CGI's AI capabilities and experience at cgi.com.
Jonas Forsman has more than 20 years of experience in designing, developing, testing, and implementing advanced technology solutions across industries using artificial intelligence, big data, data analytics, and business intelligence. He also has significant experience in research and innovation project management both within and outside ...