IBM - International Business Machines Corporation

19/08/2024 | News release | Distributed by Public on 20/08/2024 21:07

Introducing KVP10k: A comprehensive dataset for key-value pair extraction in business documents

19 Aug 2024
Technical note

Developed by a team at IBM Research, the open-source KVP10k dataset is designed to tackle one of the most challenging problems in document understanding: extracting key-value pairs from a wide range of business documents without predefined keys. The KVP10k dataset will be showcased at the 2024 ICDAR conference.

What exactly are key-value pairs and why are they important?

Key-value pairs (KVPs) are a data representation method where each "key" is linked to a specific "value." In the context of business documents like invoices, purchase orders, or forms, key-value pairs are used to organize and store information efficiently. The key represents the identifier or label (such as address, invoice number, or total amount) and the value provides the actual data corresponding to that key (for example, the street address, a unique invoice identifier, or the amount due). While there can be cases where multiple values are connected to one key, our focus is primarily on a one-to-one matching of keys to values.
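
As a minimal illustration of this representation, the Python sketch below models the one-to-one pairs of a hypothetical invoice. The class name, field names, and values are invented for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyValuePair:
    """One key linked to one value, both as text read from a document."""
    key: str    # the label printed on the page, e.g. "Invoice number"
    value: str  # the data that the label refers to


# Hypothetical invoice content, invented for illustration only.
invoice_pairs = [
    KeyValuePair("Address", "12 Example Street"),
    KeyValuePair("Invoice number", "INV-0001"),
    KeyValuePair("Total amount", "150.00"),
]

# With one-to-one matching, the pairs collapse into a plain lookup table.
invoice_lookup = {pair.key: pair.value for pair in invoice_pairs}
```

A list of pairs keeps the door open for the one-to-many case mentioned above, while the dictionary view only works when each key has exactly one value.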

Extracting KVPs turns the content of varied document formats into structured data, improving efficiency and reducing errors. KVPs support automated data entry, enable interoperability between different business systems, and facilitate comprehensive data analysis and reporting. This makes them a requirement for modern, data-driven businesses aiming for operational efficiency and accurate data management.

What sets KVP10k apart?

Unlike existing datasets that focus on Key Information Extraction (KIE) with predefined keys, KVP10k introduces a new challenge: extracting KVPs across a variety of templates and complex layouts, without relying on predefined keys. This opens up the possibility of truly dynamic information extraction, which is crucial for handling the real-world documents found in the legal, financial, and healthcare sectors.
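
The contrast between the two settings can be sketched in a few lines of Python. Both functions below are hypothetical toy extractors that work on text lines of the form `label: value`. Real documents require layout-aware models, so this only illustrates why a fixed key list misses labels it has never seen.

```python
PREDEFINED_KEYS = {"invoice number", "total amount"}  # assumed fixed schema


def split_line(line):
    """Split 'label: value' into (label, value); return None if no colon."""
    label, sep, value = line.partition(":")
    if not sep:
        return None
    return label.strip(), value.strip()


def extract_predefined(lines):
    """KIE-style: keep only pairs whose label is in the fixed key list."""
    pairs = {}
    for line in lines:
        parsed = split_line(line)
        if parsed and parsed[0].lower() in PREDEFINED_KEYS:
            pairs[parsed[0]] = parsed[1]
    return pairs


def extract_open(lines):
    """Open-key style: keep every label found, whatever it is called."""
    return dict(parsed for parsed in map(split_line, lines) if parsed)


lines = ["Invoice number: INV-0001", "Patient ID: 42", "Total amount: 150.00"]
fixed = extract_predefined(lines)  # drops "Patient ID", which is not in the schema
open_ = extract_open(lines)        # keeps all three labels
```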

The dataset contains over 10,000 richly annotated images, making it one of the most extensive of its kind. Each image in KVP10k has been meticulously annotated to provide both the keys and values along with their interrelations, reflecting the complexity and diversity of real-world examples. This detailed annotation supports more accurate model training, which is crucial for developing robust document processing systems.
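
To make the idea of keys, values, and their interrelations concrete, here is a minimal Python sketch of how such an annotation could be represented. The schema is an assumption made for illustration: the class names, the `bbox` convention, and the `links` format are hypothetical and do not describe the actual KVP10k file format.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    entity_id: int
    role: str    # "key" or "value"
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in pixels; assumed convention


@dataclass
class PageAnnotation:
    entities: list
    links: list  # (key_id, value_id) tuples; assumed convention

    def pairs(self):
        """Resolve links into (key text, value text) tuples."""
        by_id = {entity.entity_id: entity for entity in self.entities}
        return [(by_id[k].text, by_id[v].text) for k, v in self.links]


# Hypothetical annotated page, invented for illustration only.
page = PageAnnotation(
    entities=[
        Entity(0, "key", "Invoice number", (40, 30, 160, 48)),
        Entity(1, "value", "INV-0001", (170, 30, 250, 48)),
        Entity(2, "key", "Total amount", (40, 60, 150, 78)),
        Entity(3, "value", "150.00", (170, 60, 230, 78)),
    ],
    links=[(0, 1), (2, 3)],
)
resolved = page.pairs()
```

Storing links separately from entities is one way to keep the door open for keys without values, or keys with several values.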

As shown in the figure below, KVP10k stands out among various datasets by providing a significantly higher number of documents, entities, keys, values, and links, enabling more comprehensive and effective training for document processing models.

A comparative overview of KVP10k versus other datasets, comparing the number of documents, entities, keys, values, and links.

Businesses today are inundated with vast amounts of unstructured data, much of which comes in the form of documents like invoices, contracts, and reports. The ability to efficiently extract and use the information contained within these documents can significantly enhance decision-making and operational efficiency. KVP10k addresses this need by providing a dataset that mirrors the complexity of real-world documents, including variations in layout, terminology, and structure.

Research and practical applications

KVP10k isn't just a dataset; it's also a benchmark for evaluating the performance of information extraction models. It includes a challenging mix of elements from both KIE and KVP extraction tasks, offering a comprehensive framework for developing and testing new models. This makes it an invaluable resource for researchers aiming to push the boundaries of what's possible in document understanding technologies.
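
This article does not describe the benchmark's scoring protocol. As a generic illustration only, the Python sketch below scores predicted pairs against gold pairs with exact-match precision, recall, and F1, a common choice for extraction tasks. The function name and the sample data are hypothetical.

```python
def score_pairs(predicted, gold):
    """Return (precision, recall, f1) for exact-match key-value pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: the second prediction has the right key but a wrong value.
gold = [("Invoice number", "INV-0001"), ("Total amount", "150.00")]
predicted = [("Invoice number", "INV-0001"), ("Total amount", "105.00")]
precision, recall, f1 = score_pairs(predicted, gold)
```

Exact matching is strict: a pair counts only when both key and value text agree, so a real evaluation might relax this with text normalization or location-based matching.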

For practitioners in the field, the diverse and richly annotated dataset offers a realistic testing ground for refining algorithms and systems designed to process complex documents. By providing a broad array of document types and detailed annotations, KVP10k helps train models that are not only accurate, but also adaptable to various industries and document types. An example of the annotation is shown in the figure below.

The team created a fine-tuned version of the Mistral 7B AI language model using the KVP10k dataset. This ready-to-use model exemplifies the practical application of the dataset, offering a robust baseline for other developers to improve upon.

An example of an annotated image in the KVP10k dataset.

KVP10k sets a new standard for datasets in the domain of document information extraction. With its focus on KVP extraction without predefined keys and its inclusion of real-world document complexities, it offers a unique resource that promises to advance the state of the art in document analysis. As the open-source community begins to use KVP10k, we anticipate a new wave of technologies capable of transforming business document processing.