20/11/2024 | News release | Distributed by Public on 20/11/2024 11:34
PatCID, a new database from IBM, uses document understanding models to search patents for molecular structures, helping businesses stay on top of what is state of the art, and opening new possibilities for materials discovery.
Discovering new materials is difficult enough, but today's chemists often run into a vexing problem along the way: it is hard to find out which materials have already been described in patent documents, because these documents are full of chemical structures that are drawn as images rather than identified by name. With new patent documents being published every week, researchers face a formidable task when identifying the molecular structures contained in them.
To address these problems, a team of scientists at IBM Research has developed an open-access searchable database of chemical structures in patent documents. It's called PatCID (which stands for Patent-extracted Chemical-structure Images database for Discovery), and it contains 81 million structural images of 14 million unique chemical structures. The goal is to make it easier for today's chemists to develop tomorrow's materials.
In tests on a random set of molecules, PatCID outperformed four other chemistry databases, including both manually and automatically created ones. The team published its results in Nature Communications, and the tool is available on GitHub for anyone to try, thanks to IBM's Deep Search team. [1] Japanese technology company and longtime IBM Research partner JSR Corporation has been working with the tool, using it to assess the patent landscape of its competitors in the semiconductor industry. PatCID builds on IBM Research's history of collaboration with JSR, which includes using quantum computing to simulate the behavior of new candidate molecules to be used in microchip manufacturing.
Performing this kind of competitive landscape research manually is not only tedious, but it often yields incomplete results. IBM Research scientists Lucas Morin, Valery Weber, Gerhard Ingmar Meijer, and their team developed PatCID to make the task both easier and more reliable. Their tool is powered by document understanding models that process patent documents in three steps: document segmentation, image classification, and chemical structure recognition. The first model, called DECIMER-Segmentation, locates chemical images in documents; the second, newly developed by this IBM Research team, is called MolClassifier and classifies the molecular structure images; and the third, a tool called MolGrapher, which the team released last year, converts the images into graphs and stores them in an industry-standard format. [2]
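The three-step flow described above can be sketched as a simple chain of functions. This is only an illustration of the data flow, not the actual interfaces of DECIMER-Segmentation, MolClassifier, or MolGrapher: the function names, the `ExtractedStructure` record, and the `"molecule"` label are all hypothetical, and the three stages are passed in as plain callables so that any real model could be wrapped behind them.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class ExtractedStructure:
    """One recognized structure and where it was found (hypothetical record)."""
    document_id: str
    page: int
    structure: str  # machine-readable encoding, e.g. a SMILES string


def extract_structures(
    document_id: str,
    pages: Iterable[bytes],
    segment: Callable[[bytes], List[bytes]],  # step 1: page -> cropped images
    classify: Callable[[bytes], str],         # step 2: crop -> image class label
    recognize: Callable[[bytes], str],        # step 3: crop -> structure string
    keep_labels: FrozenSet[str] = frozenset({"molecule"}),  # assumed label name
) -> List[ExtractedStructure]:
    """Run segmentation, classification, and recognition over every page."""
    results = []
    for page_number, page_image in enumerate(pages, start=1):
        for crop in segment(page_image):
            # Skip crops the classifier does not label as a wanted class.
            if classify(crop) not in keep_labels:
                continue
            results.append(
                ExtractedStructure(document_id, page_number, recognize(crop))
            )
    return results
```

Keeping the stages as injected callables means each one can be swapped or tested on its own, which is one reasonable way to organize a multi-model pipeline like this.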
When chemists start developing a new molecule, one of the first things they do is perform something called a prior-art search. This is a way of making sure that the molecule they're thinking about is indeed novel: that no one else has already published about it in the scientific literature, and that it doesn't already appear in a patent document. "It turns out that in patent documents in the chemistry domain, most molecules are described in molecular-structure images rather than by the name of the molecules," says Meijer. That problem presented an opportunity: The team wanted to see if they could make an easy-to-use searchable database for molecule structures, one that could outperform the currently available search tools.
PatCID works in two ways, explains Morin. "On the one hand, a user can search for a molecule, using identity, substructure or similarity search. Documents that mention this molecule will then be retrieved. On the other hand, a user can search for a document and retrieve the molecules displayed in the document," he says. For example, a user may be interested in all the patent documents published by a specific company and want to extract the molecules they contain.
"Basically, PatCID links molecules and their source documents, and users can search on both sides," Morin adds.
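The two-sided lookup Morin describes can be pictured as a pair of mirrored indexes. The sketch below is a toy illustration and not PatCID's actual implementation: the class and method names are hypothetical, molecules are assumed to be keyed by an already-canonicalized identifier string (such as a canonical SMILES), and the similarity function is a plain Tanimoto score over made-up feature sets. Substructure search is left out because it needs a real cheminformatics toolkit.

```python
from collections import defaultdict
from typing import Dict, List, Set


class MoleculeDocumentIndex:
    """Toy two-way index between molecule identifiers and document IDs."""

    def __init__(self) -> None:
        self._docs_by_molecule: Dict[str, Set[str]] = defaultdict(set)
        self._molecules_by_doc: Dict[str, Set[str]] = defaultdict(set)

    def add(self, molecule: str, document_id: str) -> None:
        # Record the link in both directions so either side can be searched.
        self._docs_by_molecule[molecule].add(document_id)
        self._molecules_by_doc[document_id].add(molecule)

    def documents_for(self, molecule: str) -> List[str]:
        """Identity search: documents that show exactly this molecule."""
        return sorted(self._docs_by_molecule.get(molecule, ()))

    def molecules_in(self, document_id: str) -> List[str]:
        """Document search: molecules displayed in this document."""
        return sorted(self._molecules_by_doc.get(document_id, ()))


def tanimoto(features_a: Set[int], features_b: Set[int]) -> float:
    """Tanimoto similarity: shared features divided by all features."""
    union = features_a | features_b
    return len(features_a & features_b) / len(union) if union else 0.0


index = MoleculeDocumentIndex()
index.add("CCO", "US-0000001")       # placeholder document IDs
index.add("CCO", "US-0000002")
index.add("c1ccccc1", "US-0000002")
print(index.documents_for("CCO"))
print(index.molecules_in("US-0000002"))
```

A similarity search would then rank stored molecules by `tanimoto` against the query's feature set and return the documents linked to the top hits.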
Simplicity is key for clients like JSR, says Meijer, because companies in the field average a patent a day, and each one is easily 100 pages or more. "To sift through all that manually is largely impossible," he says.
Beyond just finding what competitors are patenting, PatCID can also help with the discovery of new molecules, ones that are more likely to perform their desired functions. Part of this process is a search for molecular substructures that are common in the industry. And by cataloguing these substructures, industrial chemists can find which molecular components are most often used together.
"For a particular combination, if there is a large number of images of those molecule substructures in patents, then likely they are easier to synthesize because more chemists are talking about them," Meijer points out. "If there is only one or two occurrences, then maybe it's not as technically feasible to synthesize molecules with these substructures."
PatCID lets researchers build a co-occurrence matrix from all the potential pairings of substructures, rank them by predicted ease of synthesis, and pick out a few that aren't in the literature yet. Chemists can then attempt to synthesize molecules that include these components.
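A rough sketch of that workflow follows. It is an assumption-laden illustration, not the team's method: each molecule is represented as a set of substructure labels (the labels here are invented), occurrence counts stand in for ease of synthesis in the spirit of Meijer's remark, and the scoring rule for unseen pairs (the smaller of the two individual counts) is a hypothetical choice made only to keep the example short.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def cooccurrence_counts(molecules: Iterable[Set[str]]) -> Tuple[Counter, Counter]:
    """Count how often each substructure, and each pair of them, appears."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for substructures in molecules:
        unique = sorted(set(substructures))  # sorted so pair keys are canonical
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair


def unseen_pairs_ranked(single: Counter, pair: Counter) -> List[Tuple[Pair, int]]:
    """List pairs never seen together, most promising first.

    Assumed score: the smaller of the two individual counts, so a pair
    ranks highly only if both of its parts are common on their own.
    """
    candidates = []
    for a, b in combinations(sorted(single), 2):
        if pair[(a, b)] == 0:
            candidates.append(((a, b), min(single[a], single[b])))
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Invented toy data: each set lists the substructures found in one molecule.
molecules = [
    {"phenyl", "ester"},
    {"phenyl", "ester", "fluoro"},
    {"phenyl", "sulfonyl"},
    {"ester", "fluoro"},
]
single, pair = cooccurrence_counts(molecules)
for (a, b), score in unseen_pairs_ranked(single, pair):
    print(a, b, score)
```

In practice the substructure sets would come from a cheminformatics toolkit run over the extracted molecules, and the score would likely be something more informed than a raw count.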
As capable as PatCID is, it's just the start. Part of what makes the prior-art search so complicated is that many patent documents in the chemistry domain contain so-called Markush structures, which can befuddle AI search tools. A Markush structure is a sort of template that leaves some of the groups in a molecule defined as a category, so the patent effectively covers multiple possible molecules. These groups are defined in accompanying text, and an AI model can have trouble parsing this combination.
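To make the template idea concrete, here is a minimal sketch of how one Markush-style template expands into several specific molecules. It is a deliberate simplification: the template is a SMILES-like string with `{R1}` and `{R2}` placeholders, the substituent lists are invented, and plain string substitution stands in for real chemistry-aware enumeration. The function name is hypothetical.

```python
from itertools import product
from typing import Dict, List


def enumerate_markush(template: str, groups: Dict[str, List[str]]) -> List[str]:
    """Expand a placeholder template into every combination of its groups."""
    names = sorted(groups)
    molecules = []
    # product() yields one choice per variable group, covering all combinations.
    for choice in product(*(groups[name] for name in names)):
        molecules.append(template.format(**dict(zip(names, choice))))
    return molecules


# A benzene ring with two variable positions; the group definitions play the
# role of the accompanying patent text ("R1 is methyl or ethyl", and so on).
template = "{R1}c1ccc({R2})cc1"
groups = {"R1": ["C", "CC"], "R2": ["F", "Cl", "O"]}

for smiles in enumerate_markush(template, groups):
    print(smiles)
```

Two options for one group and three for the other already give six molecules; real Markush claims often define groups as broad categories, which is why the number of covered molecules grows so quickly and why the text and image have to be read together.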
"For humans, it's already often not so straightforward to understand what actually is claimed with a Markush structure," Meijer says. "For an AI model, it will be even more difficult to understand what is described in the text section, and then combine what it reads in the text with what is depicted in the accompanying image."
This task would require a multimodal model, and it's the apogee of automated chemistry understanding, he says. Over the next three years, this is exactly what Meijer, Weber, Morin, and their colleagues plan to work on.