Synthetic data set generation of chemical illustrations

US12482540B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12482540-B2
Application number: US-202318124638-A
Country: US
Kind code: B2
Filing date: Mar 22, 2023
Priority date: Mar 22, 2023
Publication date: Nov 25, 2025
Grant date: Nov 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The invention is notably directed to a computer-implemented method to generate a synthetic data set. The synthetic data set comprises a plurality of chemical documents and each of the plurality of chemical documents comprises a respective set of chemical objects. The method comprises a step of receiving configuration data, the configuration data comprising a set of configuration parameters for the data set. The method further comprises performing, in an iterative manner, for each of the plurality of chemical documents the steps of generating, by a structure generation module, a respective document structure for each of the respective chemical documents in accordance with the configuration data, generating, by a content generation module, the respective set of chemical objects for the respective chemical document and arranging the respective set of chemical objects on the respective chemical document.
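
The loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `Config`, `ChemicalDocument`, `SMILES_POOL`, `generate_structure`, `generate_content`, `arrange`, and `generate_dataset` are all hypothetical, the configuration parameters are invented for the example, and a short hard-coded list of SMILES strings stands in for a chemical database.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical configuration parameters for the data set."""
    num_documents: int = 3
    objects_per_document: int = 4
    structures: tuple = ("tabular", "synthesis", "retrosynthesis", "circular")
    seed: int = 0


@dataclass
class ChemicalDocument:
    structure: str
    objects: list = field(default_factory=list)  # (slot_index, smiles) pairs


# Hypothetical stand-in for a chemical database.
SMILES_POOL = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "O=C=O"]


def generate_structure(config, rng):
    """Structure generation step: pick a document structure per the config."""
    return ChemicalDocument(structure=rng.choice(config.structures))


def generate_content(config, rng):
    """Content generation step: draw the chemical objects for one document."""
    return [rng.choice(SMILES_POOL) for _ in range(config.objects_per_document)]


def arrange(doc, objects):
    """Arrangement step: assign each object to a slot on the document."""
    doc.objects = list(enumerate(objects))


def generate_dataset(config):
    """Receive configuration data, then build each document iteratively."""
    rng = random.Random(config.seed)
    dataset = []
    for _ in range(config.num_documents):
        doc = generate_structure(config, rng)
        objects = generate_content(config, rng)
        arrange(doc, objects)
        dataset.append(doc)
    return dataset
```

Seeding the random generator from the configuration is a design choice made for this sketch so that the same configuration reproduces the same data set; the abstract does not state whether the actual method does this.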

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method to generate a synthetic data set, the synthetic data set comprising a plurality of chemical documents, each of the plurality of chemical documents comprising a respective set of chemical objects, the method comprising: receiving configuration data, the configuration data comprising a set of configuration parameters for the synthetic data set, wherein the synthetic data set is a large-scale object detection, segmentation, and captioning data set, which facilitates use of the generated synthetic data set to train artificial intelligence (AI) models; performing, in an iterative manner, for each of the plurality of chemical documents: generating, by a structure generation module, a chemical illustration atlas for a respective document structure for each of the plurality of chemical documents in accordance with the configuration data, wherein cascading style sheets (CSS) are used to diversify an appearance of each of the plurality of chemical documents; generating, by a content generation module, the respective set of chemical objects as a plurality of images with complex chemical structures for each of the plurality of chemical documents; and arranging the respective set of chemical objects on each of the plurality of chemical documents.

2. The computer-implemented method according to claim 1, wherein the respective set of chemical objects are selected from a group consisting of: molecules, Markush structures, and reaction setups.

3. The computer-implemented method according to claim 1, wherein the respective document structure is selected from a group consisting of: a tabular structure, a synthesis structure, a retrosynthesis structure, and a circular reactions structure.

4. The computer-implemented method according to claim 1, wherein each of the plurality of chemical documents are embodied as single page documents or single image documents.

5. The computer-implemented method according to claim 1, wherein performing, in an iterative manner, for each of the plurality of chemical documents, further comprises: selecting randomly, for each of the plurality of chemical documents, a style template from a set of style templates, the style template comprising style information; and applying the style information of the style template on each of the plurality of chemical documents.

6. The computer-implemented method according to claim 1, wherein generating the respective set of chemical objects for each of the plurality of chemical documents, further comprises: selecting randomly a chemical object file from a chemical database.

7. The computer-implemented method according to claim 6, wherein generating the respective set of chemical objects for each of the plurality of chemical documents, further comprises: generating a conformer for the respective set of chemical objects; applying a random rotation to the conformer; and applying random depiction parameters to the conformer.

8. The computer-implemented method according to claim 4, wherein the single page documents or single image documents comprise an object file in SMILES-format.

9. The computer-implemented method according to claim 1, further comprising randomly injecting noise objects into each of the plurality of chemical documents.

10. The computer-implemented method according to claim 9, wherein the noise objects are selected from a group consisting of: arrows, captions, text labels, plots, and/or technical drawings.

11. The computer-implemented method according to claim 1, further comprising: converting each of the plurality of chemical documents into Hypertext Markup Language (HTML), thereby generating a set of HTML chemical documents.

12. The computer-implemented method according to claim 11, further comprising loading the set of HTML chemical documents in a headless browser; and saving the set of HTML chemical documents in a graphics format.

13. The computer-implemented method according to claim 12, wherein the graphics format is selected from a group consisting of: PNG-format, TIFF-format, or JPEG-format.

14. The computer-implemented method according to claim 1, further comprising: training an application for analysis of each of the plurality of chemical documents, the method comprising: receiving the synthetic data set; and training a cognitive model with the synthetic data set.

15. A system for performing a computer-implemented method for generating a synthetic data set for applications for analysis of chemical documents, the synthetic data set comprising a plurality of chemical documents, each of the plurality of chemical documents comprising a respective set of chemical objects, the system comprising a processor and a computer readable memory, the system being configured to: receive configuration data, the configuration data comprising a set of configuration parameters for the synthetic data set, wherein the synthetic data set is a large-scale object detection, segmentation, and captioning data set, which facilitates use of the generated synthetic data set to train artificial intelligence (AI) models; perform, in an iterative manner, for each of the plurality of chemical documents: generate, by a structure generation module, a chemical illustration atlas for a respective document structure for each of the plurality of chemical documents in accordance with the configuration data, wherein cascading style sheets (CSS) are used to diversify an appearance of each of the plurality of chemical documents; generate, by a content generation module, the respective set of chemical objects as a plurality of images with complex chemical structures for each of the plurality of chemical documents; and arrange the respective set of chemical objects on each of the plurality of chemical documents.

16. The system according to claim 15, wherein the respective set of chemical objects are selected from a group consisting of: molecules, Markush structures, and reaction setups.

17. The system according to claim 15, wherein the respective document structure is selected from a group consisting of: a tabular structure, a synthesis structure, a retrosynthesis structure, and a circular reactions structure.

18. The system according to claim 15, wherein each of the plurality of chemical documents are embodied as single page documents or single image documents.

19. The system according to claim 15, wherein performing, in an iterative manner, for each of the plurality of chemical documents, further comprises: selecting randomly, for each of the plurality of chemical documents, a style template from a set of style templates, the style template comprising style information; and applying the style information of the style template on each of the plurality of chemical documents.

20. A computer program product for generating a synthetic data set for applications for analysis of chemical documents by a system comprising a processor and computer readable memory, the synthetic data set comprising a plurality of chemical documents, each of the plurality of chemical documents comprising a respective set of chemical objects, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by the system to cause the system to perform a method comprising: receiving configuration data, the configuration data comprising a set of configurat
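
Claims 5, 9, and 11 describe picking a random style template, injecting noise objects, and converting each document to HTML. The Python sketch below shows one way those steps could fit together, as an illustration rather than the claimed implementation. Everything named here is an assumption: `STYLE_TEMPLATES`, `NOISE_KINDS`, `render_html`, the `noise_probability` parameter, the CSS values, and the class names are hypothetical, and the image paths are placeholders for already-rendered chemical object images. The headless-browser rasterization of claim 12 is not shown.

```python
import html
import random

# Hypothetical style templates; each maps CSS properties to values.
STYLE_TEMPLATES = [
    {"font-family": "serif", "background": "#ffffff", "border": "1px solid #000"},
    {"font-family": "sans-serif", "background": "#f4f4f4", "border": "none"},
    {"font-family": "monospace", "background": "#fffbe6", "border": "2px dashed #666"},
]

# Noise object kinds, following the list in claim 10.
NOISE_KINDS = ["arrow", "caption", "text-label", "plot", "technical-drawing"]


def render_html(image_paths, rng, noise_probability=0.3):
    """Build one HTML page: random CSS style, chemical images, random noise."""
    style = rng.choice(STYLE_TEMPLATES)
    css = "; ".join(f"{prop}: {value}" for prop, value in style.items())

    elements = []
    for path in image_paths:
        safe_path = html.escape(path, quote=True)
        elements.append(f'<img class="chem-object" src="{safe_path}">')
        # Randomly follow a chemical object with a placeholder noise element.
        if rng.random() < noise_probability:
            kind = rng.choice(NOISE_KINDS)
            elements.append(f'<div class="noise {kind}"></div>')

    body = "\n".join(elements)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><style>body {{ {css} }}</style></head>\n"
        f"<body>\n{body}\n</body></html>"
    )
```

A call such as `render_html(["mol_0.png", "mol_1.png"], random.Random(0))` returns a single page as a string; varying the random generator's seed varies both the chosen style and the noise placement.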

Assignees

IBM

Inventors

Classifications

  • Templates

  • Formatting, i.e. changing of presentation of documents (automatic justification G06F40/189; automatic line break hyphenation G06F40/191)

  • Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]

  • Machine learning, data mining or chemometrics

  • G16C20/80 (Primary): Data visualisation

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12482540B2 cover?
The invention is notably directed to a computer-implemented method to generate a synthetic data set. The synthetic data set comprises a plurality of chemical documents and each of the plurality of chemical documents comprises a respective set of chemical objects. The method comprises a step of receiving configuration data, the configuration data comprising a set of configuration parameters for …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G16C20/80. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).