Generation of synthetic documents for data augmentation

US2025217601A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025217601-A1
Application number  US-202418401768-A
Country             US
Kind code           A1
Filing date         Jan 2, 2024
Priority date       Jan 2, 2024
Publication date    Jul 3, 2025
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

According to one embodiment, a method, computer system, and computer program product for generating synthetic business documents for data augmentation is provided. The embodiment may include identifying a first set of key value pairs (KVPs) within a set of business documents spanning multiple domains. The embodiment may include creating domain-agnostic models of spatial distribution and content distribution of KVPs within the set. The embodiment may include grounding the domain-agnostic models of spatial distribution and content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set. The embodiment may include generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution. The embodiment may include augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.
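The abstract does not say how the content distribution of KVPs is modeled. As a minimal sketch of one possible reading, the Python below reduces each value to a structural pattern (digits become `9`, letters become `A`), counts patterns per key, and samples synthetic values from those counts. The function names (`value_pattern`, `fit_content_model`, `sample_value`) and the toy KVPs are hypothetical and are not taken from the patent; semantic patterns are not covered by this sketch.

```python
import random
import string
from collections import Counter, defaultdict


def value_pattern(value: str) -> str:
    """Map digits to '9' and letters to 'A'; keep punctuation and spaces."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def fit_content_model(kvps):
    """Count value patterns per key: {key: Counter({pattern: count})}."""
    model = defaultdict(Counter)
    for key, value in kvps:
        model[key][value_pattern(value)] += 1
    return model


def sample_value(model, key, rng):
    """Pick a pattern for `key` in proportion to its count, then fill it in."""
    patterns = model[key]
    pattern = rng.choices(list(patterns), weights=list(patterns.values()))[0]
    return "".join(
        rng.choice(string.digits) if ch == "9"
        else rng.choice(string.ascii_uppercase) if ch == "A"
        else ch
        for ch in pattern
    )


# Toy KVPs (made up for illustration, not from any real document).
kvps = [
    ("Invoice No", "INV-00123"),
    ("Invoice No", "INV-00456"),
    ("Date", "2024-01-02"),
    ("Total", "1,250.00"),
]
model = fit_content_model(kvps)
rng = random.Random(0)
for key in model:
    print(key, "->", sample_value(model, key, rng))
```

A pattern counter like this only preserves the shape of a value (length, separators, digit and letter positions); a fuller model would also need to capture which values are plausible for a key.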

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, the method comprising:
    identifying a first set of key value pairs (KVPs) within a set of business documents spanning multiple domains;
    creating a domain-agnostic model of spatial distribution of KVPs within the set;
    creating a domain-agnostic model of content distribution of KVPs within the set;
    grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set;
    generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and
    augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

2. The method of claim 1, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

3. The method of claim 1, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

4. The method of claim 1, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

5. The method of claim 1, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

6. The method of claim 1, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.

7. The method of claim 1, further comprising: training the LLM using the augmented training data set for performance of domain-specific document understanding tasks.

8. A computer system, the computer system comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage medium, and program instructions stored on at least one of the one or more tangible storage medium for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising:
    creating a domain-agnostic model of spatial distribution of KVPs within the set;
    creating a domain-agnostic model of content distribution of KVPs within the set;
    grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set;
    generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and
    augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

9. The computer system of claim 8, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

10. The computer system of claim 8, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

11. The computer system of claim 8, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

12. The computer system of claim 8, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

13. The computer system of claim 8, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.

14. The computer system of claim 8, further comprising: training the LLM using the augmented training data set for performance of domain-specific document understanding tasks.

15. A computer program product, the computer program product comprising: one or more computer-readable tangible storage medium and program instructions stored on at least one of the one or more tangible storage medium, the program instructions executable by a processor capable of performing a method, the method comprising:
    creating a domain-agnostic model of spatial distribution of KVPs within the set;
    creating a domain-agnostic model of content distribution of KVPs within the set;
    grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set;
    generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and
    augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

16. The computer program product of claim 15, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

17. The computer program product of claim 15, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

18. The computer program product of claim 15, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

19. The computer program product of claim 15, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

20. The computer program product of claim 15, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.
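The claims describe spatial distributions modeled as multivariate Gaussians and a layout step that samples KVP bounding boxes with no overlap, but they do not say how the sampling is carried out. The sketch below is one assumed approach: each key has a Gaussian over a box vector `(x0, y0, x1, y1)` in page-normalized coordinates, and boxes are drawn by rejection sampling until they are valid and do not intersect boxes already placed. All names (`fit_gaussian`, `cholesky`, `sample_layout`, `diag`) and all numbers are hypothetical, and the coordinate convention is an assumption.

```python
import math
import random


def fit_gaussian(boxes):
    """Sample mean and covariance of box vectors (needs at least 2 boxes)."""
    n, d = len(boxes), len(boxes[0])
    mean = [sum(b[i] for b in boxes) / n for i in range(d)]
    cov = [[sum((b[i] - mean[i]) * (b[j] - mean[j]) for b in boxes) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a, jitter=1e-9):
    """Lower-triangular L with L @ L.T approximately a (jitter added on the diagonal)."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(a[i][i] + jitter - s, jitter))
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def sample_box(mean, chol, rng):
    """Draw one box vector as mean + L @ z with z standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return tuple(m + sum(chol[i][k] * z[k] for k in range(i + 1))
                 for i, m in enumerate(mean))


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def is_valid(box):
    """Box must have positive size and lie inside the unit page."""
    x0, y0, x1, y1 = box
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0


def sample_layout(models, rng, max_tries=1000):
    """Place one box per key, rejecting invalid or overlapping draws."""
    placed = {}
    for key, (mean, cov) in models.items():
        chol = cholesky(cov)
        for _ in range(max_tries):
            box = sample_box(mean, chol, rng)
            if is_valid(box) and not any(overlaps(box, o) for o in placed.values()):
                placed[key] = box
                break
        else:
            raise RuntimeError(f"could not place {key!r} without overlap")
    return placed


def diag(v, d=4):
    """d x d diagonal matrix with v on the diagonal."""
    return [[v if i == j else 0.0 for j in range(d)] for i in range(d)]


# Toy per-key models (made-up numbers): mean box and a diagonal covariance.
models = {
    "Invoice No": ([0.10, 0.10, 0.40, 0.15], diag(1e-4)),
    "Total": ([0.60, 0.80, 0.90, 0.85], diag(1e-4)),
}
layout = sample_layout(models, random.Random(0))
for key, box in layout.items():
    print(key, [round(c, 3) for c in box])
```

Rejection sampling is simple but can stall when many keys compete for the same region of the page, which is why the sketch caps the number of attempts and raises an error instead of looping forever.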

Assignees

  • IBM

Inventors

Classifications

  • Semantic analysis · CPC title

  • Parsing · CPC title

  • Named entity recognition · CPC title

  • Character recognition · CPC title

  • G06F40/40 · Primary

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025217601A1 cover?
According to one embodiment, a method, computer system, and computer program product for generating synthetic business documents for data augmentation is provided. The embodiment may include identifying a first set of key value pairs (KVPs) within a set of business documents spanning multiple domains. The embodiment may include creating domain-agnostic models of spatial distribution and content…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/40. Mapped technology areas include Physics.
When was this patent published?
Published Jul 3, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).