US2025217601A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025217601-A1 |
| Application number | US-202418401768-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 2, 2024 |
| Priority date | Jan 2, 2024 |
| Publication date | Jul 3, 2025 |
| Grant date | — |
Official abstract text for this publication.
According to one embodiment, a method, computer system, and computer program product for generating synthetic business documents for data augmentation is provided. The embodiment may include identifying a first set of key value pairs (KVPs) within a set of business documents spanning multiple domains. The embodiment may include creating domain-agnostic models of spatial distribution and content distribution of KVPs within the set. The embodiment may include grounding the domain-agnostic models of spatial distribution and content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set. The embodiment may include generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution. The embodiment may include augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.
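The abstract does not say how a content distribution is represented. As a minimal sketch, assuming that the "structural patterns" of claim 4 can be approximated by coarse character shapes (digits become `9`, letters become `A`, punctuation is kept), a per-key content model could be a frequency table of shapes. The names `value_shape`, `fit_content_model`, and `sample_value` are hypothetical and do not come from the publication.

```python
import random
import re
from collections import Counter, defaultdict


def value_shape(value: str) -> str:
    """Reduce a value to a coarse structural pattern (assumed representation)."""
    shape = re.sub(r"[0-9]", "9", value)     # every digit becomes 9
    shape = re.sub(r"[A-Za-z]", "A", shape)  # every ASCII letter becomes A
    return shape


def fit_content_model(kvps):
    """Count, for each key, how often each value shape occurs."""
    model = defaultdict(Counter)
    for key, value in kvps:
        model[key][value_shape(value)] += 1
    return model


def sample_value(model, key, rng):
    """Draw a shape for `key` by observed frequency, then fill it with random characters."""
    shapes, weights = zip(*model[key].items())
    shape = rng.choices(shapes, weights=weights, k=1)[0]
    return "".join(
        rng.choice("0123456789") if c == "9"
        else rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") if c == "A"
        else c  # separators such as '-' or '/' are kept as-is
        for c in shape
    )


# Hypothetical KVPs, invented for illustration only.
kvps = [
    ("Invoice No", "INV-2024-001"),
    ("Invoice No", "INV-2023-117"),
    ("Date", "01/02/2024"),
]
model = fit_content_model(kvps)
synthetic_date = sample_value(model, "Date", random.Random(0))
```

A sampled value keeps the separators and length of an observed value for the same key but carries no real data. Semantic patterns, which the claims also mention, are not covered by this sketch.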
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method, the method comprising: identifying a first set of key value pairs (KVPs) within a set of business documents spanning multiple domains; creating a domain-agnostic model of spatial distribution of KVPs within the set; creating a domain-agnostic model of content distribution of KVPs within the set; grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set; generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

2. The method of claim 1, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

3. The method of claim 1, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

4. The method of claim 1, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

5. The method of claim 1, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

6. The method of claim 1, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.

7. The method of claim 1, further comprising: training the LLM using the augmented training data set for performance of domain-specific document understanding tasks.

8. A computer system, the computer system comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage medium, and program instructions stored on at least one of the one or more tangible storage medium for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising: creating a domain-agnostic model of spatial distribution of KVPs within the set; creating a domain-agnostic model of content distribution of KVPs within the set; grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set; generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

9. The computer system of claim 8, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

10. The computer system of claim 8, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

11. The computer system of claim 8, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

12. The computer system of claim 8, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

13. The computer system of claim 8, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.

14. The computer system of claim 8, further comprising: training the LLM using the augmented training data set for performance of domain-specific document understanding tasks.

15. A computer program product, the computer program product comprising: one or more computer-readable tangible storage medium and program instructions stored on at least one of the one or more tangible storage medium, the program instructions executable by a processor capable of performing a method, the method comprising: creating a domain-agnostic model of spatial distribution of KVPs within the set; creating a domain-agnostic model of content distribution of KVPs within the set; grounding the domain-agnostic model of spatial distribution and the domain-agnostic model of content distribution using a second set of domain-specific KVPs to derive domain-specific models of spatial distribution and content distribution of KVPs within the second set; generating a set of synthetic domain-specific business documents using the derived domain-specific models of spatial distribution and content distribution; and augmenting a training data set of a large language model (LLM) with the set of synthetic domain-specific business documents.

16. The computer program product of claim 15, wherein identifying the first set of KVPs comprises parsing the set of business documents to extract KVPs via a technique selected from the group consisting of optical character recognition (OCR), intelligent character recognition (ICR), and named entity recognition (NER).

17. The computer program product of claim 15, wherein spatial distributions of the KVPs of the first set and the KVPs of the second set are modeled as multivariate Gaussian distributions.

18. The computer program product of claim 15, wherein modeling content distributions of KVPs comprises identifying structural and semantic patterns for displaying values corresponding to different keys.

19. The computer program product of claim 15, wherein the second set of domain-specific KVPs is a subset of the first set of KVPs.

20. The computer program product of claim 15, wherein a document of the set of synthetic domain-specific business documents is generated by sampling locations of different domain-specific KVPs from the derived domain-specific model of spatial distribution such that there is no overlap in placement of the domain-specific KVPs within the document, and wherein a location of a domain-specific KVP is represented by a vector which specifies the positional coordinates of a bounding box which corresponds to the domain-specific KVP.
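The claims recite Gaussian-modeled bounding-box locations sampled so that KVPs do not overlap, but they give no procedure. The sketch below is one possible reading, not the applicant's method. It assumes a box vector of the form `(x, y, width, height)`, a diagonal covariance (each coordinate sampled independently, a simplification of a general multivariate Gaussian), and rejection sampling to enforce the no-overlap condition. All function names and sample coordinates are hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_box_model(boxes):
    """Fit a per-coordinate (mean, std) to observed (x, y, w, h) vectors.

    Assumption: diagonal covariance, so coordinates are treated as independent.
    """
    columns = zip(*boxes)
    return [(mean(col), pstdev(col)) for col in columns]


def overlaps(a, b):
    """Return True if two (x, y, w, h) boxes share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def sample_layout(models, rng, max_tries=1000):
    """Sample one box per key, rejecting draws that overlap boxes already placed."""
    placed = {}
    for key, params in models.items():
        for _ in range(max_tries):
            box = tuple(rng.gauss(mu, sigma) for mu, sigma in params)
            if box[2] <= 0 or box[3] <= 0:
                continue  # discard degenerate boxes
            if not any(overlaps(box, other) for other in placed.values()):
                placed[key] = box
                break
        else:
            raise RuntimeError(f"could not place {key!r} without overlap")
    return placed


# Hypothetical observed boxes per key, invented for illustration only.
models = {
    "Invoice No": fit_box_model([(50, 40, 120, 20), (55, 42, 118, 20), (48, 38, 122, 22)]),
    "Total": fit_box_model([(400, 700, 100, 20), (405, 705, 98, 20), (398, 698, 102, 22)]),
}
layout = sample_layout(models, random.Random(0))
```

Rejection sampling is simple but can fail when many keys crowd the same region, which is why the sketch raises an error after `max_tries` instead of looping forever. A full-covariance model would also capture correlations between position and size that the diagonal assumption ignores.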
Semantic analysis · CPC title
Parsing · CPC title
Named entity recognition · CPC title
Character recognition · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title