Grammar-based automated generation of annotated synthetic form training data for machine learning

US10970530B1 · US · B1

Patent metadata
Publication number: US-10970530-B1
Application number: US-201816189633-A
Country: US
Kind code: B1
Filing date: Nov 13, 2018
Priority date: Nov 13, 2018
Publication date: Apr 6, 2021
Grant date: Apr 6, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for grammar-based automated generation of annotated synthetic form training data for machine learning are described. A training data generation engine utilizes a defined grammar to construct a layout for a form, select key-value units to place within the layout, and select attribute variants for the key-value units. The form is rendered and stored at a storage location, where it can be provided along with other similarly-generated forms to be used as training data for a machine learning model.
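The pipeline in the abstract (grammar, then layout, then key-value units, then attribute variants) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the patent: the toy `GRAMMAR`, the `STYLE_VARIANTS` table, and the `expand` and `generate_form` helpers. Rendering and storage are left out.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternative
# productions; symbols without an entry are terminals (key-value unit names).
GRAMMAR = {
    "form": [["header", "body"], ["header", "body", "footer"]],
    "header": [["title"]],
    "body": [["name", "address"], ["name", "date_of_birth", "address"]],
    "footer": [["signature", "date"]],
}

# Hypothetical attribute variants that a key-value unit may take.
STYLE_VARIANTS = {
    "font_size": [9, 10, 12],
    "border": ["none", "solid", "dashed"],
    "key_position": ["left", "top"],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a flat list of terminal unit names."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    units = []
    for child in production:
        units.extend(expand(child, rng))
    return units


def generate_form(rng):
    """Pick sections from the grammar, then units, then a style per unit."""
    sections = rng.choice(GRAMMAR["form"])
    form = []
    for section in sections:
        units = [
            {
                "key": unit,
                "style": {name: rng.choice(opts) for name, opts in STYLE_VARIANTS.items()},
            }
            for unit in expand(section, rng)
        ]
        form.append({"section": section, "units": units})
    return form


# Example run: a seeded generator makes the sampled form reproducible.
rng = random.Random(0)
for section in generate_form(rng):
    print(section["section"], [u["key"] for u in section["units"]])
```

Calling `generate_form` repeatedly with the same `rng` yields varied layouts from one grammar, which is the property that makes a grammar useful for producing a large and diverse training set.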

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: generating a plurality of images including representations of a corresponding plurality of forms, wherein for each image of the plurality of images the generating comprises: determining, based on a defined grammar, a plurality of document sections for the form, selecting, for each of the plurality of document sections based on the defined grammar, one or more key-value units to be placed in the document section, selecting, for each of the selected one or more key-value units, one or more style attributes for the key-value unit based on a random value, and placing the one or more key-value units within the document section with the selected style attributes; storing the plurality of images to a storage location along with a corresponding plurality of annotations; and providing the plurality of images and the plurality of annotations to be used to train a machine learning (ML) model.

2. The computer-implemented method of claim 1, wherein for each image of the plurality of images the generating further comprises: generating, for at least one of the one or more key-value units of the image, a value to be placed within the at least one key-value unit; determining, based on a second random value, a characteristic comprising at least one of a location, a font size, a font, or a font style for the value; and placing the value within the at least one key-value unit according to the determined characteristic.

3. The computer-implemented method of claim 1, wherein the defined grammar specifies that at least one pair of key-value units that are to be placed adjacent to one another.

4. A computer-implemented method comprising: generating a plurality of documents, wherein for each document of the plurality of documents the generating comprises: determining, based on a defined grammar, a plurality of sections for the document, selecting, for each of the plurality of sections based on the defined grammar, one or more key-value units to be placed in the section, and selecting, for each of the selected one or more key-value units, one or more style attributes and placing the one or more key-value units within the section; and storing the plurality of documents to a storage location.

5. The computer-implemented method of claim 4, wherein the plurality of documents includes representations of forms that are stored as image files.

6. The computer-implemented method of claim 5, further comprising: generating a plurality of annotation data structures corresponding to the plurality of documents, each annotation data structure indicating at least locations of the one or more key-value units within the corresponding document; and storing the plurality of annotation data structures along with the plurality of documents at the storage location.

7. The computer-implemented method of claim 6, further comprising: obtaining the plurality of annotation data structures and the plurality of documents from the storage location; and utilizing the plurality of annotation data structures and the plurality of documents to train a machine learning (ML) model.

8. The computer-implemented method of claim 4, further comprising: selecting, based on a random value, the defined grammar from a plurality of defined grammars.

9. The computer-implemented method of claim 4, further comprising: for at least one of the one or more key-value units of a document, obtaining a key; determining, based on a randomization, one or more style attributes for the key; obtaining a value; determining, based on a randomization, one or more style attributes for the value; and placing the key and the value within the at least one key-value unit according to the one or more style attributes for the key and according to the one or more style attributes for the value.

10. The computer-implemented method of claim 9, wherein the one or more style attributes for the key include one or more of: a stride amount between characters of the key; a font size; or a font.

11. The computer-implemented method of claim 9, wherein: obtaining the key comprises selecting the key from a dictionary of keys; and obtaining the value comprises generating the value based on the key.

12. The computer-implemented method of claim 4, wherein: the selecting of each of the one or more style attributes for at least one of the key-value units is based on a random value; the one or more style attributes for at least one of the key-value units include at least one of: a width of a line; a style of a line; a color of a line; a background fill color; a margin or padding amount; a position of the key; or a position of the value.

13. The computer-implemented method of claim 4, wherein the generating further comprises: for one of the plurality of documents, determining the plurality of sections for the document includes identifying a hierarchy of sections, wherein a first section of the hierarchy includes a second section of the hierarchy that is to be placed within the first section.

14. A system comprising: a storage service implemented by a first one or more electronic devices; and a training data generation engine implemented by a second one or more electronic devices, the training data generation engine including instructions that upon execution cause the training data generation engine to: generate a plurality of documents, wherein for each document of the plurality of documents, the training data generation engine is to: determine, based on a defined grammar, a plurality of sections for the document, select, for each of the plurality of sections, one or more key-value units, and select, for each of the selected one or more key-value units, one or more style attributes and place the one or more key-value units within the section; and store the plurality of documents to a storage location provided by the storage service.

15. The system of claim 14, wherein the plurality of documents includes representations of forms and are stored as image files.

16. The system of claim 15, wherein the training data generation engine is further to: generate a plurality of annotation data structures corresponding to the plurality of documents, each annotation data structure indicating at least locations of the one or more key-value units within the document; and store the plurality of annotation data structures along with the plurality of documents at the storage location.

17. The system of claim 16, further comprising another service of a same provider network as the training data generation engine and the storage service, the another service including instructions that upon execution cause the another service to: obtain the plurality of annotation data structures and the plurality of documents from the storage location; and utilize the plurality of annotation data structures and the plurality of documents to train a machine learning (ML) model.

18. The system of claim 14, wherein the training data generation engine is further to: select, according to a second random value, the defined grammar from a plurality of defined grammars.

19. The system of claim 14, wherein the training data generation engine is further to: for at least one of the one or more key-value units of a document, obtain a key; determine one or more style attributes for the key based on a random value; obtain a value; determine one or more style attributes for the value based on a random value; and place t
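Claims 6, 9 through 11, and 16 describe keys drawn from a dictionary, values generated from the key, randomized style attributes for each, and annotation data structures recording where each key-value unit sits. The Python sketch below is one possible reading under stated assumptions: `KEY_DICTIONARY`, `FONTS`, the top-to-bottom stacking rule, the row-height formula, and the JSON annotation shape are all hypothetical choices, not details from the patent.

```python
import json
import random

# Hypothetical dictionary of keys, each paired with a generator that
# produces a plausible value for that key.
KEY_DICTIONARY = {
    "Name": lambda rng: rng.choice(["Ada Lovelace", "Alan Turing"]),
    "Date": lambda rng: f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}/2018",
    "Total": lambda rng: f"${rng.randint(1, 999)}.00",
}

FONTS = ["serif", "sans-serif", "monospace"]  # assumed font families
FONT_SIZES = [9, 10, 12]  # assumed sizes in points


def make_unit(key, rng):
    """Choose random style attributes separately for the key and the value."""
    return {
        "key": {"text": key, "font": rng.choice(FONTS), "font_size": rng.choice(FONT_SIZES)},
        "value": {
            "text": KEY_DICTIONARY[key](rng),
            "font": rng.choice(FONTS),
            "font_size": rng.choice(FONT_SIZES),
        },
    }


def layout_and_annotate(keys, rng, page_width=600, margin=20, row_gap=8):
    """Stack units top to bottom and record a bounding box for each one."""
    annotations = []
    y = margin
    for key in keys:
        unit = make_unit(key, rng)
        # Assumed row height: twice the larger of the two font sizes.
        height = 2 * max(unit["key"]["font_size"], unit["value"]["font_size"])
        unit["bbox"] = {"x": margin, "y": y, "width": page_width - 2 * margin, "height": height}
        annotations.append(unit)
        y += height + row_gap
    return annotations


# Example run: the annotation is plain data, so it serializes to JSON and
# could be stored next to the rendered image.
rng = random.Random(42)
annotation = layout_and_annotate(["Name", "Date", "Total"], rng)
print(json.dumps(annotation, indent=2))
```

Because the generator decides every position itself, the annotation is exact by construction, so no manual labeling step is needed for the synthetic forms.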

Assignees

Inventors

Classifications

  • G06F40/174 · Primary

    Form filling; Merging · CPC title

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title

  • characterised by the process organisation or structure, e.g. boosting cascade · CPC title

  • Indexing; Data structures therefor; Storage structures (for retrieval from the web G06F16/951) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10970530B1 cover?
Techniques for grammar-based automated generation of annotated synthetic form training data for machine learning are described. A training data generation engine utilizes a defined grammar to construct a layout for a form, select key-value units to place within the layout, and select attribute variants for the key-value units. The form is rendered and stored at a storage location, where it can …
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/174. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 6, 2021 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).