Identification of key segments in document images

US10699112B1 · US · B1

Patent metadata
Publication number: US-10699112-B1
Application number: US-201816146562-A
Country: US
Kind code: B1
Filing date: Sep 28, 2018
Priority date: Sep 28, 2018
Publication date: Jun 30, 2020
Grant date: Jun 30, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method of automatically learning new keywords in a document image based on context such as when a never before seen keyword exists surrounded by other key-value pairs. A machine learning based approach leverages subword embeddings and two-dimensional geometric contexts in a gradient boosted trees classifier. Keys may be composed of multi-word strings or single-word strings.
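
To illustrate why subword embeddings help with a keyword that has never been seen before, here is a minimal Python sketch. It is not the patent's implementation. The names `char_ngrams`, `ngram_vector`, and `word_vector`, the 3-character grouping, the 8-dimensional size, and the hash-seeded pseudo-random vectors (a stand-in for a learned embedding table) are all assumptions made for illustration.

```python
import random
import zlib

DIM = 8    # hypothetical embedding size
NGRAM = 3  # hypothetical length of each character group


def char_ngrams(word, n=NGRAM):
    """Split a word into overlapping n-character groups, with boundary markers."""
    padded = f"<{word.lower()}>"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_vector(ngram, dim=DIM):
    """Stand-in for a learned lookup: a deterministic pseudo-random vector per n-gram."""
    rng = random.Random(zlib.crc32(ngram.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, n=NGRAM, dim=DIM):
    """Represent a word as the sum of the vectors of its character groups."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        for i, value in enumerate(ngram_vector(gram, dim)):
            total[i] += value
    return total


# An unseen variant still shares most character groups with a known key.
shared = set(char_ngrams("invoice")) & set(char_ngrams("invoices"))
```

Because "invoice" and "invoices" share most of their character groups, their summed vectors share most of their terms. That overlap is the property that lets a model generalize to key variants absent from the training set.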

First claim

Opening claim text (preview).

What is claimed is:
1. A computerized method for identifying keywords in a document image, comprising: (i) retrieving a document image from a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith; (ii) processing the document image to identify text segments contained within the document image; (iii) processing the text segments to identify subword embeddings associated with each of the text segments, wherein each of the subword embeddings associated with a text segment represents a character group in the document image, (iv) generating an n-dimensional vector for each text segment from its subword embeddings; (v) for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment; (vi) retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; (vii) associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document; and (viii) repeating steps (i) through (vii) for each document from the set of document images to generate a set of training documents.
2. The computerized method of claim 1 wherein each of the subword embeddings utilize a vector representation of one or more n-character groupings of a word, where n is a preselected integer and where a word is represented by a sum of the vector representations.
3. The computerized method of claim 1 wherein step (v) comprises: selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment.
4. The computerized method of claim 1 wherein step (v) comprises: selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and which overlap the identified text segment by greater than a preselected overlap amount.
5. The computerized method of claim 1 wherein step (v) comprises: selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and wherein the feature vector comprises a concatenation of vectors corresponding to the identified text segment and of a vector corresponding to each of the text segments that are positioned above, below, to the left and to the right of the identified text segment.
6. The computerized method of claim 1 further comprising providing the set of training documents to a supervised learning engine to generate a trained model.
7. A document processing system comprising: data storage for storing a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith; and a processor operatively coupled to the data storage and configured to execute instructions that when executed cause the processor to generate a set of training documents from at least a portion of the documents in the set of document images by, for each document in the portion of the documents in the set of document images: retrieving a document image from the data storage; processing the document image to identify text segments contained within the document image; processing the text segments to identify subword embeddings associated with each of the text segments, wherein each subword embedding associated with a text segment represents a character group in the document image, generating an n-dimensional vector for each text segment from its subword embeddings; for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment; retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; and associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document for the set of training documents.
8. The document processing system of claim 7 wherein the subword embeddings utilize a vector representation of one or more n-character groupings of a word, where n is a preselected integer and where a word is represented by a sum of the vector representations.
9. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to: select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment.
10. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to: select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment and which overlap the identified text segment by greater than a preselected overlap amount.
11. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to: select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and wherein the feature vector comprises a concatenation of vectors corresponding to the identified text segment and of a vector corresponding to each of the text segments that are positioned above, below, to the left and to the right of the identified text segment.
12. The document processing system of claim 7 further comprising instructions that when executed cause the processor to provide the set of training documents to a supervised learning engine to generate a trained model.
13. A computer program product for generating a set of training documents, the computer program product comprising a non-transitory computer readable storage medium and including instructions for causing the computer system to execute a method for generating a set of training documents, the method comprising the actions of: retrieving a document image from data storage which has stored thereon a set o
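
The local-context feature vector described in the claims (neighbors above, below, left, and right, filtered by an overlap threshold, then concatenated) can be sketched as follows. This is an illustrative reading, not the patented implementation. The `Segment` class, the bounding-box layout with y increasing downward, the overlap measure, the 0.5 threshold, the nearest-neighbor rule, and the zero padding for missing neighbors are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical text segment: bounding box plus its n-dimensional vector."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    vec: List[float]


def overlap_ratio(a0, a1, b0, b1):
    """Fraction of the shorter interval that the two intervals share."""
    shared = min(a1, b1) - max(a0, b0)
    shorter = min(a1 - a0, b1 - b0)
    return max(0.0, shared) / shorter if shorter > 0 else 0.0


def nearest(target, segments, direction, min_overlap=0.5) -> Optional[Segment]:
    """Closest segment in one direction whose overlap exceeds min_overlap."""
    best, best_gap = None, float("inf")
    for s in segments:
        if s is target:
            continue
        if direction in ("left", "right"):
            # Horizontal neighbors must share enough vertical extent.
            if overlap_ratio(target.y0, target.y1, s.y0, s.y1) <= min_overlap:
                continue
            gap = target.x0 - s.x1 if direction == "left" else s.x0 - target.x1
        else:
            # Vertical neighbors must share enough horizontal extent.
            if overlap_ratio(target.x0, target.x1, s.x0, s.x1) <= min_overlap:
                continue
            gap = target.y0 - s.y1 if direction == "above" else s.y0 - target.y1
        if 0 <= gap < best_gap:
            best, best_gap = s, gap
    return best


def feature_vector(target, segments, min_overlap=0.5):
    """Concatenate the target's vector with its four neighbors' vectors."""
    dim = len(target.vec)
    feats = list(target.vec)
    for direction in ("above", "below", "left", "right"):
        neighbor = nearest(target, segments, direction, min_overlap)
        feats.extend(neighbor.vec if neighbor else [0.0] * dim)
    return feats


# Toy page: a key, its value to the right, and another key below it.
page = [
    Segment("Invoice No", 0, 0, 100, 20, [1.0, 1.0]),
    Segment("12345", 120, 0, 200, 20, [2.0, 2.0]),
    Segment("Date", 0, 30, 100, 50, [3.0, 3.0]),
]
key_features = feature_vector(page[0], page)
```

With 2-dimensional segment vectors, each feature vector has 10 entries: the segment's own vector followed by one slot per direction. Pairing each such vector with its keyword annotation would yield the training examples that a supervised learner, such as the gradient boosted trees classifier named in the abstract, could consume.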

Assignees

Automation Anywhere Inc

Inventors

Classifications

  • G06F16/313 (Primary)

    Selection or weighting of terms for indexing

  • based on positionally close patterns or neighbourhood relationships

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10699112B1 cover?
A system and method of automatically learning new keywords in a document image based on context such as when a never before seen keyword exists surrounded by other key-value pairs. A machine learning based approach leverages subword embeddings and two-dimensional geometric contexts in a gradient boosted trees classifier. Keys may be composed of multi-word strings or single-word strings.
Who is the assignee on this patent?
Automation Anywhere Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/313. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 30, 2020 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).