Extracting structured information from a document containing filled form images

US10755039B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10755039-B2
Application number  US-201816192028-A
Country             US
Kind code           B2
Filing date         Nov 15, 2018
Priority date       Nov 15, 2018
Publication date    Aug 25, 2020
Grant date          Aug 25, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and process for extracting information from filled form images is described. In one example the claimed invention first extracts textual information and the hierarchy in a blank form. This information is then used to extract and understand the content of filled forms. In this way, the system does not have to analyze from the beginning each filled form. The system is designed so that it remains as generic as possible. The number of hard coded rules in the whole pipeline was minimized to offer an adaptive solution able to address the largest number of forms, with various structures and typography. The system is also created to be integrated as a built-in function in a larger pipeline. The form understanding pipeline could be the starting point of any advanced Natural Language Processing application.
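The two-stage idea in the abstract (analyze the blank form once, then reuse that result for every filled copy) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `Line` record, the `build_template` and `extract_answers` helpers, exact text comparison after lowercasing, and the simple "nearest question above or on the same row" distance rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """A hypothetical OCR text line with the position of its top-left corner."""
    text: str
    x: float  # left edge, increasing to the right
    y: float  # top edge, increasing downward


def _key(text: str) -> str:
    # Simplified comparison key; a real system would need fuzzier matching.
    return text.strip().lower()


def build_template(blank_lines):
    """Stage 1: record the text of every line found on the blank form."""
    return {_key(line.text) for line in blank_lines}


def extract_answers(template, filled_lines):
    """Stage 2: treat lines absent from the template as answer candidates and
    attach each one to the closest question located above it or on its row."""
    questions = [l for l in filled_lines if _key(l.text) in template]
    candidates = [l for l in filled_lines if _key(l.text) not in template]
    answers = {}
    for cand in candidates:
        preceding = [q for q in questions if q.y <= cand.y]
        if not preceding:
            continue  # no question above this candidate: leave it unmatched
        nearest = min(preceding,
                      key=lambda q: (cand.y - q.y) + abs(cand.x - q.x))
        answers.setdefault(nearest.text, []).append(cand.text)
    return answers


blank = [Line("Name", 0, 0), Line("Date of birth", 0, 20)]
filled = [
    Line("Name", 0, 0), Line("Jane Doe", 60, 0),
    Line("Date of birth", 0, 20), Line("1990-01-01", 60, 20),
]
result = extract_answers(build_template(blank), filled)
```

In this toy example `result` maps "Name" to "Jane Doe" and "Date of birth" to "1990-01-01". The point of the design is that the blank form is processed once and the stored template is reused for each filled form.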

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-based method of extracting information from an image of a filled form document, the computer-based method comprising: accessing an image of at least one page of a filled form document, wherein the filled form document includes textual content with textual questions and textual answers and at least one graphical line separating a portion of the textual content; extracting textual content in the image into a set of text lines and extracting a structural layout of the textual content, wherein the structure layout includes a grouping textual content; creating a compositional hierarchy of textual content and the structural layout; assigning a form type to the image based on the compositional hierarchy of the textual content and structural layout; based on the form type being assigned to a known form type, performing vertical merging of two or more lines in the set of text lines based on a relative position of the textual content and an absence of the at least one graphical line separating the two or more lines; comparing the textual content with the form type to identify a set of textual questions; matching a textual answer to a textual question in the set of textual questions by a relative position of the textual question to a textual answer; and creating a logical view of each textual answer including an identifier to the textual question that has been matched.

2. The computer-based method of claim 1, further comprising: displaying each textual answer that has been matched to the textual question on a form type with a visual appearance distinct from a visual appearance of the set of textual questions on the form type.

3. The computer-based method of claim 2, wherein the visual appearance includes highlighting, font type, font size, italics, underlining or a combination thereof.

4. The computer-based method of claim 2, wherein the displaying is updated based on a user selecting at least one of an confidence threshold value for the extracting the textual content, a confidence threshold for determining a textual answer to each textual question in the set of textual questions or a combination thereof.

5. A computer-based method of extracting information from an image of a document with textual content, the computer-based method comprising: accessing an image of at least one page of a document with textual content according to a structural layout separating portions of the textual content; extracting the textual content from the image into a set of text lines and extracting a structural layout of the textual content from the image; creating a compositional hierarchy of the textual content and the structural layout; assigning a form type to the image based on a compositional hierarchy; in response to the form type being assigned to a known form type, performing a vertical merging to combine text lines in the set of text lines based on a relative position of the text lines; matching text in the form type to a subset of text lines to identify a known set of text that includes a set of known textual questions and a set of textual answer candidates; determining a textual question to an answer candidate from the set of known textual questions by a relative position of the textual question to the set of textual answer candidates; and using the form type to create a logical data structure of each textual answer.

6. The computer-based method of claim 5, wherein the matching text in the form type to the set of text lines to identify a known set of text that includes a set of known textual questions is independent of coordinate locations for text in the set of text lines in the image.

7. The computer-based method of claim 5, wherein the matching text in the form type to the set of text lines to identify a known set of text that includes a set of known textual questions which are similar in two different coordinate locations in the image for the form type.

8. The computer-based method of claim 5, wherein the determining a textual question to an answer candidate from the set of known textual questions by a relative position of the textual question to the set of textual answer candidates includes a null textual question to answer candidate match.

9. The computer-based method of claim 5, wherein the matching text in the form type to the set of text lines to identify a known set of text that includes a set of known textual questions and a set of textual answer candidates further includes a set of known textual content for titles, sections, sub-sections, instructions or a combination thereof.

10. The computer-based method of claim 5, wherein the matching text in the form type to the set of text lines to identify a known set of text that includes a set of known textual questions includes using a Levenshtein distance between text in the form type and the set of text lines.

11. The computer-based method of claim 5, wherein the matching text in the form type to the set of text lines to identify a known set of text that includes a set of known textual questions includes prior to the matching text performing includes at least one converting the set of text lines to a one case, removing character spaces in the set of text lines, removing back to line characters in the set of text lines or a combination thereof.

12. The computer-based method of claim 5, wherein the determining a textual answer to each known textual question in the set of known textual questions by a relative position of the known textual question to a set of textual answer candidates includes discarding the textual answer which includes at least one of begins with a space character, begins with a back of line character, ends with a space character, or a combination thereof.

13. The computer-based method of claim 5, wherein the determining a textual answer to each known textual question in the set of known textual questions by a relative position of the known textual question to a set of textual answer candidates includes using machine learning to identifying the textual answers using classifications based on at least one of location features, typographical features, visual features, density features, or a combination thereof.

14. The computer-based method of claim 5, wherein the performing a vertical merging to combine text lines in the set of text lines based on a relative position of the text lines and an absence of at least one graphical line in between the text lines.

15. The computer-based method of claim 5, wherein the performing a vertical merging to combine text lines in the set of text lines based on a relative position of the text lines includes using machine learning to identify the text lines to combine using classification based on at least one of location features, typographical features, visual features, density features or a combination thereof.

16. The computer-based method of claim 5, wherein the using the form type to create a logical data structure of each textual answer includes logical data structure of each known textual question.

17. The computer-based method of claim 5, wherein the extracting the textual content from the image into a set of text lines and extracting a structural layout of the textual content from the image includes detecting graphical lines prior to detecting text.

18. The computer-based method of claim 5, wherein the extracting the textual content from the image into a set of text lines and extracting a structural layout of the textual content from the image includes detecting tickbox characters with a tickbox classifier us
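Claims 10 and 11 mention normalizing text lines (one case, no spaces, no line breaks) and comparing them to the form type's known text with a Levenshtein distance. The following Python sketch shows one way such a comparison could look. The function names, the `max_ratio` threshold, and the rule of accepting the single closest question are illustrative assumptions, not details from the patent.

```python
def normalize(text: str) -> str:
    """Lowercase the text and drop all whitespace, including line breaks."""
    return "".join(text.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_question(line, known_questions, max_ratio=0.2):
    """Return the known question closest to `line`, or None when even the
    closest one differs by more than `max_ratio` of its normalized length.
    `max_ratio` is a hypothetical tuning parameter."""
    norm_line = normalize(line)
    best, best_dist = None, None
    for question in known_questions:
        dist = levenshtein(norm_line, normalize(question))
        if best_dist is None or dist < best_dist:
            best, best_dist = question, dist
    if best is None:
        return None
    limit = max_ratio * max(len(normalize(best)), 1)
    return best if best_dist <= limit else None


known = ["Date of birth", "Name"]
noisy_match = match_question("Date 0f birth:", known)  # OCR-style noise
no_match = match_question("Jane Doe", known)           # an answer, not a question
```

Here `noisy_match` is "Date of birth" (edit distance 2 after normalization) and `no_match` is `None`. Because the comparison uses only text, it does not depend on where a line sits on the page, which is in the spirit of claim 6.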

Assignees

IBM

Inventors

Classifications

  • G06F40/137 · Primary

    Hierarchical processing, e.g. outlines · CPC title

  • using gradient analysis · CPC title

  • removing elements interfering with the pattern to be recognised · CPC title

  • by performing operations on regions, e.g. growing, shrinking or watersheds · CPC title

  • G06F40/174 · Primary

    Form filling; Merging · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10755039B2 cover?
A system and process for extracting information from filled form images is described. In one example the claimed invention first extracts textual information and the hierarchy in a blank form. This information is then used to extract and understand the content of filled forms. In this way, the system does not have to analyze from the beginning each filled form. The system is designed so that it…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/137. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 25, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).