System and method for information extraction with character level features

US11055527B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11055527-B2
Application number: US-201916265519-A
Country: US
Kind code: B2
Filing date: Feb 1, 2019
Priority date: Feb 1, 2019
Publication date: Jul 6, 2021
Grant date: Jul 6, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for information extraction with character level features. The system and method may be used for data extraction for various types of content including a receipt or a tax form.
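As a rough illustration of what character-level labels make possible, the sketch below groups labeled characters into extracted fields. The "B-"/"I-"/"O" label scheme, the `TOTAL` entity name, and the `extract_spans` function are assumptions made for this example; the claims mention labels for a beginning and a middle character of a word but do not prescribe this encoding.

```python
from typing import List, Optional, Tuple


def extract_spans(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Group characters into (entity, value) pairs from per-character labels.

    Hypothetical label scheme: "B-X" marks the first character of entity X,
    "I-X" marks a following character of X, and "O" marks a character that
    belongs to no entity.
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")

    spans: List[Tuple[str, str]] = []
    current_entity: Optional[str] = None
    current_chars: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_entity is not None:
            spans.append((current_entity, "".join(current_chars)))

    for ch, label in zip(text, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_entity):
            # A "B" label, or an "I" label for a different entity, starts a new span.
            flush()
            current_entity, current_chars = entity, [ch]
        elif prefix == "I":
            # Continuation of the span that is currently open.
            current_chars.append(ch)
        else:
            # "O": close any open span.
            flush()
            current_entity, current_chars = None, []
    flush()
    return spans


# Eleven characters of OCR text, one label per character.
ocr_text = "TOTAL 12.50"
char_labels = ["O"] * 6 + ["B-TOTAL"] + ["I-TOTAL"] * 4
print(extract_spans(ocr_text, char_labels))  # [('TOTAL', '12.50')]
```

Labeling characters rather than words means a field can still be recovered when the OCR output splits or merges words unexpectedly, which is one plausible motivation for a character-level approach.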

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for extracting data, the method comprising: receiving a stream of text from a piece of content the stream of text being received from an optical character recognition process; assigning, using a machine learning information extraction process executed by a processor of the computer, an entity label to each alphanumeric character a stream of text, the entity label providing a character level prediction of a relevance of each alphanumeric character to a piece of content, wherein the stream of text is an output from an optical character recognition process on the piece of content; and extracting, by the processor of the computer, one or more pieces of structured data from the piece of content based on the assigned alphanumeric character entity labels.

2. The method of claim 1, wherein assigning the entity label to each alphanumeric character further comprises performing a machine learning bi directional long short term memory executed by the processor of the computer to generate a plurality of features that each predict a probability that a particular alphanumeric character corresponds to a particular entity label.

3. The method of claim 2, wherein assigning the entity label to each alphanumeric character further comprises performing a machine learning conditional random field by the processor of the computer to generate a plurality of character entity labels based on the probability that a particular alphanumeric character corresponds to a particular entity label.

4. The method of claim 3, wherein assigning the entity label to each alphanumeric character further comprises training the machine learning information extraction process.

5. The method of claim 4, wherein training the machine learning information extraction process further comprises applying a heuristic rule that trains the machine learning information extraction process.

6. The method of claim 1 further comprising assigning, using a machine learning information extraction process executed by a processor of the computer, an entity label to each word in the stream of text.

7. The method of claim 1, wherein the piece of content further comprises an image of the piece of content and the piece of content further comprises one of a receipt and a tax form.

8. The method of claim 7 further comprising generating, by a camera of a computing device, the image of the piece of content.

9. The method of claim 1, wherein the entity label further comprises one of a beginning alphanumeric character in a word and a middle alphanumeric character of the word.

10. The method of claim 1, wherein extracting one or more pieces of structured data from the piece of content based on the assigned character entity labels further comprises using a bidirectional long short term memory machine learning model to extract the one or more pieces of structured data.

11. An apparatus, comprising: a computer based document understanding platform having a processor that executes a plurality of lines of instructions and configured to: receive a stream of text from a piece of content the stream of text being received from an optical character recognition process; assign, using a machine learning information extraction process an entity label to each alphanumeric character in a stream of text, the entity label providing a character level prediction of a relevance of each alphanumeric character to a piece of content, wherein the stream of text is an output from an optical character recognition process on the piece of content; and extract one or more pieces of structured data from the piece of content based on the assigned alphanumeric character entity labels.

12. The apparatus of claim 11, wherein the processor is further configured to perform a machine learning bi directional long short term memory to generate a plurality of features that each predict a probability that a particular alphanumeric character corresponds to a particular entity label.

13. The apparatus of claim 12, wherein the processor is further configured to perform a machine learning conditional random field by the processor of the computer to generate a plurality of character entity labels based on the probability that a particular alphanumeric character corresponds to a particular entity label.

14. The apparatus of claim 13, wherein the processor is further configured to train the machine learning information extraction process.

15. The apparatus of claim 14, wherein the processor is further configured to apply a heuristic rule that trains the machine learning information extraction process.

16. The apparatus of claim 11, wherein the processor is further configured to assign, using a machine learning information extraction process executed by a processor of the computer, an entity label to each word in the stream of text.

17. The apparatus of claim 11, wherein the piece of content further comprises an image of the piece of content and the piece of content further comprises one of a receipt and a tax form.

18. The apparatus of claim 17 further comprising a computing device having a camera connected to the document understanding platform, wherein the camera captures the image of the piece of content.

19. The apparatus of claim 11, wherein the entity label further comprises one of a beginning alphanumeric character in a word and a middle alphanumeric character of the word.

20. The apparatus of claim 11, the processor is further configured to use a bidirectional long short term memory machine learning model to extract the one or more pieces of structured data.
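Claims 2 and 3 describe a two-stage labeler: a bidirectional LSTM yields per-character label probabilities, and a conditional random field turns those into a label sequence. The sketch below shows only the second stage, using Viterbi decoding, a standard way to pick the best sequence under a linear-chain CRF. The patent text on this page does not say which decoding procedure is used, so this is an assumption, and the `viterbi` function, the three labels, and all scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y] is the log-score of label y at character t (standing in
    for the per-character output of a BiLSTM). transitions[(prev, y)] is the
    log-score of moving from prev to y; pairs not listed score 0.0.
    """
    if not emissions:
        return []

    # best[y]: score of the best path that ends in label y at the current position.
    best = {y: emissions[0][y] for y in labels}
    backpointers: List[Dict[str, str]] = []

    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Pick the previous label that maximizes path score into y.
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = best[prev] + transitions.get((prev, y), 0.0) + scores[y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up scores for the three characters " 12". Taken one character at a
# time, the best labels would be O, I, I, but "I" directly after "O" is
# penalized, so the decoder prefers O, B, I.
labels = ["O", "B", "I"]
emissions = [
    {"O": -0.1, "B": -3.0, "I": -3.0},
    {"O": -2.0, "B": -1.0, "I": -0.5},
    {"O": -3.0, "B": -2.0, "I": -0.2},
]
transitions = {("O", "I"): -10.0}
print(viterbi(emissions, transitions, labels))  # ['O', 'B', 'I']
```

The transition scores are what a CRF layer adds over independent per-character predictions: they let the model rule out sequences such as a "middle" character with no preceding "beginning" character.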

Assignees

Intuit Inc

Inventors

Classifications

  • G06F40/174 · Primary

    Form filling; Merging · CPC title

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title

  • Classification techniques · CPC title

  • using neural networks · CPC title

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11055527B2 cover?
A system and method for information extraction with character level features. The system and method may be used for data extraction for various types of content including a receipt or a tax form.
Who is the assignee on this patent?
Intuit Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/174 (Form filling; Merging). Mapped technology areas include Physics.
When was this patent published?
Published Jul 6, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).