What technology area does this patent fall under?

Primary CPC classification G06V30/414. Mapped technology areas include Physics.

When was this patent published?

Publication date: Aug 25, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Hierarchical information extraction using document segmentation and optical character recognition correction

US10755093B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10755093-B2
Application number	US-201715620733-A
Country	US
Kind code	B2
Filing date	Jun 12, 2017
Priority date	Jan 27, 2012
Publication date	Aug 25, 2020
Grant date	Aug 25, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and media for extracting and processing entity data included in an electronic document are provided herein. Methods may include executing one or more extractors to extract entity data within an electronic document based upon an extraction model for the document, selecting extracted entity data via one or more experts, each of the experts applying at least one business rule to organize at least a portion of the selected entity data into a desired format, and providing the organized entity data for use by an end user.
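The flow described in the abstract (extractors pull entity data, then experts apply business rules to organize it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the regex-based extractors, the `Entity` and `Slot` types, and the `invoice_expert` function are hypothetical and are not taken from the patent. The slot idea loosely follows the first claim, where a business rule is a set of slots whose properties condition how they are filled.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entity:
    kind: str  # hypothetical entity type, e.g. "date" or "amount"
    text: str  # the matched text from the document


@dataclass
class Slot:
    name: str
    accepts: Callable[[Entity], bool]  # property that conditions filling
    value: Optional[Entity] = None


# Hypothetical extractors: each pairs an entity kind with a regex.
EXTRACTORS = [
    ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("amount", re.compile(r"\$\d+(?:\.\d{2})?")),
]


def run_extractors(text: str) -> List[Entity]:
    """Run every extractor over the text and collect the matches."""
    return [Entity(kind, m.group())
            for kind, rx in EXTRACTORS
            for m in rx.finditer(text)]


def invoice_expert(entities: List[Entity]) -> dict:
    """Apply one illustrative business rule, modeled as a set of slots."""
    slots = [
        Slot("invoice_date", lambda e: e.kind == "date"),
        Slot("total", lambda e: e.kind == "amount"),
    ]
    for entity in entities:
        for slot in slots:
            # Fill the first empty slot whose property the entity matches.
            if slot.value is None and slot.accepts(entity):
                slot.value = entity
                break
    return {
        # The rule counts as validated only when every slot is filled.
        "valid": all(s.value is not None for s in slots),
        "fields": {s.name: s.value.text if s.value else None for s in slots},
    }


record = invoice_expert(run_extractors("Invoice dated 2017-06-12, total due $149.50."))
```

In this sketch `record` holds the organized output; a real system would presumably run several experts with different rules and serialize the result, for example as XML.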

First claim

Opening claim text (preview).

What is claimed is:

1. A method for providing extracted entity data from electronic documents, the method comprising: receiving entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document; selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data; preventing extraction of entity data from a section of the electronic document having distorted content by: generating a first-order hidden markov model for each section of the electronic document, based upon a layout of the document; applying the first-order hidden markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section; aligning the section with characters extracted from the section of the electronic document; and configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment; assembling the selected entity data into desired formats; filling a portion of the set of slots with the portions of the selected entity data; and outputting a marked phrase from the organized entity data.

2. The method according to claim 1, wherein the organized entity data are arranged into an extensible markup language file.

3. The method according to claim 1, further comprising generating a user interface that includes the organized entity data and a view of the electronic document that includes an annotation for each of the extracted entities.

4. The method according to claim 1, wherein the layout defines a target section and one or more target entity data included in the target section that are to be extracted by the two or more extractors.

5. The method according to claim 1, further comprising filling a slot with extracted entity data when the extracted entity data matches the property for the slot.

6. The method according to claim 5, further comprising validating the slot when the slots of the set are filled with extracted entity data.

7. A system for providing extracted entity data from electronic documents, the system comprising: two or more experts that each: receives entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document; selects extracted entity data, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data; assembles the selected entity data into desired formats; and fills a portion of the set of slots with the portions of the selected entity data; a disambiguation module that prevents extraction of entity data from a section of the electronic document having distorted content by: generating a first-order hidden markov model for each section of the electronic document, based upon a layout of the document; applying the first-order hidden markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section; aligning the section with characters extracted from the section of the electronic document; and configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment; and an output generator that outputs a marked phrase from the organized entity data.

8. The system according to claim 7, wherein the output generator organizes the entity data into an extensible markup language file.

9. The system according to claim 7, wherein the output generator generates a user interface that includes the organized entity data and a view of the electronic document that includes an annotation for each of the extracted entity data.

10. The system according to claim 7, wherein the layout defines a target section and one or more target entity data included in the target section that are to be extracted by the two or more extractors.

11. The system according to claim 7, wherein an expert of the two or more experts fills a slot with extracted entity data when the extracted entity data matches the property for the slot.

12. The system according to claim 11, wherein the expert validates the slot when the slots of the set are filled with extracted entity data.

13. The system according to claim 12, wherein the expert generates a combined set that includes a validated set and one or more additional slots which are to be filled.

14. A non-transitory computer readable storage media having a program embodied thereon, the program being executable by a processor to perform a method for extracting entity data from electronic documents, the method comprising: receiving entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document; normalizing the extracted entity data by applying a normalization scheme to the extracted entity data, the normalization scheme converting the extracted entity data, the normalization scheme converting the extracted entity data into a standardized format; selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data; preventing extraction of entity data from a section of the electronic document having distorted content by: generating a first-order hidden markov model for each section of the electronic document, based upon a layout of the document; applying the first-order hidden markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section; aligning the section with characters extracted from the section of the electronic document; and configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment; executing table experts that produce special annotations that identify table cells for the electronic document which include the extracted and normalized entity data; assembling the selected entity data into desired formats; filling a portion of the set of slots with the portions of the selected entity data; and outputting a marked phrase from the organized entity data.

15. A method for disambiguation that prevents extraction of entity data from a section of an electronic document having distorted content, the method comprising: generating a first-order hidden markov model for each section of an electronic document, based upon a layout of the document; applying the first-order hidden markov model to a section of the electronic document that includes distorted text to determine the mo
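The claims refer to applying a first-order hidden Markov model to determine the most likely hidden states for a section and then ignoring distorted content. A common way to find the most likely state sequence of such a model is Viterbi decoding; the claim text does not name the algorithm, so its use here is an assumption. The sketch below is illustrative only: the two states, the coarse `observe` mapping, the helper `keep_clean`, and every probability value are hypothetical and are not taken from the patent.

```python
import math

STATES = ("clean", "distorted")

# All probabilities below are made-up illustration values.
START = {"clean": 0.8, "distorted": 0.2}
TRANS = {
    "clean": {"clean": 0.9, "distorted": 0.1},
    "distorted": {"clean": 0.2, "distorted": 0.8},
}
EMIT = {
    "clean": {"alnum": 0.9, "junk": 0.1},
    "distorted": {"alnum": 0.3, "junk": 0.7},
}


def observe(ch):
    """Reduce an OCR character to a coarse observation symbol."""
    return "alnum" if ch.isalnum() or ch.isspace() else "junk"


def viterbi(obs):
    """Return the most likely hidden-state sequence for the observations."""
    if not obs:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one back-pointer table per observation after the first
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def keep_clean(text):
    """Pair each character with its decoded state and drop distorted ones."""
    states = viterbi([observe(ch) for ch in text])
    return "".join(ch for ch, st in zip(text, states) if st == "clean")
```

Pairing each decoded state with the character at the same position is only a loose stand-in for the alignment step in the claims; downstream extractors and experts would then skip the spans marked as distorted.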

Assignees

Open Text Holdings Inc

Inventors

Classifications

G06V30/274 · Syntactic or semantic context, e.g. balancing · CPC title
G06V30/10 · Character recognition · CPC title
G06V30/414 (Primary) · Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title
G06K9/726 · Physics · mapped topic
G06K9/00463 (Primary) · Physics · mapped topic

Patent family

Related publications grouped by family.

Patent family 48871159

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10755093B2 cover?: Systems, methods, and media for extracting and processing entity data included in an electronic document are provided herein. Methods may include executing one or more extractors to extract entity data within an electronic document based upon an extraction model for the document, selecting extracted entity data via one or more experts, each of the experts applying at least one business rule to …
Who is the assignee on this patent?: Open Text Holdings Inc

Related patents

User-defined automated document feature extraction and optimization

User-defined automated document feature modeling, extraction and optimization

Quality control calculator for document review

Advanced field extractor with multiple positive examples
