Method and apparatus for determining a document type of a digital document

US10152648B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10152648-B2
Application number: US-201615197143-A
Country: US
Kind code: B2
Filing date: Jun 29, 2016
Priority date: Jun 26, 2003
Publication date: Dec 11, 2018
Grant date: Dec 11, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

There is disclosed a method of determining a document type associated with a digital document, the method executable by an electronic device. A processor of the electronic device is configured to execute a plurality of machine learning algorithm (MLA) classifiers, each of the plurality of MLA classifiers having been trained to identify a specific document type. The plurality of MLA classifiers is ranked in a hierarchical order of execution of the plurality of MLA classifiers. A method of training the plurality of MLA classifiers is also disclosed.
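
The hierarchical order of execution described in the abstract can be pictured as a cascade: classifiers run one after another, and the first one whose answer falls within the document types it is trusted to predict decides the result. The following is a minimal Python sketch under stated assumptions, not the patented implementation. The names `Stage`, `determine_document_type`, `raster_stub`, and `text_stub` are hypothetical, and the stub classifiers are toy stand-ins for trained models.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical type: a classifier maps a document to (predicted_type, confidence).
Classifier = Callable[[str], Tuple[str, float]]


class Stage:
    """One classifier in the cascade plus the types it is trusted to predict."""

    def __init__(self, name: str, classify: Classifier, confident_types: Set[str]):
        self.name = name
        self.classify = classify
        self.confident_types = confident_types


def determine_document_type(
    document: str, stages: List[Stage]
) -> Tuple[Optional[str], Optional[str]]:
    """Run stages in order; stop at the first stage whose prediction is trusted."""
    for stage in stages:
        predicted_type, _confidence = stage.classify(document)
        if predicted_type in stage.confident_types:
            return predicted_type, stage.name
    # No stage produced one of its confidently predictable types.
    return None, None


# Toy stand-ins for trained classifiers (illustrative assumptions only).
def raster_stub(doc: str) -> Tuple[str, float]:
    return ("invoice", 0.9) if doc.startswith("INVOICE") else ("unknown", 0.2)


def text_stub(doc: str) -> Tuple[str, float]:
    return ("contract", 0.8) if "agreement" in doc.lower() else ("letter", 0.6)


stages = [
    Stage("raster", raster_stub, {"invoice"}),
    Stage("text", text_stub, {"contract", "letter"}),
]
```

In this sketch, `determine_document_type("INVOICE #12", stages)` is settled by the first stage, while a document the first stage cannot place falls through to the second. Ordering cheaper classifiers first is one plausible motivation for such a cascade, since later stages run only when earlier ones are not confident.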

First claim

Opening claim text (preview).

The invention claimed is:

1. A method of determining a document type associated with a digital document, the method comprising: training, using a first training set comprising a plurality of documents of a first document type, a first machine learning algorithm (MLA) classifier of a plurality of MLA classifiers; responsive to determining that the first MLA classifier confidently identifies the first document type, excluding the first document type from a second training set; training, using the second training set, a second MLA classifier of the plurality of MLA classifiers; acquiring, via a digital document interface, a digital document; executing, by a processor, the first MLA classifier in order to determine a document type for the digital document, the first MLA classifier being associated with a first hierarchical order of execution; responsive to determining that the document type produced by the first MLA classifier is not one of confidently predictable document types associated with the first MLA classifier, executing, by the processor, the second MLA classifier in order to determine the document type for the digital document, the second MLA classifier being associated with a second hierarchical order of execution immediately following the first hierarchical order of execution.

2. The method of claim 1, further comprising: responsive to determining that the document type produced by the first MLA classifier is one of confidently predictable document types associated with the first MLA classifier, assigning the document type produced by the first MLA classifier to the digital document.

3. The method of claim 1, further comprising: responsive to determining that the document type produced by the first MLA classifier is not one of confidently predictable document types associated with the first MLA classifier, determining whether the document type produced by the second MLA classifier is one of confidently predictable document types associated with the second MLA classifier; responsive to determining that the document type produced by the second MLA classifier is not one of confidently predictable document types associated with the second MLA classifier, executing, by the processor, a third MLA classifier of the plurality of MLA classifiers in order to determine the document type for the digital document, the third MLA classifier being associated with a third hierarchical order of execution immediately following the second hierarchical order of execution.

4. The method of claim 3, further comprising: responsive to determining that the document type produced by the second MLA classifier is one of confidently predictable document types associated with the second MLA classifier, assigning the document type produced by the second MLA classifier to the digital document.

5. The method of claim 1, wherein the plurality of MLA classifiers includes: the first MLA classifier; the second MLA classifier; a third MLA classifier, and a fourth MLA classifier.

6. The method of claim 5, wherein each of the first MLA classifier, the second MLA classifier, the third MLA classifier, and the fourth MLA classifier have been independently trained.

7. The method of claim 5, wherein the first MLA classifier is a raster-based classifier.

8. The method of claim 5, wherein the second MLA classifier is a logotype-based classifier.

9. The method of claim 5, wherein the third MLA classifier is a rule-based classifier.

10. The method of claim 9, wherein the third MLA classifier is further configured to execute an Optical Character Recognition (OCR) function on at least a pre-determined portion of the digital document.

11. The method of claim 5, wherein the fourth MLA classifier is a text-based classifier.

12. The method of claim 11, wherein the fourth MLA classifier is further configured to execute an Optical Character Recognition (OCR) function on substantially an entirety of the digital document.

13. The method of claim 1, wherein the digital document is provided by one of: a rigidly-structured document, a nearly-rigidly-structured document, a semi-structured document, and an un-structured document.

14. The method of claim 13, further comprising: based on the document type, executing a computer-executable action with respect to the digital document.

15. The method of claim 1, wherein the first document type is associated with a confidence parameter which is above a first pre-determined threshold and has a difference between the confidence parameter and a next-document-type hypothesis confidence parameter that is above a second pre-determined threshold.

16. The method of claim 14, wherein training the first MLA classifier further comprises: determining a confidence parameter associated with an output of the first MLA classifier.

17. The method of claim 16, wherein training the first MLA classifier further comprises: analyzing the confidence parameter for a given document type, and in response to one of: the confidence parameter being below a first pre-determined threshold or a difference between the confidence parameter and a next-document-type hypothesis confidence parameter being below a second pre-determined threshold, excluding the given document type from confidently predictable document types associated with the first MLA classifier.

18. The method of claim 16, wherein training the first MLA classifier further comprises: analyzing the confidence parameter for a document type, and in response to the confidence parameter being above a first pre-determined threshold and a difference between the confidence parameter and a next-document-type confidence parameter being above a second pre-determined threshold, determining that the document type is one of confidently predictable document types by the first MLA classifier.

19. The method of claim 18, wherein training the first MLA classifier further comprises: based on comparing document types produced by the first MLA classifier for a validation set of documents with labels associated with the validation set of documents: responsive to determining that precision and recall parameters for a document of the validation set exceed corresponding threshold values of precision and recall parameters, associating a type of the document with confidently predictable document types by the first MLA classifier.

20. The method of claim 1, wherein the digital document interface comprises a network interface and wherein the acquiring comprises: receiving the digital document over a communication network.

21. The method of claim 1, wherein the digital document interface comprises a scanner, and wherein the acquiring comprises: receiving a scanned version of a paper-based document.

22. An electronic device comprising: a digital document interface; a data storage device; a processor coupled to the digital document interface and to the data storage device; wherein the processor is configured to: train, using a first training set comprising a plurality of documents of a first document type, a first machine learning algorithm (MLA) classifier of a plurality of MLA classifiers; responsive to determining that the first MLA classifier confidently identifies the first document type, exclude the first document type from a second training set; train, using the second training set, a second MLA classifier of the plurality of MLA classifiers; acquire, via the digital document interface, a digital document; execute the first MLA
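
Two mechanics in the claims above lend themselves to a short illustration: the two-threshold confidence test of claims 15, 17, and 18, and the exclusion of confidently identified document types from the next classifier's training set in claim 1. The Python sketch below is one possible reading under stated assumptions, not the claimed implementation. The function names `is_confident` and `build_next_training_set`, the score dictionary, and the threshold values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def is_confident(
    scores: Dict[str, float], min_confidence: float, min_margin: float
) -> bool:
    """Two-threshold test: the top hypothesis must exceed min_confidence
    and lead the next hypothesis by more than min_margin."""
    ranked = sorted(scores.values(), reverse=True)
    if not ranked:
        return False
    best = ranked[0]
    # Assumption: with a single hypothesis, the runner-up score is taken as 0.0.
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return best > min_confidence and (best - runner_up) > min_margin


def build_next_training_set(
    training_set: List[Tuple[str, str]], confident_types: Set[str]
) -> List[Tuple[str, str]]:
    """Drop (document, label) pairs whose label an earlier classifier
    already identifies confidently, so the next classifier trains on the rest."""
    return [(doc, label) for doc, label in training_set if label not in confident_types]


# Illustrative data only; the labels and scores are made up for this sketch.
first_training_set = [
    ("scan_001", "invoice"),
    ("scan_002", "letter"),
    ("scan_003", "invoice"),
    ("scan_004", "contract"),
]
second_training_set = build_next_training_set(first_training_set, {"invoice"})
```

Here `second_training_set` keeps only the `letter` and `contract` examples, so the second classifier is trained on the types the first one does not settle. The margin check in `is_confident` guards against a case where the top score is high but a competing hypothesis scores almost as high.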

Assignees

Abbyy Dev Llc

Inventors

Classifications

  • Validation; Performance evaluation · CPC title

  • using rules for classification or partitioning the feature space · CPC title

  • using classification, e.g. of video objects · CPC title

  • G06V30/40 (Primary)

    Document-oriented image-based pattern recognition · CPC title

  • Piecewise classification, i.e. whereby each classification requires several discriminant rules · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10152648B2 cover?
There is disclosed a method of determining a document type associated with a digital document, the method executable by an electronic device. A processor of the electronic device is configured to execute a plurality of machine learning algorithm (MLA) classifiers, each of the plurality of MLA classifiers having been trained to identify a specific document type. The plurality of MLA classifiers …
Who is the assignee on this patent?
Abbyy Dev Llc
What technology area does this patent fall under?
Primary CPC classification G06V30/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 11, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).