Methods and systems for classifying data using a hierarchical taxonomy

US9946783B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9946783-B1
Application number  US-201615153097-A
Country             US
Kind code           B1
Filing date         May 12, 2016
Priority date       Dec 27, 2011
Publication date    Apr 17, 2018
Grant date          Apr 17, 2018

Abstract

Official abstract text for this publication.

A method and system for classifying documents is provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified document is associated with a classification within the taxonomy and a classification confidence level. One or more classified documents that are associated with a classification confidence level below a predetermined threshold value are selected to create a set of low-confidence documents. The low-confidence documents are disassociated from each of the associated classifications. A user is prompted to enter a classification within the taxonomy for at least one low-confidence document. The low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document.
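
The review loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ClassifiedDocument` type, the function names, the 0.0 to 1.0 confidence scale, the slash-delimited taxonomy labels, and the `ask_user` callback (standing in for the user prompt) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClassifiedDocument:
    doc_id: str
    classification: Optional[str]  # hypothetical taxonomy label, e.g. "Sports/Golf"
    confidence: float              # assumed to lie in [0.0, 1.0]


def select_low_confidence(docs: List[ClassifiedDocument],
                          threshold: float) -> List[ClassifiedDocument]:
    """Return documents whose confidence is below the threshold."""
    return [d for d in docs if d.confidence < threshold]


def review_low_confidence(
    docs: List[ClassifiedDocument],
    threshold: float,
    ask_user: Callable[[ClassifiedDocument], Optional[str]],
    manual_confidence: float = 1.0,  # assumed "predetermined confidence level"
) -> List[ClassifiedDocument]:
    """Disassociate low-confidence documents, then apply user-entered labels."""
    newly_classified = []
    for doc in select_low_confidence(docs, threshold):
        # Disassociate the document from its automatic classification.
        doc.classification = None
        doc.confidence = 0.0
        entered = ask_user(doc)  # stands in for prompting a user
        if entered is not None:
            doc.classification = entered
            doc.confidence = manual_confidence
            newly_classified.append(doc)
    return newly_classified


docs = [
    ClassifiedDocument("a", "Sports/Tennis", 0.92),
    ClassifiedDocument("b", "Sports/Golf", 0.31),
    ClassifiedDocument("c", "Arts/Music", 0.48),
]
# Canned answers replace an interactive prompt for this example.
answers = {"b": "Arts/Film", "c": "Arts/Music"}
newly = review_low_confidence(docs, 0.5, lambda d: answers.get(d.doc_id))
```

With a threshold of 0.5, documents "b" and "c" are reviewed and document "a" keeps its automatic classification.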

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: identifying, within a set of documents that have been classified within a hierarchical taxonomy using a classification algorithm, documents having a classification confidence level that is below a predetermined confidence level threshold; disassociating the identified documents from their respective classifications based on the classification level being below the predetermined confidence level threshold; obtaining, from a different classifier, a new classification within the hierarchical taxonomy for each of the identified documents; associating each of the newly classified documents with a highest classification confidence level for its respective new classification; including the newly classified documents in a trusted corpus of documents that are used to train the classification algorithm; determining a distribution of classifications of the newly classified documents within the trusted corpus of documents; updating the classification algorithm based on the trusted corpus of documents, such that the classification algorithm is configured to classify documents to promote a classification distribution that is in accordance with the determined distribution of classifications; and applying the updated classification algorithm to at least a portion of the set of documents to obtain new classifications within the taxonomy or new classification confidence levels for the portion of the set of documents, such that the at least a portion of the set of documents are classified in accordance with the classification distribution.

2. The method of claim 1, wherein the hierarchical taxonomy includes a plurality of levels, wherein each level includes one or more nodes that represent a classification.

3. The method of claim 1, wherein a classification confidence level for a given document is indicative of an accuracy of an assignment of a classification of the given document and is based on a measure of a degree to which data included in the given document matches attributes of the classification.

4. The method of claim 1, wherein updating the classification algorithm includes applying a supervised learning model that analyzes the trusted corpus to identify one or more attributes that are associated with classifications of documents in the trusted corpus.

5. The method of claim 1, wherein the classification algorithm includes a plurality of classifiers, the method further comprising assigning, by each of the classifiers, a different classification to documents that are recognized by the classifier as having attributes that match the classification.

6. The method of claim 5, further comprising updating the classification algorithm to include at least one new classifier, the new classifier corresponding to a new classification of at least one of the newly classified documents.

7. The method of claim 1, wherein the at least a portion of the set of documents are classified such that a proportion of documents within the at least a portion of the set of documents that are associated with a given classification is approximate to a proportion of documents within the trusted corpus of documents that have been associated with the given classification.

8. A computer system comprising: one or more memory elements for storing a set of documents that have been classified within a hierarchical taxonomy using a classification algorithm; and one or more processors coupled to the one or more memory elements and including instructions that, when executed, cause the one or more processors to perform operations comprising: identifying, within the set of documents that have been classified within a hierarchical taxonomy using a classification algorithm, documents having a classification confidence level that is below a predetermined confidence level threshold; disassociating the identified documents from their respective classifications based on the classification level being below the predetermined confidence level threshold; obtaining, from a different classifier, a new classification within the hierarchical taxonomy for each of the identified documents; associating each of the newly classified documents with a highest classification confidence level for its respective new classification; including the newly classified documents in a trusted corpus of documents that are used to train the classification algorithm; determining a distribution of classifications of the newly classified documents within the trusted corpus of documents; updating the classification algorithm based on the trusted corpus of documents, such that the classification algorithm is configured to classify documents to promote a classification distribution that is in accordance with the determined distribution of classifications; and applying the updated classification algorithm to at least a portion of the set of documents to obtain new classifications within the taxonomy or new classification confidence levels for the portion of the set of documents, such that the at least a portion of the set of documents are classified in accordance with the classification distribution.

9. The system of claim 8, wherein the hierarchical taxonomy includes a plurality of levels, wherein each level includes one or more nodes that represent a classification.

10. The system of claim 8, wherein a classification confidence level for a given document is indicative of an accuracy of an assignment of a classification of the given document and is based on a measure of a degree to which data included in the given document matches attributes of the classification.

11. The system of claim 8, wherein updating the classification algorithm includes applying a supervised learning model that analyzes the trusted corpus to identify one or more attributes that are associated with classifications of documents in the trusted corpus.

12. The system of claim 8, wherein the classification algorithm includes a plurality of classifiers, the operations further comprising assigning, by each of the classifiers, a different classification to documents that are recognized by the classifier as having attributes that match the classification.

13. The system of claim 12, the operations further comprising updating the classification algorithm to include at least one new classifier, the new classifier corresponding to a new classification of at least one of the newly classified documents.

14. The system of claim 8, wherein the at least a portion of the set of documents are classified such that a proportion of documents within the at least a portion of the set of documents that are associated with a given classification is approximate to a proportion of documents within the trusted corpus of documents that have been associated with the given classification.

15. One or more non-transitory computer-readable media encoded with instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: identifying, within a set of documents that have been classified within a hierarchical taxonomy using a classification algorithm, documents having a classification confidence level that is below a predetermined confidence level threshold; disassociating the identified documents from their respective classifications based on the classification level being below the predetermined confidence level threshold; obtaining, from a different classifier, a new classification within the hierarchical taxonomy for each of the identified documents; associating each of the newly classified documents with a highest class
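
Claims 1 and 7 refer to a distribution of classifications within the trusted corpus and to reclassified documents whose proportions approximate it. The claims do not say how the distribution is computed or compared, so the sketch below is only one plausible reading. The function names, the slash-delimited label format (used to illustrate the taxonomy levels of claim 2), and the maximum-gap comparison are assumptions introduced for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def classification_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the proportion of documents assigned to each classification."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def truncate_to_level(label: str, level: int) -> str:
    """Keep the first `level` nodes of a slash-delimited taxonomy path."""
    return "/".join(label.split("/")[:level])


def max_proportion_gap(target: Dict[str, float],
                       observed: Dict[str, float]) -> float:
    """Largest absolute difference in proportion over all classifications."""
    labels = set(target) | set(observed)
    return max(
        (abs(target.get(l, 0.0) - observed.get(l, 0.0)) for l in labels),
        default=0.0,
    )


# Hypothetical labels for documents in the trusted corpus.
trusted = ["Arts/Music", "Arts/Music", "Sports/Golf", "Arts/Film"]
target = classification_distribution(trusted)

# Hypothetical labels produced by re-running an updated classifier.
reclassified = [
    "Arts/Music", "Sports/Golf", "Arts/Music", "Arts/Film",
    "Arts/Music", "Sports/Golf", "Arts/Film", "Arts/Music",
]
observed = classification_distribution(reclassified)
gap = max_proportion_gap(target, observed)

# The same distribution measured at the top level of the taxonomy.
top_level = classification_distribution(
    truncate_to_level(label, 1) for label in trusted
)
```

A small gap would suggest that the reclassified documents roughly follow the trusted-corpus proportions. Which tolerance counts as "approximate" is not stated in the claims and would have to be chosen separately.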

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9946783B1 cover?
A method and system for classifying documents is provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified docu…
Who is the assignee on this patent?
Google LLC, Google Inc.
What technology area does this patent fall under?
Primary CPC classification G06F17/30598. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 17, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).