Classifying structured documents

US9477756B1 · US · B1

Patent metadata
Field: Value
Publication number: US-9477756-B1
Application number: US-201213350951-A
Country: US
Kind code: B1
Filing date: Jan 16, 2012
Priority date: Jan 16, 2012
Publication date: Oct 25, 2016
Grant date: Oct 25, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Technologies are described herein for classifying structured documents based on the structure of the document. A structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content of the document. The text string may be broken into N-grams utilizing a sliding window, and a classifier trained from similar structured documents labeled as belonging to one of a number of document classes is utilized to determine a probability that the document belongs to each of the document classes based on the N-grams.
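The first two steps in the abstract (turning markup into a structure-only text string, then sliding a window over it) can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the names TagCollector, structure_string, and ngrams are hypothetical, and the leading slash for closing tags and the default window size of 3 are assumptions that the text shown here does not specify.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects one word per tag and ignores all textual content."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        self.words.append(tag)

    def handle_endtag(self, tag):
        # Assumption: closing tags become their own word, marked with "/".
        self.words.append("/" + tag)


def structure_string(html):
    """Return a space-separated string describing only the tag structure."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.words)


def ngrams(text, n=3):
    """Slide a window of n words over the string, one word at a time."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    text = structure_string("<html><body><p>Hello</p></body></html>")
    print(text)
    print(ngrams(text, n=3))
```

Because the words come from tags only, two documents generated from the same template yield nearly identical N-grams even when their visible text differs.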

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of classifying documents, the method comprising executing instructions in a computer system to perform operations comprising: receiving a hypertext markup language (HTML) document; parsing HTML tags from the HTML document; generating a text string comprising a word for each of the HTML tags; deriving metadata from the HTML document; adding the metadata to the text string, the metadata representing a complexity value of a structure of the HTML document based upon a maximum depth of nested HTML tags in the HTML document; breaking the text string into N-grams utilizing a sliding window; utilizing a naïve Bayesian classifier trained from similar structured documents labeled as belonging to a first document class to determine a first probability that the HTML document belongs to the first document class based at least in part upon the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as a coefficient in determination of the first probability; classifying the document as belonging to the first document class based at least in part on the first probability that the HTML document belonging to the first document class that satisfies a threshold condition; and in response to the first probability not satisfying the threshold condition, utilizing the naïve Bayesian classifier trained from similar structured documents labeled as belonging to a second document class to determine a second probability that the HTML document belongs to the second document class based at least in part on the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as the coefficient in determination of the second probability.

2. The computer-implemented method of claim 1, wherein the metadata added to the text string comprises an identification of an author of the HTML document to be utilized by the naïve Bayesian classifier.

3. The computer-implemented method of claim 1, wherein the HTML document and at least one of the similar structured documents comprise HTML documents generated from a same template.

4. A system for classifying documents, the system comprising: a computing device; and a document classification module executing on the computing device and configured to at least: train an N-gramming classifier from a plurality of training documents, each of the plurality of training documents labeled as belonging to one or a plurality of document classes; receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the N-gramming classifier breaks the text string into N-grams utilizing a sliding window; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document to be utilized by the N-gramming classifier based upon a maximum depth of nested structural elements in the structured document; utilize the N-gramming classifier to classify the structured document as belonging to one of the plurality of document classes based on the text string representing the structure of the structured document, the N-gramming classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to one of the plurality of document classes; determine whether the first probability that the structured document belongs to a first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to a second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.

5. The system of claim 4, wherein the N-gramming classifier comprises a naïve Bayesian classifier.

6. The system of claim 4, wherein the structured document comprises a hypertext markup language (HTML) document and wherein the text string representing the structure of the structured document comprises a word for each of the HTML tags in the HTML document.

7. The system of claim 6, wherein the text string representing the structure of the structured document further comprises a word for an attribute of at least one of the HTML tags in the HTML document.

8. The system of claim 6, wherein the structured document comprises an HTML-based email message generated from a template.

9. The system of claim 4, wherein the metadata added to the text string comprises an identification of an author of the structured document to be utilized by the N-gramming classifier in classifying the structured document.

10. A non-transitory computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a computing device, cause the computing device to at least: receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the text string does not comprise textual content from the structured document; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document based at least in part upon a maximum depth of nested structural elements in the structured document; group structural elements in the text string into N-grams utilizing a sliding window; utilize a classifier trained from a plurality of training documents labeled as belonging to a first document class to determine a probability that the structured document belongs to the first document class based on the N-grams, the classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to the first document class, wherein the complexity value has a relatively lower value for a first set of documents that are similar and include a simple structure while the complexity value has a relatively higher value for a second set of documents that are relatively less similar and include a relatively more complex structure as compared to the first set of documents; determine whether the first probability that the structured document belongs to the first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to the second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.

11. The non-transitory computer-readable storage medium of claim 10, comprising further computer-executable instructions that cause
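Two ideas recur across the independent claims: a complexity value based on the maximum depth of nested tags, and a threshold check that falls through from a first document class to a second. The Python sketch below illustrates both under stated assumptions and should not be read as the claimed method. All function and class names are hypothetical. The classifier is a simple naive Bayes over bigrams with uniform class priors and add-one smoothing, and the 0.8 threshold is arbitrary. The sketch deliberately does not model the complexity value as a coefficient, because the claim text shown here does not say how it enters the probability calculation; it only computes the nesting depth on which that value is based. For simplicity it also computes all class posteriors up front and then checks them in order.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never have a closing tag, so they add no nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthTracker(HTMLParser):
    """Tracks the maximum depth of nested tags while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def max_nesting_depth(html):
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth


def ngrams(text, n=2):
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def train(strings_by_class, n=2):
    """Count N-grams per class from labeled structure strings."""
    return {label: Counter(g for s in strings for g in ngrams(s, n))
            for label, strings in strings_by_class.items()}


def class_probabilities(model, text, n=2):
    """Posterior per class; uniform priors and add-one smoothing assumed."""
    vocab_size = len(set().union(*model.values())) or 1
    log_scores = {}
    for label, counts in model.items():
        total = sum(counts.values())
        log_scores[label] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in ngrams(text, n))
    # Normalize in log space to avoid underflow on long documents.
    peak = max(log_scores.values())
    norm = sum(math.exp(v - peak) for v in log_scores.values())
    return {label: math.exp(v - peak) / norm
            for label, v in log_scores.items()}


def classify(model, text, class_order, threshold=0.8, n=2):
    """Check classes in order; return the first one meeting the threshold."""
    probs = class_probabilities(model, text, n)
    for label in class_order:
        if probs[label] >= threshold:
            return label, probs[label]
    return None, None
```

A caller would train on structure strings grouped by label, for example `train({"receipt": [...], "newsletter": [...]})`, and then call `classify` with the classes in the order they should be tried. Returning `None` when no class meets the threshold is a design choice of this sketch, not something the claims specify.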

Assignees

Inventors

Classifications

  • Parsing markup language streams (streaming G06F40/149) · CPC title

  • G06F16/353 · Primary

    into predefined classes · CPC title

  • Indexing; Web crawling techniques · CPC title

  • G06F16/285 · Primary

    Clustering or classification · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9477756B1 cover?
Technologies are described herein for classifying structured documents based on the structure of the document. A structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content of the document. The text string may be broken into N-grams utilizing a sliding windo…
Who is the assignee on this patent?
Park Thomas Robert, Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/353. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 25, 2016 (kind code B1). Legal status and post-grant events are not shown on this page.