Who is the assignee on this patent?

Park Thomas Robert, Amazon Tech Inc

What technology area does this patent fall under?

Primary CPC classification G06F16/353. Mapped technology areas include Physics.

When was this patent published?

Publication date: Oct 25, 2016 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Classifying structured documents

US9477756B1 · US · B1

Patent metadata
Field	Value
Publication number	US-9477756-B1
Application number	US-201213350951-A
Country	US
Kind code	B1
Filing date	Jan 16, 2012
Priority date	Jan 16, 2012
Publication date	Oct 25, 2016
Grant date	Oct 25, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Technologies are described herein for classifying structured documents based on the structure of the document. A structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content of the document. The text string may be broken into N-grams utilizing a sliding window, and a classifier trained from similar structured documents labeled as belonging to one of a number of document classes is utilized to determine a probability that the document belongs to each of the document classes based on the N-grams.
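
The abstract's core idea (keep the structure, drop the content, then window the result) can be illustrated with a minimal Python sketch. It assumes HTML input and uses only the standard library; the names `TagCollector`, `structure_string`, and `ngrams` are hypothetical helpers chosen for illustration and do not come from the patent.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names in document order, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # One word per opening tag; the text between tags is never recorded.
        self.tags.append(tag)


def structure_string(html):
    """Return a space-separated string with one word per opening tag."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.tags)


def ngrams(text, n=3):
    """Slide a window of n words across the text, advancing one word at a time."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "<html><body><table><tr><td>Order 123</td></tr></table></body></html>"
structure = structure_string(sample)   # "html body table tr td"
trigrams = ngrams(structure, n=3)
```

Here the cell text "Order 123" never reaches the structure string, so two documents built from the same template yield the same N-grams regardless of what they say.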

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of classifying documents, the method comprising executing instructions in a computer system to perform operations comprising: receiving a hypertext markup language (HTML) document; parsing HTML tags from the HTML document; generating a text string comprising a word for each of the HTML tags; deriving metadata from the HTML document; adding the metadata to the text string, the metadata representing a complexity value of a structure of the HTML document based upon a maximum depth of nested HTML tags in the HTML document; breaking the text string into N-grams utilizing a sliding window; utilizing a naïve Bayesian classifier trained from similar structured documents labeled as belonging to a first document class to determine a first probability that the HTML document belongs to the first document class based at least in part upon the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as a coefficient in determination of the first probability; classifying the document as belonging to the first document class based at least in part on the first probability that the HTML document belonging to the first document class that satisfies a threshold condition; and in response to the first probability not satisfying the threshold condition, utilizing the naïve Bayesian classifier trained from similar structured documents labeled as belonging to a second document class to determine a second probability that the HTML document belongs to the second document class based at least in part on the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as the coefficient in determination of the second probability.

2. The computer-implemented method of claim 1, wherein the metadata added to the text string comprises an identification of an author of the HTML document to be utilized by the naïve Bayesian classifier.

3. The computer-implemented method of claim 1, wherein the HTML document and at least one of the similar structured documents comprise HTML documents generated from a same template.

4. A system for classifying documents, the system comprising: a computing device; and a document classification module executing on the computing device and configured to at least: train an N-gramming classifier from a plurality of training documents, each of the plurality of training documents labeled as belonging to one or a plurality of document classes; receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the N-gramming classifier breaks the text string into N-grams utilizing a sliding window; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document to be utilized by the N-gramming classifier based upon a maximum depth of nested structural elements in the structured document; utilize the N-gramming classifier to classify the structured document as belonging to one of the plurality of document classes based on the text string representing the structure of the structured document, the N-gramming classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to one of the plurality of document classes; determine whether the first probability that the structured document belongs to a first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to a second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.

5. The system of claim 4, wherein the N-gramming classifier comprises a naïve Bayesian classifier.

6. The system of claim 4, wherein the structured document comprises a hypertext markup language (HTML) document and wherein the text string representing the structure of the structured document comprises a word for each of the HTML tags in the HTML document.

7. The system of claim 6, wherein the text string representing the structure of the structured document further comprises a word for an attribute of at least one of the HTML tags in the HTML document.

8. The system of claim 6, wherein the structured document comprises an HTML-based email message generated from a template.

9. The system of claim 4, wherein the metadata added to the text string comprises an identification of an author of the structured document to be utilized by the N-gramming classifier in classifying the structured document.

10. A non-transitory computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a computing device, cause the computing device to at least: receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the text string does not comprise textual content from the structured document; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document based at least in part upon a maximum depth of nested structural elements in the structured document; group structural elements in the text string into N-grams utilizing a sliding window; utilize a classifier trained from a plurality of training documents labeled as belonging to a first document class to determine a probability that the structured document belongs to the first document class based on the N-grams, the classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to the first document class, wherein the complexity value has a relatively lower value for a first set of documents that are similar and include a simple structure while the complexity value has a relatively higher value for a second set of documents that are relatively less similar and include a relatively more complex structure as compared to the first set of documents; determine whether the first probability that the structured document belongs to the first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to the second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.

11. The non-transitory computer-readable storage medium of claim 10, comprising further computer-executable instructions that cause
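
The flow in the independent claims (tag words, a depth-based metadata word, sliding-window N-grams, a naïve Bayesian probability, and a threshold check that falls through to the next class) can be sketched roughly as below. This is an illustration under stated assumptions, not the patented implementation: the `depthN` word encoding, the add-one smoothing, the 0.8 threshold, and every function name are hypothetical. The claims also use the complexity value as a coefficient in the probability, but this page does not say how, so the sketch only adds the depth to the text string and leaves the coefficient out.

```python
import math
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class StructureParser(HTMLParser):
    """Records tag words and the maximum nesting depth of an HTML document."""

    def __init__(self):
        super().__init__()
        self.words, self.depth, self.max_depth = [], 0, 0

    def handle_starttag(self, tag, attrs):
        self.words.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def features(html, n=2):
    """Return sliding-window N-grams over tag words plus a depth metadata word."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    words = parser.words + [f"depth{parser.max_depth}"]  # hypothetical encoding
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def train(labeled_docs):
    """Count N-grams per class; labeled_docs is a list of (html, label) pairs."""
    counts, totals = {}, Counter()
    for html, label in labeled_docs:
        counts.setdefault(label, Counter()).update(features(html))
        totals[label] += 1
    return counts, totals


def class_probabilities(model, html):
    """Multinomial naive Bayes posteriors with add-one smoothing (an assumption)."""
    counts, totals = model
    vocab = {gram for c in counts.values() for gram in c}
    grams = features(html)
    logp = {}
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab) + 1  # extra slot for unseen N-grams
        logp[label] = math.log(totals[label] / sum(totals.values()))
        logp[label] += sum(math.log((c[g] + 1) / denom) for g in grams)
    peak = max(logp.values())  # subtract the peak before exponentiating
    norm = sum(math.exp(v - peak) for v in logp.values())
    return {label: math.exp(v - peak) / norm for label, v in logp.items()}


def classify(model, html, class_order, threshold=0.8):
    """Try classes in order; accept the first whose probability meets the threshold."""
    probs = class_probabilities(model, html)
    for label in class_order:
        if probs[label] >= threshold:
            return label, probs[label]
    return None, 0.0


receipt = "<html><body><table><tr><td>Item</td><td>$5</td></tr></table></body></html>"
newsletter = "<html><body><div><h1>News</h1><p>Hello</p><p>More</p></div></body></html>"
model = train([(receipt, "receipt"), (newsletter, "newsletter")])

unseen = "<html><body><table><tr><td>Book</td><td>$9</td></tr></table></body></html>"
label, prob = classify(model, unseen, ["newsletter", "receipt"])
```

In this toy run the unseen document shares the receipt's tag structure, so the "newsletter" check falls below the threshold and the second class, "receipt", is accepted. A real system would train on many labeled documents per class.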

Assignees

Park Thomas Robert, Amazon Tech Inc

Inventors

Park Thomas Robert

Classifications

G06F40/221
Parsing markup language streams (streaming G06F40/149) · CPC title
G06F16/353 · Primary
into predefined classes · CPC title
G06F16/951
Indexing; Web crawling techniques · CPC title
G06F16/285 · Primary
Clustering or classification · CPC title
G06F17/30707
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 57137424
