US9477756B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9477756-B1 |
| Application number | US-201213350951-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 16, 2012 |
| Priority date | Jan 16, 2012 |
| Publication date | Oct 25, 2016 |
| Grant date | Oct 25, 2016 |
Official abstract text for this publication.
Technologies are described herein for classifying structured documents based on the structure of the document. A structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content of the document. The text string may be broken into N-grams utilizing a sliding window, and a classifier trained from similar structured documents labeled as belonging to one of a number of document classes is utilized to determine a probability that the document belongs to each of the document classes based on the N-grams.
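As a rough illustration of the abstract, the sketch below turns an HTML document into a structure-only text string (one word per tag, no textual content) and breaks it into N-grams with a sliding window. It is a minimal sketch using Python's standard `html.parser`; the names `TagCollector`, `structure_string`, and `ngrams`, and the default window size of 3, are assumptions made for illustration and are not taken from the patent.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects one word per opening tag and ignores all text content."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this handler.
        self.words.append(tag)


def structure_string(html):
    """Return a space-separated string of tag names in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.words)


def ngrams(text, n=3):
    """Slide a window of n words over the text, one word at a time."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


doc = "<html><body><table><tr><td>Order #123</td></tr></table></body></html>"
structure = structure_string(doc)
print(structure)
print(ngrams(structure, n=3))
```

The resulting N-grams would then be the features handed to a trained classifier; the text "Order #123" never reaches it, which matches the abstract's emphasis on structure rather than semantic content.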
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method of classifying documents, the method comprising executing instructions in a computer system to perform operations comprising: receiving a hypertext markup language (HTML) document; parsing HTML tags from the HTML document; generating a text string comprising a word for each of the HTML tags; deriving metadata from the HTML document; adding the metadata to the text string, the metadata representing a complexity value of a structure of the HTML document based upon a maximum depth of nested HTML tags in the HTML document; breaking the text string into N-grams utilizing a sliding window; utilizing a naïve Bayesian classifier trained from similar structured documents labeled as belonging to a first document class to determine a first probability that the HTML document belongs to the first document class based at least in part upon the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as a coefficient in determination of the first probability; classifying the document as belonging to the first document class based at least in part on the first probability that the HTML document belonging to the first document class that satisfies a threshold condition; and in response to the first probability not satisfying the threshold condition, utilizing the naïve Bayesian classifier trained from similar structured documents labeled as belonging to a second document class to determine a second probability that the HTML document belongs to the second document class based at least in part on the N-grams, the naïve Bayesian classifier using the complexity value of the structure of the HTML document as the coefficient in determination of the second probability.
2. The computer-implemented method of claim 1, wherein the metadata added to the text string comprises an identification of an author of the HTML document to be utilized by the naïve Bayesian classifier.
3. The computer-implemented method of claim 1, wherein the HTML document and at least one of the similar structured documents comprise HTML documents generated from a same template.
4. A system for classifying documents, the system comprising: a computing device; and a document classification module executing on the computing device and configured to at least: train an N-gramming classifier from a plurality of training documents, each of the plurality of training documents labeled as belonging to one or a plurality of document classes; receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the N-gramming classifier breaks the text string into N-grams utilizing a sliding window; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document to be utilized by the N-gramming classifier based upon a maximum depth of nested structural elements in the structured document; utilize the N-gramming classifier to classify the structured document as belonging to one of the plurality of document classes based on the text string representing the structure of the structured document, the N-gramming classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to one of the plurality of document classes; determine whether the first probability that the structured document belongs to a first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to a second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.
5. The system of claim 4, wherein the N-gramming classifier comprises a naïve Bayesian classifier.
6. The system of claim 4, wherein the structured document comprises a hypertext markup language (HTML) document and wherein the text string representing the structure of the structured document comprises a word for each of the HTML tags in the HTML document.
7. The system of claim 6, wherein the text string representing the structure of the structured document further comprises a word for an attribute of at least one of the HTML tags in the HTML document.
8. The system of claim 6, wherein the structured document comprises an HTML-based email message generated from a template.
9. The system of claim 4, wherein the metadata added to the text string comprises an identification of an author of the structured document to be utilized by the N-gramming classifier in classifying the structured document.
10. A non-transitory computer-readable storage medium having computer-executable instructions stored thereon that, when executed by a computing device, cause the computing device to at least: receive a structured document; parse structural elements from the structured document; generate a text string representing a structure of the structured document from the structural elements, wherein the text string does not comprise textual content from the structured document; derive metadata from the structured document; add the metadata to the text string, the metadata representing a complexity value of the structure of the structured document based at least in part upon a maximum depth of nested structural elements in the structured document; group structural elements in the text string into N-grams utilizing a sliding window; utilize a classifier trained from a plurality of training documents labeled as belonging to a first document class to determine a probability that the structured document belongs to the first document class based on the N-grams, the classifier using the complexity value of the structure of the structured document as a coefficient for a first probability that the structured document belongs to the first document class, wherein the complexity value has a relatively lower value for a first set of documents that are similar and include a simple structure while the complexity value has a relatively higher value for a second set of documents that are relatively less similar and include a relatively more complex structure as compared to the first set of documents; determine whether the first probability that the structured document belongs to the first document class satisfies a threshold value; if the first probability that the structured document belongs to the first document class satisfies the threshold value, classifying the structured document as belonging to the first document class; and if the first probability that the structured document belongs to the first document class does not satisfy the threshold value, utilizing the classifier trained from a plurality of training documents labeled as belonging to the second document class to determine a second probability that the structured document belongs to the second document class based on the text string representing the structure of the structured document.
11. The non-transitory computer-readable storage medium of claim 10, comprising further computer-executable instructions that cause
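Two steps recited in the independent claims lend themselves to a short sketch: deriving a complexity value from the maximum depth of nested tags, and falling back to a second document class when the first probability does not satisfy the threshold. The code below is a hedged illustration only. The names `DepthMeter`, `max_nesting_depth`, and `classify_with_fallback` are hypothetical, "satisfies the threshold" is assumed to mean "greater than or equal to", and the classifier is passed in as a plain callable because the claim text shown here does not say how the complexity coefficient enters the probability.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthMeter(HTMLParser):
    """Tracks the deepest level of tag nesting seen while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Simplification: assumes reasonably well-formed markup.
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def max_nesting_depth(html):
    """Return the maximum depth of nested tags (0 for tag-free input)."""
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth


def classify_with_fallback(class_order, probability_of, threshold):
    """Try classes in order; return the first whose probability meets the
    threshold, or (None, None) if none does. `probability_of` is a
    caller-supplied callable standing in for the trained classifier."""
    for name in class_order:
        p = probability_of(name)
        if p >= threshold:
            return name, p
    return None, None


html = "<html><body><div><p>Hello<br></p></div></body></html>"
print(max_nesting_depth(html))

# Hypothetical per-class probabilities, hard-coded for the demonstration.
scores = {"receipt": 0.2, "newsletter": 0.9}
print(classify_with_fallback(["receipt", "newsletter"], scores.get, 0.5))
```

Keeping the classifier behind a callable leaves room for whatever way a real implementation applies the complexity value as a coefficient, which this sketch deliberately does not guess at.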
Parsing markup language streams (streaming G06F40/149) · CPC title
into predefined classes · CPC title
Indexing; Web crawling techniques · CPC title
Clustering or classification · CPC title
Physics · mapped topic