What technology area does this patent fall under?

The primary CPC classification is G06N20/00 (Machine learning). Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 26, 2017 (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Identifying longform articles

US9773166B1 · US · B1

Patent metadata
Field	Value
Publication number	US-9773166-B1
Application number	US-201514931576-A
Country	US
Kind code	B1
Filing date	Nov 3, 2015
Priority date	Nov 3, 2014
Publication date	Sep 26, 2017
Grant date	Sep 26, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying documents. One of the methods includes obtaining a collection of training documents, the training documents including positive documents identified as being longform documents and negative documents identified as not being longform documents; extracting one or more features from the training documents, wherein the features represent lexical or textual content of the training documents; and generating a longform document classifier trained using feature instances extracted from the training documents, wherein the generated longform document classifier is trained such that input documents are classified as being longform documents or classified as not being longform documents.
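The training flow in the abstract (labeled positive and negative documents, feature extraction, classifier generation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two toy features (mean sentence length and type-token ratio), the nearest-centroid classifier, the function names, and the sample texts. The abstract does not name a learning algorithm, so this is not the patented method.

```python
import re
from statistics import mean

def extract_features(text: str) -> list[float]:
    """Toy lexical features (hypothetical): mean sentence length in words
    and type-token ratio (distinct words / total words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return [0.0, 0.0]
    return [len(words) / len(sentences), len(set(words)) / len(words)]

def train_centroids(positive: list[str], negative: list[str]) -> dict[str, list[float]]:
    """Nearest-centroid 'classifier': one mean feature vector per label.
    This is a stand-in technique chosen for brevity."""
    def centroid(docs: list[str]) -> list[float]:
        rows = [extract_features(d) for d in docs]
        return [mean(col) for col in zip(*rows)]
    return {"longform": centroid(positive), "not_longform": centroid(negative)}

def classify(model: dict[str, list[float]], text: str) -> str:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    x = extract_features(text)
    def dist(c: list[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist(model[label]))

# Invented training documents, for illustration only.
positive_docs = [
    "The river wound slowly through the valley, carrying with it the memory "
    "of every storm that had ever broken over the distant hills. She watched "
    "it for hours, wondering how many others had stood on this same bank and "
    "asked the same questions.",
]
negative_docs = [
    "Buy now. Sale ends today. Click here.",
    "Top tips. Read more. Share this.",
]

model = train_centroids(positive_docs, negative_docs)
short_label = classify(model, "Flash sale. Click now.")
long_label = classify(
    model,
    "For most of that long winter the town kept to itself, and nobody could "
    "say exactly when the first letters began to arrive from the coast.",
)
```

With these toy inputs the short promotional text lands nearer the negative centroid and the long sentence nearer the positive one. The features are unscaled, so sentence length dominates the distance; a realistic system would normalize features and use many more of them.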

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: obtaining a collection of training documents, the training documents including a group of positive documents and a group of negative documents, wherein the positive documents are training documents identified as being longform documents and the negative documents are training documents identified as not being longform documents; extracting a plurality of features from the training documents, wherein the plurality of features are associated with a plurality of different feature types that represent lexical or textual content of the training documents that are indicative of a document's writing style; generating a longform document classifier trained using feature instances extracted from the training documents, wherein the generated longform document classifier is trained such that input documents are classified as being longform documents or classified as not being longform documents; applying the longform document classifier to a corpus of documents; annotating an information retrieval index with an output classification for each document of the corpus of documents; and using the annotated index to provide search results identifying longform documents in response to a search query.
2. The method of claim 1, further comprising evaluating the generated longform document classifier using a group of sample documents having known classifications.
3. The method of claim 1, wherein the one or more features include a parse n-gram feature that indicates common sentence structures in the documents based on dependency parse trees.
4. The method of claim 1, wherein the one or more features include a parts of speech n-gram feature that indicates aspects of common sentence structures in the documents.
5. The method of claim 1, wherein the one or more features include a linear parse n-gram feature that extracts parse tags for a sequence of tokens.
6. The method of claim 1, wherein the one or more features include a pronoun person frequency feature that identifies a relative frequency of first, second, and third person pronouns among all pronouns in a given document.
7. The method of claim 1, wherein the one or more features include a pronoun person frequency feature that identifies a frequency of pronouns in a given document.
8. The method of claim 1, wherein the one or more features include a punctuation frequency that identifies a relative frequency of different punctuation types in a given document.
9. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: obtaining a collection of training documents, the training documents including a group of positive documents and a group of negative documents, wherein the positive documents are training documents identified as being longform documents and the negative documents are training documents identified as not being longform documents; extracting a plurality of features from the training documents, wherein the plurality of features are associated with a plurality of different feature types that represent lexical or textual content of the training documents that are indicative of a document's writing style; generating a longform document classifier trained using feature instances extracted from the training documents, wherein the generated longform document classifier is trained such that input documents are classified as being longform documents or classified as not being longform documents; applying the longform document classifier to a corpus of documents; annotating an information retrieval index with an output classification for each document of the corpus of documents; and using the annotated index to provide search results identifying longform documents in response to a search query.
10. The system of claim 9, further operable to perform operations comprising evaluating the generated longform document classifier using a group of sample documents having known classifications.
11. The system of claim 9, wherein the one or more features include a parse n-gram feature that indicates common sentence structures in the documents based on dependency parse trees.
12. The system of claim 9, wherein the one or more features include a parts of speech n-gram feature that indicates aspects of common sentence structures in the documents.
13. The system of claim 9, wherein the one or more features include a linear parse n-gram feature that extracts parse tags for a sequence of tokens.
14. The system of claim 9, wherein the one or more features include a pronoun person frequency feature that identifies a relative frequency of first, second, and third person pronouns among all pronouns in a given document.
15. The system of claim 9, wherein the one or more features include a pronoun person frequency feature that identifies a frequency of pronouns in a given document.
16. The system of claim 9, wherein the one or more features include a punctuation frequency that identifies a relative frequency of different punctuation types in a given document.
17. One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining a collection of training documents, the training documents including a group of positive documents and a group of negative documents, wherein the positive documents are training documents identified as being longform documents and the negative documents are training documents identified as not being longform documents; extracting a plurality of features from the training documents, wherein the plurality of features are associated with a plurality of different feature types that represent lexical or textual content of the training documents that are indicative of a document's writing style; generating a longform document classifier trained using feature instances extracted from the training documents, wherein the generated longform document classifier is trained such that input documents are classified as being longform documents or classified as not being longform documents; applying the longform document classifier to a corpus of documents; annotating an information retrieval index with an output classification for each document of the corpus of documents; and using the annotated index to provide search results identifying longform documents in response to a search query.
18. The one or more non-transitory computer readable media of claim 17, wherein the one or more features include a parse n-gram feature that indicates common sentence structures in the documents based on dependency parse trees.
19. The one or more non-transitory computer readable media of claim 17, wherein the one or more features include a parts of speech n-gram feature that indicates aspects of common sentence structures in the documents.
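Two of the claimed feature types (pronoun person frequency and punctuation frequency) and the index-annotation step of claim 1 are concrete enough to sketch. The code below is only an illustration under stated assumptions: the pronoun lists are hypothetical and incomplete, tokenization is naive whitespace splitting, the "index" is a plain dictionary, and the stand-in classifier passed to `annotate_index` is a word-count threshold rather than a trained model. None of these names or choices come from the patent.

```python
import string
from collections import Counter

# Hypothetical, incomplete pronoun lists; a real system would more likely
# rely on a part-of-speech tagger than on fixed word lists.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "hers", "it", "its",
              "they", "them", "their", "theirs"},
}

def pronoun_person_frequency(text: str) -> dict[str, float]:
    """Relative frequency of first/second/third person pronouns
    among all pronouns found in the text."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    counts = Counter()
    for tok in tokens:
        for person, words in PRONOUNS.items():
            if tok in words:
                counts[person] += 1
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in PRONOUNS}

def punctuation_frequency(text: str) -> dict[str, float]:
    """Relative frequency of each ASCII punctuation character
    among all punctuation characters in the text."""
    counts = Counter(ch for ch in text if ch in string.punctuation)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def annotate_index(index: dict, corpus: dict[str, str], classifier) -> dict:
    """Record a boolean 'longform' annotation for every document id."""
    for doc_id, text in corpus.items():
        index.setdefault(doc_id, {})["longform"] = bool(classifier(text))
    return index

def search_longform(index: dict, corpus: dict[str, str], query: str) -> list[str]:
    """Naive substring search restricted to documents annotated as longform."""
    q = query.lower()
    return [
        doc_id
        for doc_id, text in corpus.items()
        if q in text.lower() and index.get(doc_id, {}).get("longform")
    ]

# Invented corpus and stand-in classifier, for illustration only.
corpus = {
    "a": "short note on rivers",
    "b": "a long essay about rivers and the people who live beside them",
}
index = annotate_index({}, corpus, lambda text: len(text.split()) > 5)
hits = search_longform(index, corpus, "rivers")
pronouns = pronoun_person_frequency("I told you that she and I left.")
punct = punctuation_frequency("Wait, what?! No.")
```

Storing the classification in the index at annotation time means the classifier does not need to run at query time; the search step only filters on the stored flag.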

Assignees

Google Inc

Inventors

Classifications

G06F16/35
Clustering; Classification · CPC title
G06N20/00 · Primary
Machine learning · CPC title
G06F40/211 · Primary
Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title
G06K9/00469 · Primary
Physics · mapped topic
G06K9/66
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 59886571

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9773166B1 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying documents. One of the methods includes obtaining a collection of training documents, the training documents including positive documents identified as being longform documents and negative documents identified as not being longform documents; extracting one or more features from the t…
Who is the assignee on this patent?: Google Inc