What technology area does this patent fall under?

The primary CPC classification is G06F40/242 (Dictionaries). Mapped technology areas include Physics.

When was this patent published?

Publication date: Nov 1, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Automated formation of specialized dictionaries

US9483460B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9483460-B2
Application number	US-201314047502-A
Country	US
Kind code	B2
Filing date	Oct 7, 2013
Priority date	Oct 7, 2013
Publication date	Nov 1, 2016
Grant date	Nov 1, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A document analysis system analyzes a corpus of documents and automatically generates a dictionary of specialized phrases not already in conventional dictionaries. The dictionary generation process involves a series of operations on the phrases to identify the phrases most suitable for inclusion in a dictionary, such as phrase scoring and phrase clustering. The dictionary generation process also comprises the identification of one or more corresponding definitions for the various phrases identified for inclusion in the specialized dictionary.
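The abstract mentions identifying definitions for the selected phrases, and the dependent claims refer to a "linguistic pattern indicating a definition". The patent text on this page does not spell out the pattern, so the following is only a minimal sketch under assumptions: the cue words (`is`, `are`, `means`, `refers to`), the function name `extract_definitions`, and the rule that a definition ends at the next period are all hypothetical.

```python
import re

# Hypothetical cue words that may introduce a definition; not taken from the patent.
CUES = r"(?:is|are|means|refers to)"


def extract_definitions(documents, phrases):
    """Return {phrase: [definition, ...]} for phrases followed by a cue word.

    A definition is assumed to run from the cue word to the next period.
    Phrases with no match are omitted from the result.
    """
    definitions = {}
    for phrase in phrases:
        pattern = re.compile(
            r"\b" + re.escape(phrase) + r"\s+" + CUES + r"\s+([^.]+)\.",
            re.IGNORECASE,
        )
        for text in documents:
            for match in pattern.finditer(text):
                definitions.setdefault(phrase, []).append(match.group(1).strip())
    return definitions


if __name__ == "__main__":
    docs = [
        "A horcrux is an object in which a wizard hides part of a soul. Harry found one.",
        "Everyone chased the quaffle.",
    ]
    print(extract_definitions(docs, ["horcrux", "quaffle"]))
```

A real system would likely need sentence segmentation and more patterns (for example appositives or parenthetical glosses); the single regular expression here only illustrates the idea.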

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for automatically generating a specialized dictionary, comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score determined for the dictionary phrase; extracting, for each dictionary phrase, a definition from the document corpus; and storing each dictionary phrase and the definition extracted for the dictionary phrase.
2. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on whether the potential phrase is part of a linguistic pattern indicating a definition of the potential phrase.
3. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on capitalization of the potential phrase.
4. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on parts of speech of the potential phrases.
5. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster phrase clusters comprises determining co-occurrences of potential phrases from the plurality of potential phrases.
6. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster comprises at least one of: determining similarity of concepts represented by the plurality of potential phrases, and determining capitalization of the plurality of potential phrases.
7. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster comprises determining co-occurrences of a pair of potential phrases from the plurality of potential phrases within the document corpus and determining similarity of concepts represented by the pair of potential phrases, the method further comprising: forming a similarity measure between the pair of potential phrases using the determined co-occurrences and the determined similarity of concepts; and applying machine learning to determine values by which to weight the determined co-occurrences and the determined similarity of concepts in order to produce the similarity measure.
8. The computer-implemented method of claim 1, further comprising determining whether a first document in the document corpus is a dictionary, wherein assigning each of the plurality of potential phrases to a cluster comprises determining whether the potential phrase is present in the first document.
9. The computer-implemented method of claim 8, wherein determining whether the first document is a dictionary comprises applying a dictionary model template to the first document, the dictionary model template specifying a plurality of formatting properties of documents characteristic of dictionaries.
10. The computer-implemented method of claim 1, wherein selecting potential phrases as dictionary phrases comprises identifying positions of occurrences of the dictionary phrases within the plurality of documents.
11. A tangible computer-readable storage medium storing instructions that when executed by a processor cause the processor to perform steps comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases into a to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score determined for the dictionary phrase; extracting, for each dictionary phrase, a definition from the document corpus; and storing each dictionary phrase and the definition extracted for the selected dictionary phrase.
12. The computer-readable storage medium of claim 11, wherein extracting the plurality of the potential phrases comprises identifying ages of documents of the document corpus in which the plurality of the potential phrases occur.
13. The computer-readable storage medium of claim 11, wherein extracting the definition for a dictionary phrase comprises determining whether the dictionary phrase is part of a linguistic pattern indicating the definition of the dictionary phrase.
14. The computer-readable storage medium of claim 11, wherein the instructions further cause the processor to perform steps comprising: receiving a request to define a phrase associated with an electronic book; identifying the requested phrase within the dictionary phrases stored; and providing the stored definition associated with the requested phrase.
15. The computer-readable storage medium of claim 11, wherein assigning each of the plurality of potential phrases to a cluster comprises determining co-occurrences of a pair of potential phrases from the plurality of potential phrases within the document corpus and determining similarity of concepts represented by the pair of potential phrases, the method further comprising: forming a similarity measure between the pair of potential phrases using the determined co-occurrences and the determined similarity of concepts; and applying machine learning to determine values by which to weight the determined co-occurrences and the determined similarity of concepts in order to produce the similarity measure.
16. The computer-readable storage medium of claim 11, further comprising extracting each of the plurality of the potential phrases based on whether the potential phrase is part of a linguistic pattern indicating a definition of the potential phrase.
17. A computing device comprising: a computer processor; and a tangible computer-readable storage medium storing instructions executed by the computer processor to perform steps comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score deter
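The scoring and selection steps recited in claim 1 can be illustrated with a small sketch. It assumes the cluster assignment is already given, matches phrases by case-insensitive substring search, and turns the claimed "number of documents" into a fraction of the cluster's document set. The claim only says the score is based on that number, so the normalization, the 0.5 threshold, and the function names `score_phrases` and `select_dictionary_phrases` are hypothetical choices, not the patented method.

```python
from collections import defaultdict


def score_phrases(documents, cluster_of):
    """Score each phrase relative to the documents of its own cluster.

    documents:  list of document texts.
    cluster_of: {phrase: cluster_id}, assumed to be computed elsewhere.
    """
    lowered = [text.lower() for text in documents]

    # Indices of documents containing each phrase (substring match is a simplification).
    docs_with = {
        phrase: {i for i, text in enumerate(lowered) if phrase.lower() in text}
        for phrase in cluster_of
    }

    # For each cluster: documents that include at least one phrase assigned to it.
    cluster_docs = defaultdict(set)
    for phrase, cluster in cluster_of.items():
        cluster_docs[cluster] |= docs_with[phrase]

    # Score: share of the cluster's documents that include the phrase (assumed form).
    scores = {}
    for phrase, cluster in cluster_of.items():
        in_cluster = docs_with[phrase] & cluster_docs[cluster]
        total = len(cluster_docs[cluster])
        scores[phrase] = len(in_cluster) / total if total else 0.0
    return scores


def select_dictionary_phrases(scores, threshold=0.5):
    """Keep phrases whose score meets a (hypothetical) threshold."""
    return sorted(phrase for phrase, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    docs = [
        "the quaffle and the bludger",
        "a bludger hit him",
        "the bludger again",
        "tax form",
    ]
    clusters = {"quaffle": "sport", "bludger": "sport", "tax form": "finance"}
    scores = score_phrases(docs, clusters)
    print(scores)
    print(select_dictionary_phrases(scores))
```

Scoring against the cluster's own document set, rather than the whole corpus, means a phrase that is rare overall can still score highly if it is common among documents on its topic.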

Assignees

Google Inc

Inventors

Classifications

G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F40/247
Thesauruses; Synonyms · CPC title
G06F40/242 · Primary
Dictionaries · CPC title
G06F17/2735 · Primary
Physics · mapped topic
G06F17/277
Physics · mapped topic
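The classification symbols above follow the usual CPC layout of section, class, subclass, main group, and subgroup. The leading letter `G` is the CPC section for Physics, which is why these codes carry the "Physics" mapped topic. The sketch below splits a symbol into those parts. The regular expression, the `CpcSymbol` type, and the one-entry `SECTION_TITLES` table are illustrative assumptions, not part of this page's data model.

```python
import re
from typing import NamedTuple


class CpcSymbol(NamedTuple):
    section: str      # e.g. "G"
    cls: str          # e.g. "G06"
    subclass: str     # e.g. "G06F"
    main_group: str   # e.g. "40"
    subgroup: str     # e.g. "242"


# Assumed shape: section letter, two-digit class, subclass letter,
# main group digits, a slash, then subgroup digits.
CPC_RE = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")

# Only the section that appears on this page is listed here.
SECTION_TITLES = {"G": "Physics"}


def parse_cpc(symbol):
    """Split a CPC symbol such as 'G06F40/242' into its hierarchical parts."""
    match = CPC_RE.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognizable CPC symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return CpcSymbol(section, section + cls, section + cls + subclass, main_group, subgroup)


if __name__ == "__main__":
    for code in ["G06F40/284", "G06F40/247", "G06F40/242", "G06F17/2735", "G06F17/277"]:
        parsed = parse_cpc(code)
        print(code, parsed, SECTION_TITLES.get(parsed.section, "unknown"))
```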

Patent family

Related publications grouped by family.

View patent family 52777641

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9483460B2 cover? A document analysis system analyzes a corpus of documents and automatically generates a dictionary of specialized phrases not already in conventional dictionaries. The dictionary generation process involves a series of operations on the phrases to identify the phrases most suitable for inclusion in a dictionary, such as phrase scoring and phrase clustering. The dictionary generation process als…
Who is the assignee on this patent? Google Inc