Identifying key terms related to similar passages

US9323827B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9323827-B2
Application number  US-2284208-A
Country             US
Kind code           B2
Filing date         Jan 30, 2008
Priority date       Jul 20, 2007
Publication date    Apr 26, 2016
Grant date          Apr 26, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Key terms for similar passages from a large corpus are identified and used to enhance searching and browsing the corpus. The corpus contains multiple documents such as the text of books. Browsing by concept is supported by identifying a set of similar passages or quotations in documents stored in the corpus and assigning key terms to passages which links conceptually related passages together. The context of each passage instance is identified and can include, for example, the text surrounding the passage. The contexts of all similar passage instances are analyzed in order to identify key terms for the similar passage. The related key terms are analyzed to identify relationships among the key terms from different similar passage sets. The key terms can be used as a basis for navigating the documents in the corpus. The key terms enable browsing the documents in the corpus by concepts referenced in the documents.
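The pipeline the abstract describes (find each passage instance, collect the surrounding words, and keep those that match known terms) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names, the whitespace tokenizer, the default window sizes, and the use of exact matching in place of "similar" passage detection are all assumptions made for the example.

```python
from collections import defaultdict


def find_instances(words, passage_words):
    """Return start offsets where passage_words occurs verbatim in words.

    Assumption: exact matching stands in for similar-passage detection.
    """
    n = len(passage_words)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == passage_words]


def extract_context(words, start, end, before, after):
    """Words within `before` words preceding and `after` words following words[start:end]."""
    return words[max(0, start - before):start] + words[end:end + after]


def key_terms_for_passage(documents, passage, terms_db, before=4, after=4):
    """Map each matching key term to the passage instances whose context contains it.

    An instance is identified as (document index, word offset). `terms_db` is a
    hypothetical set of allowed key terms.
    """
    passage_words = passage.lower().split()
    term_instances = defaultdict(set)
    for doc_index, text in enumerate(documents):
        words = text.lower().split()
        for start in find_instances(words, passage_words):
            end = start + len(passage_words)
            for word in extract_context(words, start, end, before, after):
                if word in terms_db:
                    term_instances[word].add((doc_index, start))
    return dict(term_instances)
```

Keeping the set of instances per term is what would let a user interface show only the passage instances tied to a selected key term. The pre-context and post-context windows are separate parameters, so they may differ in size.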

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method of identifying at least one key term related to a similar passage, comprising: identifying a plurality of documents stored in a corpus, wherein each identified document contains an instance of the similar passage; for each similar passage instance within the identified documents, extracting each word that appears within a threshold number of words before the similar passage instance within an identified document and each word that appears within a threshold number of words after the similar passage instance within the identified document, the extracted words associated with the similar passage instance; combining the extracted words associated with each similar passage instance to form a context aggregation; determining a plurality of key terms related to the similar passage based on the context aggregation, each key term associated with a subset of the similar passage instances, at least one key term determined by comparing words within the context aggregation to a terms database specifying possible key terms and extracting a word within the context aggregation that matches a term in the terms database; presenting each of one or more key terms as a hyperlink in a user interface; receiving a selection of a key term presented as a hyperlink; and presenting the subset of similar passage instances associated with the selected key term in the user interface.

2. The method of claim 1, wherein determining at least one key term comprises performing a TF-IDF analysis of the context aggregation to determine the at least one key term.

3. The method of claim 1, wherein determining at least one key term comprises: generating candidate n-grams based on the context aggregation; and performing a TF-IDF analysis of the candidate n-grams to determine the at least one key term.

4. The method of claim 1, further comprising: combining words from metadata describing individual ones of the plurality of documents containing instances of the similar passage with the context aggregation; wherein determining at least one key term related to the similar passage is based at least in part on the metadata.

5. The method of claim 1, wherein first and second key terms are determined for the similar passage, further comprising: determining a relationship between the first and second key terms of the similar passage.

6. The method of claim 5, wherein there exists a plurality of other similar passages, with each other similar passage having an associated set of key terms, and determining a relationship comprises: determining whether the first and second key terms are co-located in a set of key terms associated with another similar passage; declaring that the first and second key terms of the similar passage are related responsive to a positive determination that the first and second key terms are co-located in a set of key terms associated with the other similar passage.

7. The method of claim 1, wherein the extracting comprises: identifying a pre-context for the similar passage instance comprising the words appearing within the threshold number of words before the similar passage instance; identifying a post-context for the similar passage instance comprising the words appearing within the threshold number of words after the similar passage instance; and forming a context associated with the similar passage instance by combining the pre-context and the post-context for the similar passage instance; wherein combining the extracted words comprises combining a plurality of contexts associated with a plurality of instances of the similar passage.

8. The method of claim 1, wherein the threshold number of words before the similar passage instance is different than the threshold number of words after the similar passage instance.

9. The method of claim 1, further comprising: determining a plurality of key terms related to the similar passage based on the context aggregation; assigning scores to the plurality of key terms; selecting a subset of the plurality of key terms responsive to the assigned scores; and presenting for display the selected subset of the plurality of key terms in association with the similar passage.

10. The method of claim 1, wherein presenting a key term as a hyperlink comprises presenting text associated with the key term and presenting a number of similar passage instances in the subset of similar passage instances associated with the key term.

11. The method of claim 1, wherein the subset of similar passage instances associated with the selected key term comprises less than all similar passage instances.

12. A non-transitory computer-readable storage medium containing executable program code for identifying at least one key term related to a similar passage, comprising: program code for identifying a plurality of documents stored in a corpus, wherein each identified document contains an instance of the similar passage; program code for, for each similar passage instance within the identified documents, extracting each word that appears within a threshold number of words before the similar passage instance within an identified document and each word that appears within a threshold number of words after the similar passage instance within the identified document, the extracted words associated with the similar passage instance; program code for combining the extracted words associated with each similar passage instance to form a context aggregation; program code for determining a plurality of key terms related to the similar passage based on the context aggregation, each key term associated with a subset of the similar passage instances, at least one key term determined by comparing words within the context aggregation to a terms database specifying possible key terms and extracting a word within the context aggregation that matches a term in the terms database; program code for presenting each of one or more key terms as a hyperlink in a user interface; program code for receiving a selection of a key term presented as a hyperlink; and program code for presenting the subset of similar passage instances associated with the selected key term in the user interface.

13. The non-transitory computer-readable storage medium of claim 12, wherein the program code for determining at least one key term further comprises: program code for performing a TF-IDF analysis of the context aggregation to determine the at least one key term.

14. The non-transitory computer-readable storage medium of claim 12, wherein the program code for determining at least one key term further comprises: program code for generating candidate n-grams based on the context aggregation; and program code for performing a TF-IDF analysis of the candidate n-grams to determine the at least one key term.

15. The non-transitory computer-readable storage medium of claim 12, further comprising: program code for combining words from metadata describing individual ones of the plurality of documents containing instances of the similar passage with the context aggregation; wherein determining at least one key term related to the similar passage is based at least in part on the metadata.

16. The non-transitory computer-readable storage medium of claim 12, wherein first and second key terms are determined for the similar passage, further comprising: program code for determining a relationship between the first and second key terms of the similar passage.

17. The non-transitory computer-readable storage medium of claim 16, wherein there exist
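
The dependent claims mention generating candidate n-grams from the context aggregation, scoring them with a TF-IDF analysis, and selecting a subset by score. A rough sketch of that idea is shown here. The claims do not say which document collection supplies the inverse document frequency, so the `background` argument (contexts gathered for other passages), the smoothing choice, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(words, max_n=2):
    """All contiguous n-grams of length 1..max_n, joined with spaces."""
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf_key_terms(context_aggregation, background, max_n=2, top_k=3):
    """Rank candidate n-grams from one context aggregation by TF-IDF.

    context_aggregation: list of words pooled from every instance's context.
    background: list of word lists (hypothetical contexts of other passages).
    Returns up to top_k (n-gram, score) pairs, highest score first.
    """
    term_freq = Counter(ngrams(context_aggregation, max_n))
    background_sets = [set(ngrams(doc, max_n)) for doc in background]
    n_docs = len(background_sets) + 1  # background plus the target aggregation
    scores = {}
    for gram, tf in term_freq.items():
        # The target aggregation itself counts as one document containing gram.
        doc_freq = 1 + sum(gram in grams for grams in background_sets)
        scores[gram] = tf * math.log(n_docs / doc_freq)
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

An n-gram that appears in every background context gets a score of zero under this weighting, so only terms distinctive to the passage's own contexts tend to rank highly.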

Assignees

Inventors

Classifications

  • using natural language analysis · CPC title

  • G06F16/313 · Primary

    Selection or weighting of terms for indexing · CPC title

  • Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9323827B2 cover?
Key terms for similar passages from a large corpus are identified and used to enhance searching and browsing the corpus. The corpus contains multiple documents such as the text of books. Browsing by concept is supported by identifying a set of similar passages or quotations in documents stored in the corpus and assigning key terms to passages which links conceptually related passages together. …
Who is the assignee on this patent?
Schilit William N, Kolak Okan, Google Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/313. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 26, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).