System and Method for Parsing Regulatory and Other Documents for Machine Scoring Background
US-2024296188-A1 · Sep 5, 2024 · US
US9483460B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9483460-B2 |
| Application number | US-201314047502-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 7, 2013 |
| Priority date | Oct 7, 2013 |
| Publication date | Nov 1, 2016 |
| Grant date | Nov 1, 2016 |
Official abstract text for this publication.
A document analysis system analyzes a corpus of documents and automatically generates a dictionary of specialized phrases not already in conventional dictionaries. The dictionary generation process involves a series of operations on the phrases to identify the phrases most suitable for inclusion in a dictionary, such as phrase scoring and phrase clustering. The dictionary generation process also comprises the identification of one or more corresponding definitions for the various phrases identified for inclusion in the specialized dictionary.
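The abstract says definitions are identified for the selected phrases, and claims 2 and 13 refer to a "linguistic pattern indicating a definition". The following is a minimal Python sketch of one such pattern, not the patented method. The cue words ("means", "refers to", "is defined as", "shall mean"), the capitalized-phrase heuristic, and the names `DEFINITION_PATTERN` and `extract_definitions` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the patent):
# an optionally quoted run of capitalized words, a cue such as "means",
# then the definition text up to the next period.
DEFINITION_PATTERN = re.compile(
    r'"?\b(?P<phrase>[A-Z][\w-]*(?: [A-Z][\w-]*)*)"?'
    r'\s+(?:means|refers to|is defined as|shall mean)\s+'
    r'(?P<definition>[^.]+)\.'
)


def extract_definitions(text):
    """Return {phrase: definition} for every match of the pattern in text."""
    return {
        m.group("phrase"): m.group("definition").strip()
        for m in DEFINITION_PATTERN.finditer(text)
    }


if __name__ == "__main__":
    sample = (
        'The fund reports quarterly. "Net Asset Value" means the total '
        "assets minus liabilities. Force Majeure refers to events beyond "
        "a party's control."
    )
    for phrase, definition in extract_definitions(sample).items():
        print(f"{phrase}: {definition}")
```

A single regular expression is deliberately simple. It misses definitions that span sentences or contain periods, so a real system would likely need richer patterns or a parser.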
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for automatically generating a specialized dictionary, comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score determined for the dictionary phrase; extracting, for each dictionary phrase, a definition from the document corpus; and storing each dictionary phrase and the definition extracted for the dictionary phrase.
2. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on whether the potential phrase is part of a linguistic pattern indicating a definition of the potential phrase.
3. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on capitalization of the potential phrase.
4. The computer-implemented method of claim 1, further comprising extracting each of the plurality of the potential phrases based on parts of speech of the potential phrases.
5. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster phrase clusters comprises determining co-occurrences of potential phrases from the plurality of potential phrases.
6. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster comprises at least one of: determining similarity of concepts represented by the plurality of potential phrases, and determining capitalization of the plurality of potential phrases.
7. The computer-implemented method of claim 1, wherein assigning each of the plurality of potential phrases to a cluster comprises determining co-occurrences of a pair of potential phrases from the plurality of potential phrases within the document corpus and determining similarity of concepts represented by the pair of potential phrases, the method further comprising: forming a similarity measure between the pair of potential phrases using the determined co-occurrences and the determined similarity of concepts; and applying machine learning to determine values by which to weight the determined co-occurrences and the determined similarity of concepts in order to produce the similarity measure.
8. The computer-implemented method of claim 1, further comprising determining whether a first document in the document corpus is a dictionary, wherein assigning each of the plurality of potential phrases to a cluster comprises determining whether the potential phrase is present in the first document.
9. The computer-implemented method of claim 8, wherein determining whether the first document is a dictionary comprises applying a dictionary model template to the first document, the dictionary model template specifying a plurality of formatting properties of documents characteristic of dictionaries.
10. The computer-implemented method of claim 1, wherein selecting potential phrases as dictionary phrases comprises identifying positions of occurrences of the dictionary phrases within the plurality of documents.
11. A tangible computer-readable storage medium storing instructions that when executed by a processor cause the processor to perform steps comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases into a to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score determined for the dictionary phrase; extracting, for each dictionary phrase, a definition from the document corpus; and storing each dictionary phrase and the definition extracted for the selected dictionary phrase.
12. The computer-readable storage medium of claim 11, wherein extracting the plurality of the potential phrases comprises identifying ages of documents of the document corpus in which the plurality of the potential phrases occur.
13. The computer-readable storage medium of claim 11, wherein extracting the definition for a dictionary phrase comprises determining whether the dictionary phrase is part of a linguistic pattern indicating the definition of the dictionary phrase.
14. The computer-readable storage medium of claim 11, wherein the instructions further cause the processor to perform steps comprising: receiving a request to define a phrase associated with an electronic book; identifying the requested phrase within the dictionary phrases stored; and providing the stored definition associated with the requested phrase.
15. The computer-readable storage medium of claim 11, wherein assigning each of the plurality of potential phrases to a cluster comprises determining co-occurrences of a pair of potential phrases from the plurality of potential phrases within the document corpus and determining similarity of concepts represented by the pair of potential phrases, the method further comprising: forming a similarity measure between the pair of potential phrases using the determined co-occurrences and the determined similarity of concepts; and applying machine learning to determine values by which to weight the determined co-occurrences and the determined similarity of concepts in order to produce the similarity measure.
16. The computer-readable storage medium of claim 11, further comprising extracting each of the plurality of the potential phrases based on whether the potential phrase is part of a linguistic pattern indicating a definition of the potential phrase.
17. A computing device comprising: a computer processor; and a tangible computer-readable storage medium storing instructions executed by the computer processor to perform steps comprising: extracting a plurality of potential phrases from a document corpus including a plurality of documents; assigning each of the plurality of potential phrases to a cluster from a plurality of clusters; identifying, for each of the plurality of clusters, a set of documents from the plurality of documents that include at least one potential phrase assigned to the cluster; determining, for each of the plurality of potential phrases, a score based on a number of documents that include the potential phrase from the set of documents identified for the cluster to which the potential phrase is assigned; selecting, from the plurality of potential phrases, potential phrases as dictionary phrases, each dictionary phrase selected based on the score deter
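The scoring and selection steps recited in claim 1 can be illustrated with a short Python sketch. It assumes phrases have already been extracted and assigned to clusters. The substring matching, the normalization of the score to a fraction, the `threshold` parameter, and the function names `score_phrases` and `select_dictionary_phrases` are assumptions for illustration; the claim only requires a score "based on a number of documents" within the cluster's document set.

```python
from collections import defaultdict


def score_phrases(docs, phrase_to_cluster):
    """Score each phrase against the documents of its own cluster.

    docs: dict mapping document id -> text.
    phrase_to_cluster: dict mapping phrase -> cluster id (assumed given).
    """
    # Documents that contain each phrase (case-insensitive substring match).
    docs_with = {
        phrase: {d for d, text in docs.items() if phrase.lower() in text.lower()}
        for phrase in phrase_to_cluster
    }
    # For each cluster: documents containing at least one phrase of the cluster.
    cluster_docs = defaultdict(set)
    for phrase, cluster in phrase_to_cluster.items():
        cluster_docs[cluster] |= docs_with[phrase]
    # Assumed score: share of the cluster's documents that include the phrase.
    scores = {}
    for phrase, cluster in phrase_to_cluster.items():
        total = len(cluster_docs[cluster])
        scores[phrase] = len(docs_with[phrase]) / total if total else 0.0
    return scores


def select_dictionary_phrases(scores, threshold=0.5):
    """Keep phrases whose score meets a hypothetical threshold."""
    return sorted(p for p, s in scores.items() if s >= threshold)


if __name__ == "__main__":
    docs = {
        "d1": "The Net Asset Value is computed daily.",
        "d2": "Net Asset Value and Expense Ratio are disclosed.",
        "d3": "The Expense Ratio rose.",
        "d4": "A Force Majeure event suspends obligations.",
    }
    clusters = {
        "Net Asset Value": "finance",
        "Expense Ratio": "finance",
        "Force Majeure": "legal",
    }
    scores = score_phrases(docs, clusters)
    print(scores)
    print(select_dictionary_phrases(scores, threshold=0.5))
```

Scoring against the cluster's documents rather than the whole corpus means a phrase is compared only with documents on the same topic, so a term that is rare overall can still score well within its own specialty.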
Lexical analysis, e.g. tokenisation or collocates · CPC title
Thesauruses; Synonyms · CPC title
Dictionaries · CPC title
Physics · mapped topic