Generation of document classifiers
US-2018349388-A1 · Dec 6, 2018 · US
US11048870B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11048870-B2 |
| Application number | US-201715841703-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 14, 2017 |
| Priority date | Jun 7, 2017 |
| Publication date | Jun 29, 2021 |
| Grant date | Jun 29, 2021 |
Official abstract text for this publication.
A system and method performs automated domain concept discovery and clustering using word embeddings by receiving a set of documents for natural language processing for a domain, representing a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, applying a clustering algorithm based on a mutual information optimization criterion to form a set of clusters, associating each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculating a mutual information metric between each entry and each cluster using the evidence based model, and identifying a nominal center of each cluster by maximizing the mutual information.
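As a rough illustration of the steps named in the abstract, the sketch below computes a posterior over clusters for each entry and the resulting mutual information between entries and clusters. The abstract does not give the formulas, so the details here are assumptions: the softmax over dot products stands in for the "evidence based model", entries are treated as equally likely, and the names `posterior` and `mutual_information` and the toy vectors are hypothetical.

```python
import math

def posterior(entry, centers):
    """p(cluster | entry) as a softmax over dot products with each center (assumed model)."""
    scores = [sum(e * c for e, c in zip(entry, center)) for center in centers]
    top = max(scores)                                   # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def mutual_information(entries, centers):
    """I(entry; cluster) in nats, assuming every entry is equally likely."""
    posts = [posterior(entry, centers) for entry in entries]
    n = len(entries)
    # Marginal cluster probability p(cluster) is the average posterior over entries.
    marginal = [sum(p[k] for p in posts) / n for k in range(len(centers))]
    mi = 0.0
    for p in posts:
        for k, p_k in enumerate(p):
            if p_k > 0.0:                               # 0 * log(0) is treated as 0
                mi += p_k * math.log(p_k / marginal[k]) / n
    return mi

# Toy 2-D vectors standing in for word embeddings (hypothetical data).
entries = [[5.0, 0.0], [4.5, 0.2], [0.0, 5.0], [0.3, 4.8]]
centers = [[1.0, 0.0], [0.0, 1.0]]
print(mutual_information(entries, centers))
```

With two clusters the value is bounded above by ln 2, and it falls to zero when the centers coincide, since the posterior then carries no information about the entry.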
Opening claim text (preview).
The invention claimed is:
1. A method to perform automated domain concept discovery and clustering, the method comprising: receiving a set of documents for natural language processing for a domain; representing a plurality of entries in the set of documents as a plurality of continuous vector representations in a high dimensional continuous space; applying a log-linear model algorithm to the plurality of continuous vector representations; applying a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters; associating each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry; calculating a mutual information between each entry and each cluster using the evidence based model; identifying a nominal center of each cluster by maximizing a corresponding calculated mutual information of that cluster; calculating an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter; and moving the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.
2. The method of claim 1, wherein a continuous vector representation of the plurality of continuous vector representations uses a pre-trained word embedding.
3. The method of claim 2, wherein the each entry is selected from a group consisting of an entity, a concept, and a relationship.
4. The method of claim 3, wherein the group is communicated in a knowledge graph.
5. The method of claim 1, wherein the documents include corpus or a body of works.
6. The method of claim 1, wherein the entries include words or phrases.
7. The method of claim 1, wherein the evidence based model includes a posterior probability model.
8. The method of claim 1, wherein representing the entries includes extracting a unique word.
9. The method of claim 1, wherein representing the entries includes mapping a word.
10. The method of claim 1, further comprising adding an entity synonym.
11. The method of claim 1, further comprising adding a slot synonym.
12. The method of claim 1, wherein associating each entry further comprises mapping each cluster to each entry.
13. The method of claim 1, further comprising identifying a theme based on the set of clusters.
14. The method of claim 1, wherein identifying the nominal center of each cluster by maximizing the mutual information includes determining a cosine distance between each entity included in each cluster.
15. An apparatus comprising: a memory storing program code; and a processor configured to access the memory and execute the program code to receive a set of documents for natural language processing for a domain, represent a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, apply a log-linear model algorithm to the continuous vectors, apply a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters, associate each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculate a mutual information between each entry and each cluster using the evidence based model, identify a nominal center of each cluster in the set of clusters by maximizing a corresponding calculated mutual information of that cluster, calculate an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter, and move the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.
16. The apparatus of claim 15, wherein the mutual information optimization criterion includes pointwise mutual information.
17. The apparatus of claim 15, wherein the processor is further configured to extract a word included in the set of documents.
18. The apparatus of claim 17, wherein the processor is further configured to compile a lexicon using the word.
19. A program product to perform automated domain concept discovery and clustering, the program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code being executable by a processor to receive a set of documents for natural language processing for a domain, represent a plurality of entries in the set of documents as continuous vectors in a high dimensional continuous space, apply a log-linear model algorithm to the continuous vectors, apply a clustering algorithm to discover domain concepts based on a mutual information optimization criterion to form a set of clusters, associate each entry of the plurality of entries with each cluster in the set of clusters through formalizing an evidence based model of each cluster given each entry, calculate a mutual information between each entry and each cluster using the evidence based model, identify a nominal center of each cluster in the set of clusters by maximizing a corresponding calculated mutual information of that cluster, calculate an iterative derivative gradient with respect to the plurality of continuous vector representations via chain-rule backpropagation, wherein a number of iterations in calculating the iterative derivate gradient is controlled via a learning-rate hyperparameter, and move the nominal center of each cluster along the iterative derivative gradient to thereby reduce a value of an objective function of the clustering algorithm via a pure gradient descent approach.
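The gradient-descent element recited in the independent claims can be pictured with a small sketch. It is only one possible reading, under stated assumptions: the objective is taken to be the negative mutual information between entries and clusters, the posterior is a softmax over dot products, the gradient is taken with respect to the cluster centers (the claims recite a gradient with respect to the vector representations), and a fixed step count is used alongside the learning rate. The names `loss_and_grad` and `descend` and the toy data are hypothetical.

```python
import math

TINY = 1e-300  # floor that keeps log() defined if a probability underflows to zero

def softmax(scores):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss_and_grad(entries, centers):
    """Return -I(entry; cluster) and its gradient with respect to each center."""
    n, k, dim = len(entries), len(centers), len(centers[0])
    posts = [softmax([dot(x, m) for m in centers]) for x in entries]
    marginal = [max(sum(p[c] for p in posts) / n, TINY) for c in range(k)]
    mi = 0.0
    grad = [[0.0] * dim for _ in range(k)]
    for x, p in zip(entries, posts):
        log_ratio = [math.log(max(p[c], TINY) / marginal[c]) for c in range(k)]
        expected = sum(p[c] * log_ratio[c] for c in range(k))
        mi += expected / n
        for c in range(k):
            # Chain rule: d(MI)/d(score) times d(score)/d(center), where the latter is x.
            d_score = p[c] * (log_ratio[c] - expected) / n
            for j in range(dim):
                grad[c][j] -= d_score * x[j]            # minus sign because the loss is -MI
    return -mi, grad

def descend(entries, centers, learning_rate=0.5, steps=50):
    """Plain gradient descent on the centers; returns the final centers and the loss trace."""
    centers = [list(m) for m in centers]                # leave the caller's list untouched
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(entries, centers)
        losses.append(loss)
        for c, g in enumerate(grad):
            centers[c] = [m - learning_rate * gj for m, gj in zip(centers[c], g)]
    return centers, losses

# Toy 2-D vectors standing in for word embeddings (hypothetical data).
entries = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
start = [[0.1, 0.0], [0.0, 0.1]]
final, losses = descend(entries, start)
print(losses[0], losses[-1])
```

On this toy data each step moves a center toward the group of entries it already favors, so the loss recorded at the last step is lower than at the first. A library with automatic differentiation would normally replace the hand-derived gradient.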
Semantic analysis · CPC title
Clustering; Classification · CPC title
Document management systems · CPC title
Querying (for retrieval from the web G06F16/953) · CPC title
Parsing · CPC title