Expert knowledge platform
US-2019057310-A1 · Feb 21, 2019 · US
US11636355B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11636355-B2 |
| Application number | US-201916427225-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 30, 2019 |
| Priority date | May 30, 2019 |
| Publication date | Apr 25, 2023 |
| Grant date | Apr 25, 2023 |
Official abstract text for this publication.
Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.
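The stick-breaking construction mentioned in the abstract can be illustrated with a minimal sketch. It draws truncated stick-breaking weights for a single Dirichlet process. The function name, the concentration value, and the truncation level are assumptions chosen for illustration and are not taken from the patent.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Draw truncated stick-breaking weights for one Dirichlet process.

    Each step breaks off a Beta(1, concentration) fraction of the stick
    that is still left. The final piece takes whatever remains, so the
    weights sum to 1 at the chosen truncation level.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


# Illustrative values only: concentration 1.0 and 10 atoms are assumptions.
rng = random.Random(0)
corpus_weights = stick_breaking_weights(concentration=1.0, truncation=10, rng=rng)
```

In a hierarchical model of the kind the abstract describes, weights like these would sit at the corpus level, with the document-level processes for words and for entity embeddings drawing on the same shared atoms. That layering is not shown here.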
Opening claim text (preview).
What is claimed is:
1. A computer-implement method for determining latent topics for a corpus of documents, the method comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.
2. The computer-implement method of claim 1 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.
3. The computer-implement method of claim 1 wherein model parameters for the topic model are learned using an online variational inference methodology.
4. The computer-implement method of claim 3 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.
5. The computer-implement method of claim 4 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.
6. The computer-implement method of claim 5 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.
7. The computer-implement method of claim 1 further comprising the step of: given a set of topic model parameters, using the topic model to generate words for a document.
8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.
9. The non-transitory computer-readable medium or media of claim 8 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.
10. The non-transitory computer-readable medium or media of claim 8 wherein model parameters for the topic model are learned using an online variational inference methodology.
11. The non-transitory computer-readable medium or media of claim 10 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.
12. The non-transitory computer-readable medium or media of claim 11 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.
13. The non-transitory computer-readable medium or media of claim 12 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.
14. The non-transitory computer-readable medium or media of claim 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a set of topic model parameters, using the topic model to generate words for a document.
15. A computing system comprising: at least one processor; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of docu
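The iterative procedure recited in claims 4 and 5 (sample a document, update document-level parameters, compute natural gradients, update a learning rate, update corpus-level parameters) can be sketched as a generic loop. The claims do not state the update formulas, so this sketch assumes the common stochastic variational inference form, with a decaying step size `(tau0 + t) ** -kappa`. All names and constants are hypothetical, and the word-related and entity-related updates are collapsed into a single placeholder `local_step`.

```python
import random


def learning_rate(step, tau0=1.0, kappa=0.7):
    """Decaying step size (tau0 + t)^(-kappa). The constants are assumptions."""
    return (tau0 + step) ** (-kappa)


def online_update(corpus, init_params, local_step, num_steps, seed=0):
    """Generic online loop in the shape of claims 4 and 5 (illustrative only)."""
    rng = random.Random(seed)
    params = list(init_params)              # initialize corpus-level parameters
    for step in range(1, num_steps + 1):    # stop condition: fixed step budget
        doc = rng.choice(corpus)            # sample a document at random
        # Document-level update; returns a noisy corpus-level target.
        target = local_step(doc, params, len(corpus))
        # Natural gradient, assumed here to be target minus current value.
        nat_grad = [t - p for t, p in zip(target, params)]
        rho = learning_rate(step)           # update the learning rate
        # Corpus-level update using the natural gradient and learning rate.
        params = [p + rho * g for p, g in zip(params, nat_grad)]
    return params


def toy_local_step(doc_counts, params, corpus_size, prior=0.1):
    """Hypothetical stand-in: one topic over a 3-word vocabulary, so the
    target is the prior plus the document's counts scaled to corpus size."""
    return [prior + corpus_size * c for c in doc_counts]


corpus = [[2, 0, 1], [0, 3, 0], [1, 1, 1]]
params = online_update(corpus, [1.0, 1.0, 1.0], toy_local_step, num_steps=200)
```

The toy `local_step` ignores the current parameters and has no document-level latent variables, so it only exercises the control flow. Claims 6 and 13 describe computing the natural gradients from a batch of documents, which would correspond to averaging `target` over several sampled documents before the corpus-level update.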
Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title
Semantic analysis · CPC title
Ontology · CPC title
Machine learning · CPC title
Inference or reasoning models · CPC title