Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

US11636355B2 · US · B2

Patent metadata
Publication number: US-11636355-B2
Application number: US-201916427225-A
Country: US
Kind code: B2
Filing date: May 30, 2019
Priority date: May 30, 2019
Publication date: Apr 25, 2023
Grant date: Apr 25, 2023

Abstract

Official abstract text for this publication.

Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.
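
The abstract refers to a stick-breaking construction of the HDP. As a rough illustration only, the Python sketch below draws truncated stick-breaking weights for a single Dirichlet process. The function name `stick_breaking_weights`, the truncation level, and the concentration value are assumptions made for this example and do not come from the patent; the sketch does not implement TMKGE or its inference method.

```python
import random


def stick_breaking_weights(concentration, num_sticks, seed=0):
    """Draw truncated stick-breaking weights for a Dirichlet process.

    Each weight is beta_k * prod_{j<k} (1 - beta_j), where
    beta_k ~ Beta(1, concentration). The last stick takes whatever
    mass remains, so the truncated weights sum to 1.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_sticks - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


# Hypothetical settings chosen only for illustration.
weights = stick_breaking_weights(concentration=2.0, num_sticks=10)
```

A smaller concentration value tends to put most of the mass on the first few sticks, which is one way to picture how a nonparametric model can use only as many topics as the data supports.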

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implement method for determining latent topics for a corpus of documents, the method comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.

2. The computer-implement method of claim 1 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.

3. The computer-implement method of claim 1 wherein model parameters for the topic model are learned using an online variational inference methodology.

4. The computer-implement method of claim 3 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.

5. The computer-implement method of claim 4 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.

6. The computer-implement method of claim 5 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.

7. The computer-implement method of claim 1 further comprising the step of: given a set of topic model parameters, using the topic model to generate words for a document.

8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.

9. The non-transitory computer-readable medium or media of claim 8 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.

10. The non-transitory computer-readable medium or media of claim 8 wherein model parameters for the topic model are learned using an online variational inference methodology.

11. The non-transitory computer-readable medium or media of claim 10 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.

12. The non-transitory computer-readable medium or media of claim 11 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.

13. The non-transitory computer-readable medium or media of claim 12 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.

14. The non-transitory computer-readable medium or media of claim 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a set of topic model parameters, using the topic model to generate words for a document.

15. A computing system comprising: at least one processor; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of docu
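
The online inference loop recited in claims 4 to 6 (initialize corpus-level parameters, sample documents at random, compute a natural gradient, update a learning rate, update corpus-level parameters) can be pictured with a much simpler toy model. The Python sketch below is an assumption-laden stand-in, not the patented TMKGE updates: it fits a single Dirichlet distribution over a vocabulary from word counts only, with no entities or topics. The names `online_update`, `eta`, `tau0`, `kappa`, and `batch_size`, and the learning-rate schedule `(tau0 + t) ** (-kappa)`, are choices made for this example.

```python
import random
from collections import Counter


def online_update(corpus, vocab, num_steps=2000, batch_size=1,
                  eta=0.1, tau0=1.0, kappa=0.7, seed=0):
    """Toy stochastic natural-gradient update for one Dirichlet over words."""
    rng = random.Random(seed)
    num_docs = len(corpus)
    # Initialize corpus-level variational parameters (one per vocabulary word).
    lam = {w: 1.0 for w in vocab}
    for t in range(num_steps):
        # Sample document(s) at random from the corpus.
        batch = [rng.choice(corpus) for _ in range(batch_size)]
        # Toy stand-in for the document-level update: collect word counts.
        counts = Counter()
        for doc in batch:
            counts.update(doc)
        # Update the learning rate; it decays as more steps are taken.
        rho = (tau0 + t) ** (-kappa)
        for w in vocab:
            # Estimate of the corpus-level parameter as if the whole corpus
            # looked like this batch (counts rescaled to corpus size).
            lam_hat = eta + num_docs * counts[w] / batch_size
            natural_gradient = lam_hat - lam[w]
            # Step along the natural gradient, scaled by the learning rate.
            lam[w] += rho * natural_gradient
    return lam


# Tiny hypothetical corpus, used only to exercise the loop.
corpus = [["a", "a", "b"], ["a", "c"], ["a", "b"]]
vocab = ["a", "b", "c"]
lam = online_update(corpus, vocab)
```

Raising `batch_size` averages the estimate over several documents per step, which reduces the noise in each update; this mirrors, in a loose way, the stability motivation stated in claims 6 and 13.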

Assignees

Inventors

Classifications

  • Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title

  • Semantic analysis · CPC title

  • G06F16/367 (Primary)

    Ontology · CPC title

  • Machine learning · CPC title

  • G06N5/04 (Primary)

    Inference or reasoning models · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11636355B2 cover?
Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling wi…
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/367. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 25, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).