What technology area does this patent fall under?

Primary CPC classification G06F16/367. Mapped technology areas include Physics.

When was this patent published?

Published April 25, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

US11636355B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11636355-B2
Application number	US-201916427225-A
Country	US
Kind code	B2
Filing date	May 30, 2019
Priority date	May 30, 2019
Publication date	Apr 25, 2023
Grant date	Apr 25, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.
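
The abstract mentions a stick-breaking construction of the HDP. As a rough illustration only, the sketch below draws truncated stick-breaking weights for a single Dirichlet process using the Python standard library. The function name, the truncation level, and the concentration value are assumptions made for this example; the patent's own two-level construction is not shown on this page and is not reproduced here.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Draw truncated stick-breaking weights for one Dirichlet process.

    Each break fraction v_k is drawn from Beta(1, concentration). Weight k is
    v_k times the stick length left over after the earlier breaks.
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    # The last atom absorbs the leftover stick so the weights sum to 1.
    weights.append(remaining)
    return weights


# Hypothetical settings chosen only for illustration.
rng = random.Random(0)
corpus_weights = stick_breaking_weights(concentration=1.0, truncation=10, rng=rng)
```

A smaller concentration value tends to put most of the mass on the first few atoms, which corresponds to a model that prefers fewer topics.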

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implement method for determining latent topics for a corpus of documents, the method comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.
2. The computer-implement method of claim 1 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.
3. The computer-implement method of claim 1 wherein model parameters for the topic model are learned using an online variational inference methodology.
4. The computer-implement method of claim 3 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.
5. The computer-implement method of claim 4 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.
6. The computer-implement method of claim 5 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.
7. The computer-implement method of claim 1 further comprising the step of: given a set of topic model parameters, using the topic model to generate words for a document.
8. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of documents; and for each document in the corpus of documents, generating a word frequency representation for words from the document; using the entity embeddings and the word frequency representations as an input to a topic model to generate latent topics for the corpus of documents, the topic model comprising: a corpus-level Dirichlet process that uses the word frequency representations and the entity embeddings to obtain a shared base measure that is used as a prior for two document-level Dirichlet processes; a first document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for words; a second document-level Dirichlet process that uses the shared base measure as a prior to generate a latent topic distribution for entity embeddings; a word generative process that uses the latent topic distribution for words and a word-level Dirichlet process to assign latent topics to words; and using the latent topic distribution for entity embeddings and a distribution to assign latent topics to entity embeddings; and assigning one or more topics to one or more documents from the corpus of documents using at least one or more of the latent topics generated by the topic model.
9. The non-transitory computer-readable medium or media of claim 8 wherein each atom of the corpus-level Dirichlet process corresponds to a set of parameters for both words and entities.
10. The non-transitory computer-readable medium or media of claim 8 wherein model parameters for the topic model are learned using an online variational inference methodology.
11. The non-transitory computer-readable medium or media of claim 10 wherein the step of using an online variational inference methodology to learn the model parameters for the topic model comprises: initialize corpus-level variational parameters of the topic model; and iteratively performing the following steps until a stop condition has been met: sampling a document at random from the corpus of documents; updating document-level variational parameters of the topic model; and updating corpus-level parameters.
12. The non-transitory computer-readable medium or media of claim 11 wherein the steps of updating document-level variational parameters of the topic model, and updating corpus-level parameters comprise the steps of: updating word-related variational parameters of the topic model; updating entity-related variational parameters of the topic model; computing natural gradients using word-related variational parameters and entity-related variational parameters; updating a learning rate parameter; and updating corpus-level parameters of the topic model using the natural gradients and the learning rate parameter.
13. The non-transitory computer-readable medium or media of claim 12 wherein word-related variational parameters and entity-related variational parameters from a batch of documents are used to compute the natural gradients for the topic model to improve stability of the online variational inference methodology.
14. The non-transitory computer-readable medium or media of claim 8 further comprising one or more sequences of instructions which, when executed by at least one processor, causes steps to be performed comprising: given a set of topic model parameters, using the topic model to generate words for a document.
15. A computing system comprising: at least one processor; and a non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: given a corpus of documents in which each document comprises words and entities: using entity embeddings obtained from a knowledge graph to represent entities in the corpus of docu
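
Claims 4 to 6 outline an online variational inference loop: sample documents, compute document-level quantities, then move the corpus-level parameters with a natural-gradient step scaled by a learning rate. The sketch below shows that loop shape on a deliberately tiny stand-in model, a single Dirichlet over a vocabulary with no per-document latent variables, so it is not the claimed topic model. The function names, the step-size schedule `(tau0 + step) ** (-kappa)`, the fixed step count used as the stop condition, and the toy corpus are all assumptions for illustration. For conjugate models of this kind, the natural-gradient step is equivalent to blending the current parameters with a batch estimate, which is what the code does.

```python
import random
from collections import Counter


def learning_rate(step, tau0=1.0, kappa=0.6):
    """Decaying step size; tau0 and kappa are assumed values."""
    return (tau0 + step) ** (-kappa)


def intermediate_estimate(doc_words, vocab, prior, corpus_size):
    """Corpus-level estimate as if the whole corpus looked like this document.

    The word frequency representation (a Counter) is scaled by the corpus size
    and added to the Dirichlet prior.
    """
    counts = Counter(doc_words)
    return [prior + corpus_size * counts[w] for w in vocab]


def run_online_updates(corpus, vocab, prior=0.1, batch_size=2, num_steps=200, seed=0):
    rng = random.Random(seed)
    params = [prior] * len(vocab)  # initialize corpus-level parameters
    for step in range(1, num_steps + 1):  # stop condition: fixed step count
        # Sample a small batch of documents at random from the corpus.
        batch = [rng.choice(corpus) for _ in range(batch_size)]
        estimates = [intermediate_estimate(d, vocab, prior, len(corpus)) for d in batch]
        # Averaging over the batch reduces the variance of the update.
        avg = [sum(col) / batch_size for col in zip(*estimates)]
        rho = learning_rate(step)
        # Natural-gradient step: params + rho * (avg - params).
        params = [(1.0 - rho) * p + rho * a for p, a in zip(params, avg)]
    return params


# Toy corpus, hypothetical data for illustration only.
corpus = [["topic", "model", "topic"], ["graph", "entity"], ["model", "graph", "graph"]]
vocab = ["entity", "graph", "model", "topic"]
params = run_online_updates(corpus, vocab)
```

In a fuller model the per-document step would first update word-related and entity-related variational parameters, as claim 5 lists, before forming the batch estimate; that part is omitted here.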

Assignees

Baidu USA LLC

Inventors

Classifications

G06F40/20
Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title
G06F40/30
Semantic analysis · CPC title
G06F16/367 · Primary
Ontology · CPC title
G06N20/00
Machine learning · CPC title
G06N5/04 · Primary
Inference or reasoning models · CPC title

Patent family

Related publications grouped by family.

View patent family 73506454

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11636355B2 cover?: Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling wi…
Who is the assignee on this patent?: Baidu USA LLC