What technology area does this patent fall under?

Primary CPC classification G06N7/01. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Sep 19 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Sparse and data-parallel inference method and system for the latent Dirichlet allocation model

US9767416B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9767416-B2
Application number	US-201514755312-A
Country	US
Kind code	B2
Filing date	Jun 30, 2015
Priority date	Feb 4, 2015
Publication date	Sep 19, 2017
Grant date	Sep 19, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Herein is described a data-parallel and sparse algorithm for topic modeling. This algorithm is based on a highly parallel algorithm for a Greedy Gibbs sampler. The Greedy Gibbs sampler is a Markov-Chain Monte Carlo algorithm that estimates topics, in an unsupervised fashion, by estimating the parameters of the topic model Latent Dirichlet Allocation (LDA). The Greedy Gibbs sampler is a data-parallel algorithm for topic modeling, and is configured to be implemented on a highly-parallel architecture, such as a GPU. The Greedy Gibbs sampler is modified to take advantage of data sparsity while maintaining the parallelism. Furthermore, in an embodiment, implementation of the Greedy Gibbs sampler uses both densely-represented and sparsely-represented matrices to reduce the amount of computation while maintaining fast accesses to memory for implementation on a GPU.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, comprising: computing probabilities of one or more topics, of a plurality of topics, for the particular word in the particular document based, at least in part, on a probability mass for one of: hyperparameters, the particular document, or the particular word; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words; and wherein the method is performed by one or more computing devices. 2. The method of claim 1 , wherein the Greedy Gibbs sampler has at least one non-collapsed variable. 3. The method of claim 1 , wherein running the Greedy Gibbs sampler over the Dirichlet distribution of the plurality of words, further comprises: prior to computing the probabilities of the one or more topics, calculating the probability mass for at least one of: hyperparameters, the particular document, or the particular word; wherein calculating the probability mass for at least one of hyperparameters, the particular document, or the particular word is based, at least in part, on calculating a mean of the Dirichlet distribution. 4. The method of claim 1 , wherein computing the probabilities of the one or more topics further comprises: computing a probability of a particular topic of the plurality of topics for the particular word in the particular document based, at least in part, on the probability mass for the particular document; wherein computing the probability of the particular topic for the particular word in the particular document comprises determining that the probability of the particular topic for the particular word in the particular document is zero based on the probability mass for the particular document being zero. 5. The method of claim 1 , wherein computing the probabilities of the one or more topics further comprises: computing a probability of a particular topic of the plurality of topics for the particular word in the particular document based, at least in part, on the probability mass for the particular word; wherein computing the probability of the particular topic for the particular word in the particular document comprises determining that the probability of the particular topic for the particular word in the particular document is zero based on the probability mass for the particular word being zero. 6. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, further comprising: computing probabilities of one or more topics, of a plurality of topics, being assigned to the particular word in the particular document based, at least in part, on both of: (a) one or more densely-represented matrices, and (b) one or more sparsely-represented matrices; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words; and wherein the method is performed by one or more computing devices. 7. The method of claim 6 , wherein both a matrix that represents topics per document and a matrix that represents words per topic are densely-represented matrices. 8. The method of claim 6 , wherein both a matrix representing phi values and a matrix representing theta values are sparsely-represented matrices. 9. The method of claim 8 , wherein: a particular topic, of the one or more topics, corresponds to a particular non-zero value in one of: the matrix representing the phi values, or the matrix representing the theta values; and computing the probabilities of the one or more topics being assigned to the particular word in the particular document comprises: computing a probability of the particular topic being assigned to the particular word based on inclusion of the particular non-zero value in a sparsely-represented matrix. 10. The method of claim 6 , wherein: the one or more sparsely-represented matrices are one or more particular matrices represented as sparsely-represented matrices; and the method further comprises: prior to computing probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document: representing a certain matrix of the one or more particular matrices as a densely-represented matrix; computing topic probabilities based, at least in part, on the densely-represented certain matrix; after computing topic probabilities based, at least in part, on the densely-represented certain matrix: converting representation of data in the certain matrix from a dense representation of the data to a sparse representation of the data to produce a sparsely-represented certain matrix; and after converting the representation of the data in the certain matrix, computing probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document based, at least in part, on the sparsely-represented certain matrix. 11. The method of claim 6 , wherein computing the probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document further comprises: determining, based on a filter that is specific to values for the particular document, whether a non-zero theta value, for a particular topic and for the particular document, is included in a theta matrix; and in response to determining that a non-zero theta value, for the particular topic and for the particular document, is included in the theta matrix, retrieving the non-zero theta value from the theta matrix. 12. The method of claim 11 , wherein the filter is implemented with a Bloom filter. 13. One or more non-transitory computer-readable media storing instructions which, when executed by one or more processors, cause performance of identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, comprising: computing probabilities of one or more topics, of a plurality of topics, for the particular word in the particular document based, at least in part, on a probability mass for one of: hyperparameters, the particular document, or the particular word; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words. 14. The one or more non-transitory computer-readable media of claim 13 , wherein the Greedy Gibbs sampler has at least one non-collapsed variable. 15. The one or more non-tran

Assignees

Oracle Int Corp

Inventors

Classifications

G06N7/01Primary
Probabilistic graphical models, e.g. probabilistic networks · CPC title
G06F16/355Primary
Creation or modification of classes or clusters · CPC title
G06F9/5066
Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title
G06F40/30
Semantic analysis · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

View patent family 56553131

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9767416B2 cover?: Herein is described a data-parallel and sparse algorithm for topic modeling. This algorithm is based on a highly parallel algorithm for a Greedy Gibbs sampler. The Greedy Gibbs sampler is a Markov-Chain Monte Carlo algorithm that estimates topics, in an unsupervised fashion, by estimating the parameters of the topic model Latent Dirichlet Allocation (LDA). The Greedy Gibbs sampler is a data-par…
Who is the assignee on this patent?: Oracle Int Corp
What technology area does this patent fall under?: Primary CPC classification G06N7/01. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Sep 19 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

How to read this patent

Abstract

First claim

Assignees

Inventors

Classifications

Patent family

External sources

Related patents

Method and system for distributed latent dirichlet allocation computation using addition of approximate counters

Learning Topics By Simulation Of A Stochastic Cellular Automaton

Sparse and data-parallel inference method and system for the latent dirichlet allocation model

Parallel gibbs sampler using butterfly-patterned partial sums

Method and system for latent dirichlet allocation computation using approximate counters

Data-parallel parameter estimation of the latent dirichlet allocation model by greedy gibbs sampling

Data-parallel probabilistic inference

Frequently asked questions