Method and system for distributed latent dirichlet allocation computation using addition of approximate counters
US-2017039265-A1 · Feb 9, 2017 · US
US9767416B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9767416-B2 |
| Application number | US-201514755312-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 30, 2015 |
| Priority date | Feb 4, 2015 |
| Publication date | Sep 19, 2017 |
| Grant date | Sep 19, 2017 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Herein is described a data-parallel and sparse algorithm for topic modeling. This algorithm is based on a highly parallel algorithm for a Greedy Gibbs sampler. The Greedy Gibbs sampler is a Markov-Chain Monte Carlo algorithm that estimates topics, in an unsupervised fashion, by estimating the parameters of the topic model Latent Dirichlet Allocation (LDA). The Greedy Gibbs sampler is a data-parallel algorithm for topic modeling, and is configured to be implemented on a highly-parallel architecture, such as a GPU. The Greedy Gibbs sampler is modified to take advantage of data sparsity while maintaining the parallelism. Furthermore, in an embodiment, implementation of the Greedy Gibbs sampler uses both densely-represented and sparsely-represented matrices to reduce the amount of computation while maintaining fast accesses to memory for implementation on a GPU.
Opening claim text (preview).
What is claimed is: 1. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, comprising: computing probabilities of one or more topics, of a plurality of topics, for the particular word in the particular document based, at least in part, on a probability mass for one of: hyperparameters, the particular document, or the particular word; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words; and wherein the method is performed by one or more computing devices. 2. The method of claim 1 , wherein the Greedy Gibbs sampler has at least one non-collapsed variable. 3. The method of claim 1 , wherein running the Greedy Gibbs sampler over the Dirichlet distribution of the plurality of words, further comprises: prior to computing the probabilities of the one or more topics, calculating the probability mass for at least one of: hyperparameters, the particular document, or the particular word; wherein calculating the probability mass for at least one of hyperparameters, the particular document, or the particular word is based, at least in part, on calculating a mean of the Dirichlet distribution. 4. The method of claim 1 , wherein computing the probabilities of the one or more topics further comprises: computing a probability of a particular topic of the plurality of topics for the particular word in the particular document based, at least in part, on the probability mass for the particular document; wherein computing the probability of the particular topic for the particular word in the particular document comprises determining that the probability of the particular topic for the particular word in the particular document is zero based on the probability mass for the particular document being zero. 5. The method of claim 1 , wherein computing the probabilities of the one or more topics further comprises: computing a probability of a particular topic of the plurality of topics for the particular word in the particular document based, at least in part, on the probability mass for the particular word; wherein computing the probability of the particular topic for the particular word in the particular document comprises determining that the probability of the particular topic for the particular word in the particular document is zero based on the probability mass for the particular word being zero. 6. A method for identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, further comprising: computing probabilities of one or more topics, of a plurality of topics, being assigned to the particular word in the particular document based, at least in part, on both of: (a) one or more densely-represented matrices, and (b) one or more sparsely-represented matrices; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words; and wherein the method is performed by one or more computing devices. 7. The method of claim 6 , wherein both a matrix that represents topics per document and a matrix that represents words per topic are densely-represented matrices. 8. The method of claim 6 , wherein both a matrix representing phi values and a matrix representing theta values are sparsely-represented matrices. 9. The method of claim 8 , wherein: a particular topic, of the one or more topics, corresponds to a particular non-zero value in one of: the matrix representing the phi values, or the matrix representing the theta values; and computing the probabilities of the one or more topics being assigned to the particular word in the particular document comprises: computing a probability of the particular topic being assigned to the particular word based on inclusion of the particular non-zero value in a sparsely-represented matrix. 10. The method of claim 6 , wherein: the one or more sparsely-represented matrices are one or more particular matrices represented as sparsely-represented matrices; and the method further comprises: prior to computing probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document: representing a certain matrix of the one or more particular matrices as a densely-represented matrix; computing topic probabilities based, at least in part, on the densely-represented certain matrix; after computing topic probabilities based, at least in part, on the densely-represented certain matrix: converting representation of data in the certain matrix from a dense representation of the data to a sparse representation of the data to produce a sparsely-represented certain matrix; and after converting the representation of the data in the certain matrix, computing probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document based, at least in part, on the sparsely-represented certain matrix. 11. The method of claim 6 , wherein computing the probabilities of the one or more topics, of the plurality of topics, being assigned to the particular word in the particular document further comprises: determining, based on a filter that is specific to values for the particular document, whether a non-zero theta value, for a particular topic and for the particular document, is included in a theta matrix; and in response to determining that a non-zero theta value, for the particular topic and for the particular document, is included in the theta matrix, retrieving the non-zero theta value from the theta matrix. 12. The method of claim 11 , wherein the filter is implemented with a Bloom filter. 13. One or more non-transitory computer-readable media storing instructions which, when executed by one or more processors, cause performance of identifying sets of correlated words comprising: receiving information for a set of documents; wherein the set of documents comprises a plurality of words; wherein a particular document of the set of documents comprises a particular word of the plurality of words; running a Greedy Gibbs sampler over a Dirichlet distribution of the plurality of words, comprising: computing probabilities of one or more topics, of a plurality of topics, for the particular word in the particular document based, at least in part, on a probability mass for one of: hyperparameters, the particular document, or the particular word; determining, from results of running the Greedy Gibbs sampler over the Dirichlet distribution, one or more sets of correlated words; wherein the one or more sets of correlated words comprises words from the plurality of words. 14. The one or more non-transitory computer-readable media of claim 13 , wherein the Greedy Gibbs sampler has at least one non-collapsed variable. 15. The one or more non-tran
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Creation or modification of classes or clusters · CPC title
Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title
Semantic analysis · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.