Sparse and data-parallel inference method and system for the latent Dirichlet allocation model
US-9767416-B2 · Sep 19, 2017 · US
US10147044B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10147044-B2 |
| Application number | US-201514820169-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 6, 2015 |
| Priority date | Feb 4, 2015 |
| Publication date | Dec 4, 2018 |
| Grant date | Dec 4, 2018 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Herein is described a data-parallel algorithm for topic modeling in which the memory requirements are streamlined for implementation on a highly-parallel architecture, such as a GPU. Specifically, approximate counters are used in a large mixture model or clustering algorithm (e.g., an uncollapsed Gibbs sampler) to decrease memory usage over what is required when conventional counters are used. The decreased memory usage of the approximate counters allows a highly-parallel architecture with limited memory to process more computations for the large mixture model more efficiently. Embodiments describe binary Morris approximate counters, general Morris approximate counters, and Csrös approximate counters in the context of an uncollapsed Gibbs sampler, and, more specifically, for a Greedy Gibbs sampler.
Opening claim text (preview).
What is claimed is: 1. A method for identifying sets of correlated words comprising: running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising: representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words; wherein the method is performed by one or more computing devices. 2. The method of claim 1 , wherein the one or more approximate counters comprise binary Morris approximate counters. 3. The method of claim 1 , wherein the one or more approximate counters comprise general Morris approximate counters. 4. The method of claim 1 , wherein the one or more approximate counters comprise Cs rös approximate counters. 5. The method of claim 1 , wherein running the uncollapsed Gibbs sampler over the Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprises representing one or more other counts in the uncollapsed Gibbs sampler using one or more conventional counters. 6. The method of claim 1 , wherein the one or more approximate counters comprise two or more of: binary Morris approximate counters, general Morris approximate counters, Cs rös approximate counters, or conventional counters. 7. The method of claim 1 , wherein: running the uncollapsed Gibbs sampler over the Dirichlet distribution comprises computing in parallel a plurality of values, including the one or more counts, for the uncollapsed Gibbs sampler; and the plurality of values are computed in a plurality of parallel Single Program Multiple Data (SPMD) units on a graphics processing unit (GPU). 8. The method of claim 1 , further comprising representing each of the one or more approximate counters using eight bits or fewer than eight bits. 9. The method of claim 1 , wherein the uncollapsed Gibbs sampler has at least one variable uncollapsed. 10. The method of claim 1 , wherein the uncollapsed Gibbs sampler is a Greedy Gibbs sampler. 11. One or more non-transitory computer-readable media storing one or more sequences of instructions for identifying sets of correlated words, wherein said one or more sequences of instructions, when executed by one or more processors, cause: running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising: representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words. 12. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise binary Morris approximate counters. 13. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise general Morris approximate counters. 14. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise Cs rös approximate counters. 15. The one or more non-transitory computer-readable media of claim 11 , wherein running the uncollapsed Gibbs sampler over the Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprises representing one or more other counts in the uncollapsed Gibbs sampler using one or more conventional counters. 16. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise two or more of: binary Morris approximate counters, general Morris approximate counters, Cs rös approximate counters, or conventional counters. 17. The one or more non-transitory computer-readable media of claim 11 , wherein: running the uncollapsed Gibbs sampler over the Dirichlet distribution comprises computing in parallel a plurality of values, including the one or more counts, for the uncollapsed Gibbs sampler; and the one or more sequences of instructions include instructions, that when executed by one or more processors, cause the plurality of values to be computed in a plurality of parallel Single Program Multiple Data (SPMD) units on a graphics processing unit (GPU). 18. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more sequences of instructions include instructions, that when executed by one or more processors, cause representing each of the one or more approximate counters using eight bits or fewer than eight bits. 19. The one or more non-transitory computer-readable media of claim 11 , wherein the uncollapsed Gibbs sampler has at least one variable uncollapsed. 20. The one or more non-transitory computer-readable media of claim 11 , wherein the uncollapsed Gibbs sampler is a Greedy Gibbs sampler.
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Creation or modification of classes or clusters · CPC title
Semantic analysis · CPC title
Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.