Method and system for latent dirichlet allocation computation using approximate counters

US10147044B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10147044-B2
Application numberUS-201514820169-A
CountryUS
Kind codeB2
Filing dateAug 6, 2015
Priority dateFeb 4, 2015
Publication dateDec 4, 2018
Grant dateDec 4, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Herein is described a data-parallel algorithm for topic modeling in which the memory requirements are streamlined for implementation on a highly-parallel architecture, such as a GPU. Specifically, approximate counters are used in a large mixture model or clustering algorithm (e.g., an uncollapsed Gibbs sampler) to decrease memory usage over what is required when conventional counters are used. The decreased memory usage of the approximate counters allows a highly-parallel architecture with limited memory to process more computations for the large mixture model more efficiently. Embodiments describe binary Morris approximate counters, general Morris approximate counters, and Csrös approximate counters in the context of an uncollapsed Gibbs sampler, and, more specifically, for a Greedy Gibbs sampler.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for identifying sets of correlated words comprising: running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising: representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words; wherein the method is performed by one or more computing devices. 2. The method of claim 1 , wherein the one or more approximate counters comprise binary Morris approximate counters. 3. The method of claim 1 , wherein the one or more approximate counters comprise general Morris approximate counters. 4. The method of claim 1 , wherein the one or more approximate counters comprise Cs rös approximate counters. 5. The method of claim 1 , wherein running the uncollapsed Gibbs sampler over the Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprises representing one or more other counts in the uncollapsed Gibbs sampler using one or more conventional counters. 6. The method of claim 1 , wherein the one or more approximate counters comprise two or more of: binary Morris approximate counters, general Morris approximate counters, Cs rös approximate counters, or conventional counters. 7. The method of claim 1 , wherein: running the uncollapsed Gibbs sampler over the Dirichlet distribution comprises computing in parallel a plurality of values, including the one or more counts, for the uncollapsed Gibbs sampler; and the plurality of values are computed in a plurality of parallel Single Program Multiple Data (SPMD) units on a graphics processing unit (GPU). 8. The method of claim 1 , further comprising representing each of the one or more approximate counters using eight bits or fewer than eight bits. 9. The method of claim 1 , wherein the uncollapsed Gibbs sampler has at least one variable uncollapsed. 10. The method of claim 1 , wherein the uncollapsed Gibbs sampler is a Greedy Gibbs sampler. 11. One or more non-transitory computer-readable media storing one or more sequences of instructions for identifying sets of correlated words, wherein said one or more sequences of instructions, when executed by one or more processors, cause: running an uncollapsed Gibbs sampler over a Dirichlet distribution of a plurality of words in a set of documents to produce sampler result data, further comprising: representing one or more counts in the uncollapsed Gibbs sampler using one or more approximate counters, and using one or more probabilistic techniques to increment the one or more approximate counters; and determining, from the sampler result data, one or more sets of correlated words. 12. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise binary Morris approximate counters. 13. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise general Morris approximate counters. 14. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise Cs rös approximate counters. 15. The one or more non-transitory computer-readable media of claim 11 , wherein running the uncollapsed Gibbs sampler over the Dirichlet distribution of the plurality of words in the set of documents to produce sampler result data, further comprises representing one or more other counts in the uncollapsed Gibbs sampler using one or more conventional counters. 16. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more approximate counters comprise two or more of: binary Morris approximate counters, general Morris approximate counters, Cs rös approximate counters, or conventional counters. 17. The one or more non-transitory computer-readable media of claim 11 , wherein: running the uncollapsed Gibbs sampler over the Dirichlet distribution comprises computing in parallel a plurality of values, including the one or more counts, for the uncollapsed Gibbs sampler; and the one or more sequences of instructions include instructions, that when executed by one or more processors, cause the plurality of values to be computed in a plurality of parallel Single Program Multiple Data (SPMD) units on a graphics processing unit (GPU). 18. The one or more non-transitory computer-readable media of claim 11 , wherein the one or more sequences of instructions include instructions, that when executed by one or more processors, cause representing each of the one or more approximate counters using eight bits or fewer than eight bits. 19. The one or more non-transitory computer-readable media of claim 11 , wherein the uncollapsed Gibbs sampler has at least one variable uncollapsed. 20. The one or more non-transitory computer-readable media of claim 11 , wherein the uncollapsed Gibbs sampler is a Greedy Gibbs sampler.

Assignees

Inventors

Classifications

  • G06N7/01Primary

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • G06F16/355Primary

    Creation or modification of classes or clusters · CPC title

  • Semantic analysis · CPC title

  • Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10147044B2 cover?
Herein is described a data-parallel algorithm for topic modeling in which the memory requirements are streamlined for implementation on a highly-parallel architecture, such as a GPU. Specifically, approximate counters are used in a large mixture model or clustering algorithm (e.g., an uncollapsed Gibbs sampler) to decrease memory usage over what is required when conventional counters are used. …
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06N7/01. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 04 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).