Massive clustering of discrete distributions

US9720998B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9720998-B2
Application numberUS-201314081525-A
CountryUS
Kind codeB2
Filing dateNov 15, 2013
Priority dateNov 19, 2012
Publication dateAug 1, 2017
Grant dateAug 1, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The trend of analyzing big data in artificial intelligence requires more scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed for clustering bags of weighted vectors which are well adopted in many emerging machine learning applications. The high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present a parallel D2-clustering algorithm with substantially improved scalability. We develop a hierarchical structure for parallel computing in order to achieve a balance between the individual-node computation and the integration process of the algorithm. The parallel algorithm achieves significant speed-up with minor accuracy loss.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method of determining similarities in still or moving images utilizing a hierarchical structure including a base level and higher levels, the method comprising the steps of: a) performing an initial segmentation of data objects in the base level into a plurality of segments using a master processor, each object being representative of a different image or video, the data objects included in the base level of the hierarchy being original data objects and a data set of the base level; b) performing a discrete distribution (D2) clustering on each of the segments on one or more slave processors to determine a group of clusters within the segment, each of the clusters corresponding to a local centroid; c) combining the local centroids within all of the segments determined in step b) into one global data set of local centroids and performing an initial segmentation of this global data set of local centroids into a plurality of segments using the master processor, the global data set of local centroids representing a higher level of the hierarchy, a size of the data set of the higher level being smaller than a size of data set of the previous level; d) iteratively repeating steps b) and c) at higher levels in the hierarchy until a single segmentation of the data objects is achieved, the number of centroids is reduced to a predefined number, or a predefined threshold based on distances of the data objects to the centroids is satisfied; and e) outputting information regarding the way in which the data objects are clustered in terms of similarity. 2. The method of claim 1 , wherein the initial segmentation of the data objects is based upon adjacency. 3. The method of claim 1 , wherein the D2 clustering operations are performed by parallel slave processors. 4. The method of claim 1 , wherein the D2 clustering operations are performed by the slave processors in sequence. 5. The method of claim 1 , including the steps of: assigning different cluster labels to the data objects within each segment; and optimizing the cluster labels of each data object to determine the local centroids. 6. The method of claim 1 , including the step of imposing one or more constraints on the centroids passed to each level in the hierarchy. 7. The method of claim 1 , wherein the cluster centroids passed to successively higher levels are weighted to maintain equal contributions from each original data object. 8. The method of claim 1 , wherein the master processor distributes the data segments to different parallel slave processors to perform the D2-clustering at each level in the hierarchy. 9. The method of claim 8 , including the use of synchronized Message Passing Interface (MPI) and MapReduce techniques to perform message passing and data transmission between the master and slave processors. 10. The method of claim 1 , wherein the objects to be clustered are mathematically represented by discrete distributions or bags of weighted vectors. 11. The method of claim 1 , wherein the series of discrete distribution (D2) clustering operations are performed by physically separate parallel processors or separate cores of an integrated device. 12. The method of claim 1 , the step of performing an initial segmentation in a) or c) comprising: a) performing an initial clustering of the data objects or data set using a constrained discrete distribution (D2) clustering algorithm to divide the data into segments; b) performing a series of constrained D2 clustering operations on one or more of the segments to divide the data into additional segments, if necessary; and c) iteratively repeating step b), if necessary, until the size of each segment is reduced to an acceptable level, the number of segments increased to an acceptable level, or another stopping criterion is satisfied. 13. The method of claim 12 , wherein the constraint includes a fixed weighting scheme on the support vectors of the discrete distributions that are the cluster centroids. 14. The method of claim 5 , wherein the step of optimizing the cluster labels comprises iteration of updating the clusters based on nearest centroids from each data object and updating the centroids based on the updated clusters. 15. The method of claim 1 , including the steps of assigning, at the master processor, each centroid as a cluster label to the data object in the corresponding cluster, the label propagating to all descendent in the previous levels. 16. The method of claim 1 , wherein the combination of the local centroids of all of the segments in a hierarchical level is different from a set of centroids obtained if a D2-clustering is directly performed on all data of the data set in the hierarchical level. 17. A method of determining similarities in a biological process or genetic sequence utilizing a hierarchical structure including a base level and higher levels, the method comprising the steps of: a)) performing a segmentation of data objects in the base level into a plurality of segments using a master processor, each data object being representative of a different biological process or genetic sequence, the data objects included in the base level of the hierarchy being original data objects and a data set of the base level; b) performing a discrete distribution (D2) clustering on each of the segments on one or more slave processors to determine a group of dusters within each of the segments, each of the clusters corresponding to a local centroid; c) combining the local centroids within all of the segments determined in step b) into one global data set of local centroids and performing a segmentation of this global data set of local centroids into a plurality of segments using the master processor, the global data set of local centroids representing a higher level of the hierarchy, a size of the data set of the higher level being smaller than a size of the data set of the previous level; d) iteratively repeating steps b) and c) at higher levels in the hierarchy until a single segmentation of the data objects is achieved, the number of centroids is reduced to a predefined number, or a predefined threshold based on distances of the data objects to the centroids is satisfied; and e) outputting information regarding the way in which the data objects are clustered in terms of similarity.

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9720998B2 cover?
The trend of analyzing big data in artificial intelligence requires more scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed for clustering bags of weighted vectors which are well a…
Who is the assignee on this patent?
Penn State Res Found
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 01 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).