Distributed clustering with outlier detection

US9589045B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9589045-B2
Application numberUS-201414247351-A
CountryUS
Kind codeB2
Filing dateApr 8, 2014
Priority dateApr 8, 2014
Publication dateMar 7, 2017
Grant dateMar 7, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One or more processors initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of CF-trees included in a plurality of subsets. One or more processors, generate respective partial clustering solutions for the subsets. A partial clustering solution includes a set of regular sub-clusters and candidate outlier sub-clusters. One or more processors generate initial regular clusters by performing hierarchical clustering using the regular sub-clusters. For a candidate outlier sub-cluster, one or more processors determine a closest initial regular cluster, and a distance separating the candidate outlier sub-cluster and the closest initial regular cluster. One or more processors determine which candidate outlier sub-clusters are outlier clusters based on which candidate outlier sub-clusters have a computed distance to their respective closest initial regular cluster that is greater than a corresponding distance threshold associated with their respective closest initial regular cluster.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer program product for outlier detection, the computer program product comprising: one or more computer-readable storage media and program instructions stored on the one or more computer-readable storage media, the program instructions comprising: program instructions to, based, at least in part, on a set of CF-trees, identify a plurality of subsets of CF-trees included in the set of CF-trees; program instructions to initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of a plurality of CF-trees included in a given subset of CF-trees, wherein each CF-tree presents a respective set of local data as leaf entries; program instructions to generate respective partial clustering solutions using the given subset of CF-trees, wherein clustering activity for a resulting CF-tree ceases when a threshold number of sub-clusters has been reached, and wherein a partial clustering solution includes a set of regular sub-clusters that are included in the resulting CF-tree and candidate outlier sub-clusters that contain summaries of sets of data records; program instructions to generate initial regular clusters by performing hierarchical clustering using the set of regular sub-clusters; program instructions to determine, for each candidate outlier sub-cluster of a plurality of candidate outlier sub-clusters, a closest initial regular cluster, and a distance separating the respective candidate outlier sub-cluster and the respective closest initial regular cluster; and program instructions to determine which candidate outlier sub-clusters are outlier clusters based, at least in part, on which candidate outlier sub-clusters have a computed distance to their respective closest initial regular cluster that is greater than a corresponding distance threshold associated with their respective closest initial regular cluster. 2. The computer program product of claim 1 , the program instructions further comprising: program instructions to identify one or more candidate outliers of the CF-trees and their closest leaf entries; program instructions to determine whether merging a candidate outlier, included in the one or more candidate outliers, with its closest leaf entry will break a maximal tightness threshold; and program instructions to respond to a determination that merging the candidate outlier with its closest leaf entry will not break the maximal tightness threshold, merging that candidate outlier with its closest leaf entry. 3. The computer program product of claim 1 , the program instructions further comprising: program instructions to merge one or more candidate outlier sub-clusters that are determined not to be outlier clusters with their respective closest initial regular clusters; and program instructions to update the distance threshold for one or more initial regular clusters that were merged with one or more candidate outlier sub-clusters. 4. The computer program product of claim 1 , the program instructions further comprising: program instructions to determine a measure of outlier strength for candidate outlier sub-clusters. 5. The computer program product of claim 1 , the program instructions further comprising: program instructions to compute an outlier strength for a plurality of outlier clusters; and program instructions to sort the plurality of outlier clusters according to outlier strength. 6. The computer program product of claim 1 , the program instructions further comprising: program instructions to generate a specified number of outlier clusters, based, at least in part, on one or more of the following measures for each determined outlier cluster: a size of the outlier cluster, an outlier strength of the outlier cluster, and a probability of the outlier cluster belonging to a regular cluster. 7. The computer program product of claim 1 , the program instructions to further comprising: program instructions to determine a minimum outlier strength for a top percent of outlier clusters; and program instructions to determine whether a new data record is an outlier based, at least in part on a comparison of a determined outlier strength for the new data record and the minimum outlier strength, wherein a determined outlier strength that is above the minimum outlier strength indicates that the new data record is an outlier. 8. A computer system for outlier detection, the computer system comprising: one or more computer processors; one or more computer readable storage medium; program instructions stored on the computer readable storage medium for execution by at least one of the one or more processors, the program instructions comprising: program instructions to, based, at least in part, on a set of CF-trees, identify a plurality of subsets of CF-trees included in the set of CF-trees; program instructions to initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of a plurality of CF-trees included in a given subset of CF-trees, wherein each CF-tree presents a respective set of local data as leaf entries; program instructions to generate respective partial clustering solutions using the given subset of CF-trees, wherein clustering activity for a resulting CF-tree ceases when a threshold number of sub-clusters has been reached, and wherein a partial clustering solution includes a set of regular sub-clusters that are included in the resulting CF-tree and candidate outlier sub-clusters that contain summaries of sets of data records; program instructions to generate initial regular clusters by performing hierarchical clustering using the set of regular sub-clusters; program instructions to, for each candidate outlier sub-cluster of a plurality of candidate outlier sub-clusters, determine a closest initial regular cluster, and a distance separating the respective candidate outlier sub-cluster and the respective closest initial regular cluster; and program instructions to determine which candidate outlier sub-clusters are outlier clusters based, at least in part, on which candidate outlier sub-clusters have a computed distance to their respective closest initial regular cluster that is greater than a corresponding distance threshold associated with their respective closest initial regular cluster. 9. The computer system of claim 8 , the program instructions further comprising: program instructions to identify one or more candidate outliers of the CF-trees and their closest leaf entries; program instructions to determine whether merging a candidate outlier, included in the one or more candidate outliers, with its closest leaf entry will break a maximal tightness threshold; and program instructions to respond to a determination that merging the candidate outlier with its closest leaf entry will not break the maximal tightness threshold, by merging that candidate outlier with its closest leaf entry. 10. The computer system of claim 8 , the program instructions further comprising: program instructions to merge one or more candidate outlier sub-clusters that are determined not to be outlier clusters with their respective closest initial regular clusters; and program instructions to update the distance threshold for one or more initial regular clusters that were merged with one or more candidate outlier sub-clusters. 11. The computer system of claim 8 , the program instructions further comprising: program instructions to determine a measure of outlier strength for candidate outlier sub-clusters. 12. The computer system of claim 8 , the program instructions further comprising: program instructions to generate a specified number of outlier clusters, based, at least in part,

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9589045B2 cover?
One or more processors initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of CF-trees included in a plurality of subsets. One or more processors, generate respective partial clustering solutions for the subsets. A partial clustering solution includes a set of regular sub-clusters and candidate outlier sub-clusters. One or more processors generate initial regular cl…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 07 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).