Distributed clustering with outlier detection

US10042912B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10042912-B2
Application numberUS-201414553093-A
CountryUS
Kind codeB2
Filing dateNov 25, 2014
Priority dateApr 8, 2014
Publication dateAug 7, 2018
Grant dateAug 7, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One or more processors initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of CF-trees included in a plurality of subsets. One or more processors, generate respective partial clustering solutions for the subsets. A partial clustering solution includes a set of regular sub-clusters and candidate outlier sub-clusters. One or more processors generate initial regular clusters by performing hierarchical clustering using the regular sub-clusters. For a candidate outlier sub-cluster, one or more processors determine a closest initial regular cluster, and a distance separating the candidate outlier sub-cluster and the closest initial regular cluster. One or more processors determine which candidate outlier sub-clusters are outlier clusters based on which candidate outlier sub-clusters have a computed distance to their respective closest initial regular cluster that is greater than a corresponding distance threshold associated with their respective closest initial regular cluster.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of outlier detection, the method comprising: identifying, by one or more processors, a plurality of subsets of CF-trees based, at least in part, on a set of CF-trees; initiating, by one or more processors, cluster feature (CF)-tree based hierarchical clustering on leaf entries of a plurality of CF-trees included in a given subset of CF-trees, wherein each CF-tree presents a respective set of local data as leaf entries; generating, by one or more processors, respective partial clustering solutions using the given subset of CF-trees, wherein clustering activity for a resulting CF-tree ceases when a threshold number of sub-clusters has been reached, and wherein a partial clustering solution includes a set of regular sub-clusters that are included in the resulting CF-tree and candidate outlier sub-clusters that contain summaries of sets of data records; generating, by one or more processors, initial regular clusters by performing hierarchical clustering using the set of regular sub-clusters; for each candidate outlier sub-cluster of a plurality of candidate outlier sub-clusters, determining, by one or more processors, a closest initial regular cluster, and a distance separating the respective candidate outlier sub-cluster and the respective closest initial regular cluster; and determining, by one or more processors, which candidate outlier sub-clusters are outlier clusters based, at least in part, on which candidate outlier sub-clusters have a computed distance to their respective closest initial regular cluster that is greater than a corresponding distance threshold associated with their respective closest initial regular cluster. 2. The method of claim 1 , the method further comprising: identifying, by one or more processors, one or more candidate outliers of the CF-trees and their closest leaf entries; determining, by one or more processors, whether merging a candidate outlier, included in the one or more candidate outliers, with its closest leaf entry will break a maximal tightness threshold; and responsive to a determination that merging the candidate outlier with its closest leaf entry will not break the maximal tightness threshold, merging, by one or more processors, that candidate outlier with its closest leaf entry. 3. The method of claim 1 , the method further comprising: merging, by one or more processors, one or more candidate outlier sub-clusters that are determined not to be outlier clusters with their respective closest initial regular clusters; and updating, by one or more processors, the distance threshold for one or more initial regular clusters that were merged with one or more candidate outlier sub-clusters. 4. The method of claim 1 , the method further comprising: determining, by one or more processors, a measure of outlier strength for candidate outlier sub-clusters. 5. The method of claim 1 , the method further comprising: computing, by one or more processors, an outlier strength for a plurality of outlier clusters; and sorting, by one or more processors, the plurality of outlier clusters according to outlier strength. 6. The method of claim 1 , the method further comprising: generating, by one or more processors, a specified number of outlier clusters, based, at least in part, on one or more of the following measures for each determined outlier cluster: a size of the outlier cluster, an outlier strength of the outlier cluster, and a probability of the outlier cluster belonging to a regular cluster. 7. The method of claim 1 , the method further comprising: determining, by one or more processors, a minimum outlier strength for a top percent of outlier clusters; and determining, by one or more processors, whether a new data record is an outlier based, at least in part on a comparison of a determined outlier strength for the new data record and the minimum outlier strength, wherein a determined outlier strength that is above the minimum outlier strength indicates that the new data record is an outlier.

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10042912B2 cover?
One or more processors initiate cluster feature (CF)-tree based hierarchical clustering on leaf entries of CF-trees included in a plurality of subsets. One or more processors, generate respective partial clustering solutions for the subsets. A partial clustering solution includes a set of regular sub-clusters and candidate outlier sub-clusters. One or more processors generate initial regular cl…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 07 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).