Number of clusters estimation

US9424337B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9424337-B2
Application numberUS-201414196299-A
CountryUS
Kind codeB2
Filing dateMar 4, 2014
Priority dateJul 9, 2013
Publication dateAug 23, 2016
Grant dateAug 23, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of determining a number of clusters for a dataset is provided. Centroid locations for a defined number of clusters are determined using a clustering algorithm. Boundaries for each of the defined clusters are defined. A reference distribution that includes a plurality of data points is created. The plurality of data points are within the defined boundary of at least one cluster of the defined clusters. Second centroid locations for the defined number of clusters are determined using the clustering algorithm and the reference distribution. A gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares and a second residual sum of squares is computed. The processing is repeated for a next number of clusters to create. An estimated best number of clusters for the received data is determined by comparing the gap statistic computed for each iteration of the number of clusters.

First claim

Opening claim text (preview).

What is claimed is: 1. A non-transitory computer-readable medium having stored thereon computer-readable instructions that when executed by a computing device cause the computing device to: receive data to cluster; define a number of clusters to create; (a) determine centroid locations for the defined number of clusters using a clustering algorithm and the received data to define clusters; (b) define boundaries for each of the defined clusters by determining an eigenvector and an eigenvalue for each dimension of each cluster of the defined clusters using principal components analysis; determining a length for each dimension of each cluster as a proportion of the determined eigenvalue for the respective dimension; and defining the boundaries for each cluster of the defined clusters as a box with a center of the box as the determined centroid location of the respective cluster, a first boundary point for each dimension defined as the center plus the determined length of the respective dimension aligned with the determined eigenvector of the respective dimension, and a second boundary point for each dimension defined as the center minus the determined length of the respective dimension aligned with the eigenvector of the respective dimension; (c) create a reference distribution that includes a plurality of data points, wherein the plurality of data points are within the defined boundary of at least one cluster of the defined clusters; (d) determine second centroid locations for the defined number of clusters using the clustering algorithm and the created reference distribution to define second clusters; (e) compute a gap statistic for the defined number of clusters based on a comparison between a first residual sum of squares computed for the defined clusters and a second residual sum of squares computed for the defined second clusters; (f) repeat (a) to (e) with a next number of clusters to create as the defined number of clusters; and (g) determine an estimated best number of clusters for the received data by comparing the gap statistic computed for each iteration of (e). 2. The computer-readable medium of claim 1 , wherein the gap statistic for the defined number of clusters is computed using gap(k)=log (W k *)−log(W k ), where W k = ∑ j = 1 k ⁢ ∑ i = 1 n j ⁢  x i , j - c j  2 is the first residual sum of squares computed for the defined clusters, and W k * = ∑ j = 1 k ⁢ ∑ i = 1 n j * ⁢  x i , j * - c j *  2 is the second residual sum of squares computed for the defined second clusters, where k is the defined number of clusters, n j is the number of data points in cluster j of the defined clusters, x i,j is the ith data point in cluster j of the defined clusters, c j is the centroid location of cluster j of the defined clusters, n j * is the number of data points in cluster j of the defined second clusters, x i,j * is the ith data point in cluster j of the defined second clusters, and c j * is the centroid location of cluster j of the defined second clusters. 3. The computer-readable medium of claim 2 , wherein (c) and (d) are repeated a plurality of times and wherein W k * = 1 B ⁢ ∑ b = 1 B ⁢ ∑ j = 1 k ⁢ ∑ i = 1 n j * ⁢  x i , j * - c j *  2

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9424337B2 cover?
A method of determining a number of clusters for a dataset is provided. Centroid locations for a defined number of clusters are determined using a clustering algorithm. Boundaries for each of the defined clusters are defined. A reference distribution that includes a plurality of data points is created. The plurality of data points are within the defined boundary of at least one cluster of the d…
Who is the assignee on this patent?
Sas Inst Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 23 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).