Per-attribute data clustering using tri-point data arbitration

US9514213B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9514213-B2
Application numberUS-201313833757-A
CountryUS
Kind codeB2
Filing dateMar 15, 2013
Priority dateMar 15, 2013
Publication dateDec 6, 2016
Grant dateDec 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and other embodiments associated with clustering using tri-point arbitration are described. In one embodiment, a method includes selecting a data point pair and a set of arbiter points. A tri-point arbitration similarity is calculated for data point pairs based, at least in part, on a distance between the first and second data points and the arbiter points. In one embodiment, similar data points are clustered.

First claim

Opening claim text (preview).

What is claimed is: 1. A non-transitory computer-readable medium storing computer-executable instructions that, when executed by of a computer cause the computer to perform functions, the instructions comprising instructions for: receiving a set of un-clustered data points to be grouped into one or more clusters; calculating respective distances between all pairs of data points in the set of un-clustered data points; selecting a set of one or more arbiter points that are representative of the set of un-clustered data points; computing a per-arbiter similarity for each pair of data points in the set of un-clustered data points, based at least in part on the distances between data points in the pair with respect to each arbiter point in the set of arbiter points, such that the similarity metric indicates that data points (x 1 ) and (x 2 ) in a given data point pair are similar with respect to a given arbiter point (a) when a distance between (x 1 ) and (x 2 ) is less than: i) a distance between (x 1 ) and (a) and ii) a distance between (x 2 ) and (a); combining the per-arbiter similarities for each data point pair to compute a similarity metric for each data point pair; identifying data points (x 1 ) and (x 2 ) as similar data points when the similarity metric for (x 1 ) and (x 2 ) exceeds a threshold; and grouping the similar data points into the one or more clusters. 2. The non-transitory computer-readable medium of claim 1 , wherein the instructions further comprise identifying similar data points by identifying pairs of data points that are determined to be similar with respect to an aggregation of arbiter points in the set of arbiter points. 3. The non-transitory computer-readable medium of claim 1 , wherein (x 1 ), (x 2 ), and (a) include one or more numerical attribute values, and further comprising instructions for determining the similarity between (x 1 ) and (x 2 ) based, at least in part, on Euclidean distances between the numerical attributes of (x 1 ) and (x 2 ) and (a). 4. The non-transitory computer-readable medium of claim 1 , further comprising instructions for determining the similarity between (x 1 ) and (x 2 ) and (a) using a set of if-then rules that specify a similarity value based on given values of (x 1 ) and (x 2 ) and (a). 5. The non-transitory computer-readable medium of claim 1 , wherein (x 1 ), (x 2 ), and (a) include one or more binary attribute values, and further comprising instructions for determining the similarity between the binary attributes of (x 1 ) and (x 2 ) and (a) as: i) 1 if a Hamming distance between (x 1 ) and (x 2 ) is less than both a Hamming distance between (x 1 ) and (a) and a Hamming distance between (x 2 ) and (a); ii) −1 if the Hamming distance between (x 1 ) and (x 2 ) is greater than either the Hamming distance between (x 1 ) and (a) or the Hamming distance between (x 2 ) and (a); and iii) 0 if the Hamming distance between (x i ) and (x 2 ) is equal to both the Hamming distance between (x 1 ) and (a) and the Hamming distance between (x 2 ) and (a). 6. The non-transitory computer-readable medium of claim 1 , wherein data points in the data set each comprise values for a plurality of attributes, further comprising instructions for: for each arbiter, computing the distance between data points (x 1 ) and (x 2 ), and a given arbiter (a) by combining selected per-attribute distances between the data points and (a); computing a per-arbiter similarity for each arbiter based, at least in part, on the distance; and combining the per-arbiter similarities to compute the similarity metric. 7. The non-transitory computer-readable medium of claim 1 , wherein data points in the data set each comprise a plurality of attribute values; further comprising instructions for: for each arbiter point (a) in the set of arbiter points, determining a set of per-attribute distances between selected attributes of the data points (x 1 ), (x 2 ), and (a), where each set of per-attribute distances includes distances on the same attribute between data points (x 1 ), (x 2 ) and the arbiter points; and for respective sets of per-attribute distances, computing respective per-attribute similarities based, at least in part, on the set of per-attribute distances; and combining the per-attribute similarities to compute the similarity metric. 8. The non-transitory computer-readable medium of claim 7 , further comprising instructions for combining sets of per-attribute distances for a selected group of attributes and computing a single per-attribute similarity for the combined sets of per-attribute distances. 9. The non-transitory computer-readable medium of claim 7 , further comprising instructions for determining the similarity metric based on a proportion of per-attribute similarities that indicate that (x 1 ) and (x 2 ) are similar. 10. The non-transitory computer-readable medium of claim 7 , further comprising instructions for i) determining a first per-attribute similarity based, at least in part, on a first subset of arbiters in A and ii) determining a second per-attribute similarity based on a second subset of arbiters in A that is different than the first subset. 11. The non-transitory computer-readable medium of claim 10 , further comprising instructions for i) determining the first per-attribute similarity based, at least in part, on an average of the first per-attribute distances and ii) determining the second per-attribute similarity based, at least in part, on an average of the second per-attribute distances. 12. The non-transitory computer-readable medium of claim 10 , further comprising instructions for combining the first per-attribute similarity with the second per-attribute similarity according to a weighting scheme. 13. The non-transitory computer-readable medium of claim 1 , wherein data points in the data set each comprise a plurality of attribute values, further comprising instructions for identifying attributes as redundant attributes that do not significantly contribute to similarity analysis for (x 1 ) and (x 2 ) by: selecting a subset of attributes; computing a second similarity metric for (x 1 ) and (x 2 ) by combining the similarity metrics for attributes that are not members of the subset; comparing the similarity metric with the second similarity metric; and when a difference between the similarity metric and the second similarity metric is below a threshold, identifying attributes in the subset of attributes as redundant attributes. 14. The non-transitory computer-readable medium of claim 1 , further comprising instructions for determining the threshold based, at least in part, on probability that a random pair of data points will be identified as similar by an aggregation of arbiter points. 15. The non-transitory computer-readable medium of claim 1 , further comprising instructions for creating the one or more clusters by: selecting a first data point and a second data point for membership in a cluster, wherein the first data point and the second data point are identified as similar; adding, as members of the cluster, data points that are similar to any member of the cluster; and until all data points are members of clusters: selecting a remaining data point that is not a member of a cluster to form a new cluster; adding data points that are similar to the remaining data point as members to the new cluster. 16. The non-transitory computer-readable medium of claim 15 , further comprising instructions for removing a member data point from a cluster when the member data point is not similar to an arbiter aggregation selected from

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9514213B2 cover?
Systems, methods, and other embodiments associated with clustering using tri-point arbitration are described. In one embodiment, a method includes selecting a data point pair and a set of arbiter points. A tri-point arbitration similarity is calculated for data point pairs based, at least in part, on a distance between the first and second data points and the arbiter points. In one embodiment, …
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 06 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).