Multilevel clustering of vector-based data
US-2021224583-A1 · Jul 22, 2021 · US
US11449704B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11449704-B2 |
| Application number | US-202016744241-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 16, 2020 |
| Priority date | Jan 16, 2020 |
| Publication date | Sep 20, 2022 |
| Grant date | Sep 20, 2022 |
Official abstract text for this publication.
A multilevel clustered data set for multidimensional vectors is created by defining a plurality of clusters based on each of the signed dimensions of the vectors, each dimension functioning as an axis. Vectors are assigned to each cluster by measuring cosine similarity between a vector and each axis. Sub-clusters are defined as ranges of cosine similarity values within a cluster, and each vector is assigned into the appropriate range based on their cosine similarity value with the axis of the cluster. Searching for a matching vector to a new vector is efficiently achieved in near-constant time by measuring cosine similarity for the new vector with each axis to identify the closest cluster, reusing the cosine similarity of the new vector and axis to determine which sub-cluster corresponds to the appropriate range of values, and then comparing each vector within the sub-cluster until a match is found or ruled out.
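The two-level lookup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`signed_axis_similarities`, `locate`, `build_index`, `search`), the ten equal-width similarity ranges, and the 0.95 match threshold are all hypothetical choices made for the example. It relies on the fact that the cosine similarity between a vector and a signed coordinate axis is simply the signed component divided by the vector's norm.

```python
import math
from collections import defaultdict


def signed_axis_similarities(vec):
    """Cosine similarity of vec with every signed axis (+e_i and -e_i)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("a zero vector has no direction")
    sims = {}
    for i, x in enumerate(vec):
        sims[(i, +1)] = x / norm   # similarity with the positive axis
        sims[(i, -1)] = -x / norm  # similarity with the negative axis
    return sims


def locate(vec, num_ranges=10):
    """Return (top-level cluster key, sub-cluster index, cosine similarity)."""
    sims = signed_axis_similarities(vec)
    axis = max(sims, key=sims.get)  # signed axis with similarity closest to 1
    sim = sims[axis]
    # Assumption: sub-clusters are equal-width ranges of the similarity value.
    bucket = min(int(sim * num_ranges), num_ranges - 1)
    return axis, bucket, sim


def build_index(vectors, num_ranges=10):
    """Assign each stored vector to its (top-level cluster, sub-cluster)."""
    index = defaultdict(list)
    for v in vectors:
        axis, bucket, _ = locate(v, num_ranges)
        index[(axis, bucket)].append(v)
    return index


def cosine(a, b):
    """Plain cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, num_ranges=10, threshold=0.95):
    """Compare the query only against vectors in its own sub-cluster."""
    axis, bucket, _ = locate(query, num_ranges)
    candidates = index.get((axis, bucket), [])
    return [v for v in candidates if cosine(query, v) >= threshold]


stored = [(0.9, 0.1, 0.0), (0.1, -0.95, 0.2), (0.6, 0.5, 0.4)]
index = build_index(stored)
matches = search(index, (0.88, 0.12, 0.01))
```

The similarity computed to pick the top-level cluster is reused to pick the sub-cluster, so a lookup costs one pass over the query's dimensions plus a scan of a single sub-cluster. In this sketch a query that falls near a range boundary is compared only against its own range; whether neighbouring ranges are also consulted is not stated in the abstract.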
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method (CIM) comprising: receiving a clustered images data set, with the clustered images data set including a plurality of top-level clusters, where a given top-level cluster is determined based on a signed axis and includes a plurality of sub-clusters, where a given sub-cluster is a range of values based, at least in part, on the signed axis of the given top-level cluster and includes one or more multidimensional vectors generated from historical images; determining, from the plurality of top-level clusters, a subset of top-level clusters for removal based, at least in part, on relative similarity of the multidimensional vectors within the sub-clusters of the subset of top-level clusters compared to the multidimensional vectors within the sub-clusters of the other top-level clusters of the plurality of top-level clusters; removing the subset of top-level clusters from the plurality of top-level clusters; receiving an input image data set; generating a multidimensional vector based on the input image data set; determining a top-level cluster closest to the generated multidimensional vector based, at least in part, on the signed axes of the plurality of top-level clusters; determining a sub-cluster of the determined top-level cluster closest to the generated multidimensional vector based, at least in part, on the signed axis of the determined top-level cluster and the generated multidimensional vector; and determining a subset of one or more vectors of the determined sub-cluster as matches for the input image by comparing the generated multidimensional vector to one or more vectors of the determined sub-cluster.
2. The CIM of claim 1, wherein: the vectors of the clustered images data set are based upon biometric facial image scans; and the input image data set is a biometric facial image scan.
3. The CIM of claim 1, wherein: the clustered images data set includes 256 top-level clusters based on vectors with 128 dimensions; and each dimension includes a positive and negative sign.
4. The CIM of claim 1, wherein determining the top-level cluster closest to the generated multidimensional vector includes: measuring a cosine similarity value between the signed axis of each top-level cluster and the generated multidimensional vector; and selecting the top-level cluster with the measured cosine similarity value closest or equal to 1.
5. The CIM of claim 4, wherein the sub-clusters of a given top-level cluster of the clustered images data set are defined as ranges of cosine similarity values measured from the signed axis of the given top-level cluster and the one or more multidimensional vectors generated from historical images assigned to the top-level cluster.
6. The CIM of claim 5, wherein determining the sub-cluster of the determined top-level cluster closest to the generated multidimensional vector includes: determining which sub-cluster is defined by a range of values which includes the measured cosine similarity value between the signed axis of each top-level cluster and the generated multidimensional vector.
7. The CIM of claim 6, wherein the determining a subset of one or more vectors of the determined sub-cluster as matches for the input image by comparing the generated multidimensional vector to the plurality of vectors assigned to the determined sub-cluster includes: comparing the cosine similarity values of the generated multidimensional vector and the axis of the determined top-level cluster with the cosine similarity values of the axis of the determined top-level cluster and each vector in the determined sub-cluster; and determining a match for inclusion into the subset for each vector corresponding to a cosine similarity value within a predetermined threshold value of the cosine similarity value corresponding to the generated multidimensional vector.
8. A computer program product (CPP) comprising: a machine readable storage device; and computer code stored on the machine readable storage device, with the computer code including instructions for causing a processor(s) set to perform operations including the following: receiving a clustered images data set, with the clustered images data set including a plurality of top-level clusters, where a given top-level cluster is determined based on a signed axis and includes a plurality of sub-clusters, where a given sub-cluster is a range of values based, at least in part, on the signed axis of the given top-level cluster and includes one or more multidimensional vectors generated from historical images, determining, from the plurality of top-level clusters, a subset of top-level clusters for removal based, at least in part, on relative similarity of the multidimensional vectors within the sub-clusters of the subset of top-level clusters compared to the multidimensional vectors within the sub-clusters of the other top-level clusters of the plurality of top-level clusters, removing the subset of top-level clusters from the plurality of top-level clusters, receiving an input image data set, generating a multidimensional vector based on the input image data set, determining a top-level cluster closest to the generated multidimensional vector based, at least in part, on the signed axes of the plurality of top-level clusters, determining a sub-cluster of the determined top-level cluster closest to the generated multidimensional vector based, at least in part, on the signed axis of the determined top-level cluster and the generated multidimensional vector, and determining a subset of one or more vectors of the determined sub-cluster as matches for the input image by comparing the generated multidimensional vector to one or more vectors of the determined sub-cluster.
9. The CPP of claim 8, wherein: the vectors of the clustered images data set are based upon biometric facial image scans; and the input image data set is a biometric facial image scan.
10. The CPP of claim 8, wherein: the clustered images data set includes 256 top-level clusters based on vectors with 128 dimensions; and each dimension includes a positive and negative sign.
11. The CPP of claim 8, wherein determining the top-level cluster closest to the generated multidimensional vector includes: measuring a cosine similarity value between the signed axis of each top-level cluster and the generated multidimensional vector; and selecting the top-level cluster with the measured cosine similarity value closest or equal to 1.
12. The CPP of claim 11, wherein the sub-clusters of a given top-level cluster of the clustered images data set are defined as ranges of cosine similarity values measured from the signed axis of the given top-level cluster and the one or more multidimensional vectors generated from historical images assigned to the top-level cluster.
13. The CPP of claim 12, wherein determining the sub-cluster of the determined top-level cluster closest to the generated multidimensional vector includes: determining which sub-cluster is defined by a range of values which includes the measured cosine similarity value between the signed axis of each top-level cluster and the generated multidimensional vector.
14. The CPP of claim 13, wherein the determining a subset of one or more vectors of the determined sub-cluster as matches for the input image by comparing the generated multidimensional vector to the plurality of vectors assigned to the determined sub-cluster includes: comparing the cosine similarity values of the generated multidimensional vector and the axis of the determined top-level cluster with the cosine similarity val
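As an illustration of the claim preview above, the sketch below shows the arithmetic of claims 3 and 10 (128 dimensions, each with a positive and a negative sign, giving 256 signed axes) and one possible reading of the matching step in claim 7, where stored vectors are kept if their cosine similarity to the cluster axis lies within a threshold of the query's. The helper names (`signed_axes`, `axis_similarity`, `matches_by_axis_similarity`) and the threshold of 0.01 are hypothetical and chosen only for this example.

```python
import math


def signed_axes(num_dims):
    """Enumerate one (dimension, sign) pair per signed axis."""
    return [(dim, sign) for dim in range(num_dims) for sign in (+1, -1)]


def axis_similarity(vec, dim, sign):
    """Cosine similarity between vec and the signed coordinate axis."""
    norm = math.sqrt(sum(x * x for x in vec))
    return sign * vec[dim] / norm


def matches_by_axis_similarity(query, sub_cluster, dim, sign, threshold=0.01):
    """Keep vectors whose similarity to the axis is within `threshold`
    of the query's similarity to the same axis (assumed reading of claim 7)."""
    q_sim = axis_similarity(query, dim, sign)
    return [
        v for v in sub_cluster
        if abs(axis_similarity(v, dim, sign) - q_sim) <= threshold
    ]


axes = signed_axes(128)  # 128 dimensions x 2 signs
sub_cluster = [(0.9, 0.1, 0.0), (0.95, 0.3, 0.1)]
found = matches_by_axis_similarity((0.88, 0.12, 0.01), sub_cluster, dim=0, sign=+1)
```

This comparison works on a single scalar per vector, so it is cheap, but two vectors can sit at the same angle from an axis while pointing in different directions. A direct vector-to-vector comparison on the surviving candidates is one way to confirm a match, though that extra step is an assumption of this sketch.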
Hierarchical techniques, i.e. dividing or merging patterns to obtain a tree-like representation; Dendograms · CPC title
Feature selection, e.g. selecting representative features from a multi-dimensional feature space · CPC title
Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram · CPC title
Classification, e.g. identification · CPC title
by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation · CPC title