Scalable clustering for mixed machine learning data

US10354201B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10354201-B1
Application number: US-201614990171-A
Country: US
Kind code: B1
Filing date: Jan 7, 2016
Priority date: Jan 7, 2016
Publication date: Jul 16, 2019
Grant date: Jul 16, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A number of attributes of different attribute types, to be used to assign observation records of a data set to clusters, are identified. Attribute-type-specific distance metrics for the attributes, which can be combined to obtain a normalized aggregated distance of an observation record from a cluster representative, are selected. One or more iterations of a selected clustering methodology are implemented on the data set using resources of a machine learning service until targeted termination criteria are met. A given iteration includes assigning the observations to clusters of a current version of a clustering model based on the aggregated distances from the cluster representatives of the current version, and updating the cluster representatives to generate a new version of the clustering model.
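The abstract's idea of combining attribute-type-specific distances into one normalized aggregated distance can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the per-type metrics (range-scaled absolute difference for numeric values, 0/1 mismatch for categorical values, Jaccard distance on token sets for text), the unweighted average used for aggregation, and all function and field names.

```python
from typing import Dict, List, Union

Value = Union[float, str]


def numeric_distance(a: float, b: float, value_range: float) -> float:
    """Absolute difference scaled into [0, 1] by the attribute's observed range."""
    if value_range <= 0:
        return 0.0
    return min(abs(a - b) / value_range, 1.0)


def categorical_distance(a: str, b: str) -> float:
    """0 when the categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def text_distance(a: str, b: str) -> float:
    """Jaccard distance between lower-cased whitespace token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def aggregated_distance(
    record: Dict[str, Value],
    representative: Dict[str, Value],
    schema: Dict[str, str],
    numeric_ranges: Dict[str, float],
) -> float:
    """Average of per-attribute distances, each already in [0, 1]."""
    total = 0.0
    for name, kind in schema.items():
        a, b = record[name], representative[name]
        if kind == "numeric":
            total += numeric_distance(float(a), float(b), numeric_ranges[name])
        elif kind == "categorical":
            total += categorical_distance(str(a), str(b))
        elif kind == "text":
            total += text_distance(str(a), str(b))
        else:
            raise ValueError(f"unknown attribute type: {kind}")
    return total / len(schema)


def assign(
    record: Dict[str, Value],
    representatives: List[Dict[str, Value]],
    schema: Dict[str, str],
    numeric_ranges: Dict[str, float],
) -> int:
    """Index of the cluster representative closest to the record."""
    distances = [
        aggregated_distance(record, rep, schema, numeric_ranges)
        for rep in representatives
    ]
    return distances.index(min(distances))


# Hypothetical sample data, for illustration only.
schema = {"age": "numeric", "plan": "categorical", "bio": "text"}
numeric_ranges = {"age": 50.0}
record = {"age": 30.0, "plan": "basic", "bio": "likes hiking and books"}
representatives = [
    {"age": 40.0, "plan": "basic", "bio": "likes books"},
    {"age": 70.0, "plan": "premium", "bio": "plays chess"},
]
closest = assign(record, representatives, schema, numeric_ranges)
```

Because each per-type distance is scaled to the same [0, 1] interval before averaging, no single attribute type dominates the aggregate merely because of its units. A real service would likely allow per-attribute weights and other metrics.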

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: one or more computing devices of a machine learning service implemented at a provider network; wherein the one or more computing devices are configured to: identify a data source from which a plurality of observation records of a data set are to be obtained, wherein a particular observation record of the plurality of observation records comprises (a) a first attribute of a first attribute type of a set of attribute types, wherein members of the set include text attributes, numeric attributes and categorical attributes, and (b) a second attribute of a second attribute type of the set of attribute types; select a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein the first and second distance metrics are to be used collectively to determine a multi-attribute distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; determine, using a subset of observation records of the data set, an initial version of a model of the data set, wherein the initial version of the model comprises a respective initial cluster representative associated with individual ones of the plurality of clusters, and wherein the subset of observation records excludes at least one observation record of the data set; perform one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective multi-attribute distance of the particular observation record from individual cluster representatives of a particular version of the model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the other version comprises modifying at least one cluster representative included in the particular version of the model; in response to determining that a termination criterion of the particular clustering methodology has been met, store, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and cause a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

2. The system as recited in claim 1, wherein determining that the termination criterion has been met comprises one or more of: (a) receiving an indication via a programmatic interface from a client of the machine learning service, or (b) determining, after a particular iteration of the one or more iterations, a relative convergence cost error metric associated with the particular iteration.

3. The system as recited in claim 1, wherein the one or more computing devices are configured to: select, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service, wherein the number of execution platforms selected is based at least in part on one or more of: (a) an estimate of a computation workload associated with the particular iteration, or (b) a utilization metric of the pool of execution platforms.

4. The system as recited in claim 1, wherein the particular clustering methodology comprises a use of a version of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

5. The system as recited in claim 1, wherein the one or more computing devices are configured to: provide an indication, to a client via a programmatic interface, of (a) a first metric of discriminative utility associated with the first attribute, and (b) a second metric of discriminative utility associated with the second attribute.

6. A method, comprising: performing, by one or more computing devices: determining that a particular observation record of a data set includes a heterogeneous collection of attributes, including (a) a first attribute of a first attribute type of a set of attribute types and (b) a second attribute of a second attribute type of the set of attribute types, wherein the data set comprises a plurality of observation records including the particular observation record; selecting a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein at least one distance metric of the first and second distance metrics is to be used to determine an aggregate distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; performing, using one or more resources of a network-accessible machine learning service, one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective aggregate distance of the particular observation record from cluster representatives of a particular version of a model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the updated version comprises modifying at least one cluster representative included in the particular version of the model; in response to detecting that a termination criterion of the particular clustering methodology has been met, storing, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and causing a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

7. The method as recited in claim 6, wherein said detecting that the termination criterion has been met comprises determining, after a particular iteration of the plurality of iterations has been completed, that an estimate of a relative convergence cost error metric corresponding to the particular iteration has reached a threshold value.

8. The method as recited in claim 7, wherein the estimate of the relative convergence cost error metric is based at least in part on one or more of: (a) the total number of iterations which have been completed, (b) a fraction of observation records of the data set whose cluster assignment changed during the particular iteration, or (c) a relative change in a cost function computed during the particular iteration.

9. The method as recited in claim 6, further comprising performing, by the one or more computing devices: selecting, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service.

10. The method as recited in claim 6, wherein the particular clustering methodology comprises a use of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

11. The method as recited in claim 6, wherein the observation records of th
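Claims 7 and 8 tie termination to an estimate that may draw on the number of completed iterations, the fraction of records whose cluster assignment changed, and the relative change in a cost function. A minimal Python sketch of such a check follows. It assumes the three signals are tested independently against thresholds; the function name, parameter names, and default threshold values are hypothetical and do not come from the patent, which does not spell out how the signals are combined.

```python
def should_terminate(
    iteration: int,
    max_iterations: int,
    prev_cost: float,
    cost: float,
    n_changed: int,
    n_records: int,
    cost_tol: float = 1e-3,    # hypothetical threshold on relative cost change
    change_tol: float = 0.01,  # hypothetical threshold on reassigned fraction
) -> bool:
    """Return True when any of the three illustrative signals says to stop."""
    # (a) Number of iterations completed so far.
    if iteration >= max_iterations:
        return True

    # (b) Fraction of observation records whose assignment changed.
    changed_fraction = n_changed / n_records if n_records else 0.0
    if changed_fraction <= change_tol:
        return True

    # (c) Relative change in the cost function between iterations.
    if prev_cost > 0:
        relative_change = abs(prev_cost - cost) / prev_cost
        if relative_change <= cost_tol:
            return True

    return False
```

A caller would evaluate this after each assign-and-update pass and, once it returns True, store the final cluster assignments. Under claim 2, a client request received through a programmatic interface is another route to termination, which this sketch does not model.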

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G06N20/00 (Primary)

    Machine learning · CPC title

  • G06F16/285 (Primary)

    Clustering or classification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10354201B1 cover?
A number of attributes of different attribute types, to be used to assign observation records of a data set to clusters, are identified. Attribute-type-specific distance metrics for the attributes, which can be combined to obtain a normalized aggregated distance of an observation record from a cluster representative, are selected. One or more iterations of a selected clustering methodology are …
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 16, 2019 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).