Scalable clustering for mixed machine learning data

US10354201B1 · US · B1

Patent metadata
Field	Value
Publication number	US-10354201-B1
Application number	US-201614990171-A
Country	US
Kind code	B1
Filing date	Jan 7, 2016
Priority date	Jan 7, 2016
Publication date	Jul 16, 2019
Grant date	Jul 16, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A number of attributes of different attribute types, to be used to assign observation records of a data set to clusters, are identified. Attribute-type-specific distance metrics for the attributes, which can be combined to obtain a normalized aggregated distance of an observation record from a cluster representative, are selected. One or more iterations of a selected clustering methodology are implemented on the data set using resources of a machine learning service until targeted termination criteria are met. A given iteration includes assigning the observations to clusters of a current version of a clustering model based on the aggregated distances from the cluster representatives of the current version, and updating the cluster representatives to generate a new version of the clustering model.
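The abstract's core idea (one distance metric per attribute type, combined into a normalized aggregate distance, then repeated assign-and-update iterations) can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the `schema` layout, the choice of metrics (scaled absolute difference, 0/1 mismatch, Jaccard distance on word sets), the equal-weight mean, and the use of the most frequent value as the representative for non-numeric attributes are all assumptions made for illustration.

```python
from collections import Counter

def numeric_distance(a, b, value_range):
    """Absolute difference scaled by an assumed attribute range, clamped to [0, 1]."""
    if not value_range:
        return 0.0
    return min(1.0, abs(a - b) / value_range)

def categorical_distance(a, b):
    """0 when the categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def text_distance(a, b):
    """Jaccard distance between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return 1.0 - len(words_a & words_b) / len(union) if union else 0.0

def aggregate_distance(record, representative, schema):
    """Equal-weight mean of per-attribute distances, so the result stays in [0, 1]."""
    total = 0.0
    for name, (kind, value_range) in schema.items():
        if kind == "numeric":
            total += numeric_distance(record[name], representative[name], value_range)
        elif kind == "categorical":
            total += categorical_distance(record[name], representative[name])
        else:  # treated as text
            total += text_distance(record[name], representative[name])
    return total / len(schema)

def assign(records, representatives, schema):
    """Assign each record to the index of its nearest cluster representative."""
    return [
        min(range(len(representatives)),
            key=lambda k: aggregate_distance(record, representatives[k], schema))
        for record in records
    ]

def update(records, assignments, representatives, schema):
    """Recompute representatives: mean for numeric, most frequent value otherwise."""
    new_representatives = []
    for k, old in enumerate(representatives):
        members = [r for r, a in zip(records, assignments) if a == k]
        if not members:  # keep the old representative for an empty cluster
            new_representatives.append(old)
            continue
        rep = {}
        for name, (kind, _) in schema.items():
            values = [m[name] for m in members]
            if kind == "numeric":
                rep[name] = sum(values) / len(values)
            else:
                rep[name] = Counter(values).most_common(1)[0][0]
        new_representatives.append(rep)
    return new_representatives

# Hypothetical toy data: attribute name -> (attribute type, assumed numeric range)
schema = {"price": ("numeric", 100.0), "color": ("categorical", None), "title": ("text", None)}
records = [
    {"price": 10.0, "color": "red", "title": "small red mug"},
    {"price": 12.0, "color": "red", "title": "red mug"},
    {"price": 90.0, "color": "blue", "title": "large blue vase"},
    {"price": 95.0, "color": "blue", "title": "blue vase"},
]
representatives = [dict(records[0]), dict(records[2])]  # initial model from a subset

assignments = assign(records, representatives, schema)             # one assignment step
new_representatives = update(records, assignments, representatives, schema)  # one update step
```

Running `assign` and `update` in a loop until a stopping rule is satisfied gives the iteration pattern the abstract describes. A real system would likely weight attributes differently and use a richer text metric.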

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: one or more computing devices of a machine learning service implemented at a provider network; wherein the one or more computing devices are configured to: identify a data source from which a plurality of observation records of a data set are to be obtained, wherein a particular observation record of the plurality of observation records comprises (a) a first attribute of a first attribute type of a set of attribute types, wherein members of the set include text attributes, numeric attributes and categorical attributes, and (b) a second attribute of a second attribute type of the set of attribute types; select a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein the first and second distance metrics are to be used collectively to determine a multi-attribute distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; determine, using a subset of observation records of the data set, an initial version of a model of the data set, wherein the initial version of the model comprises a respective initial cluster representative associated with individual ones of the plurality of clusters, and wherein the subset of observation records excludes at least one observation record of the data set; perform one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective multi-attribute distance of the particular observation record from individual cluster representatives of a particular version of the model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the other version comprises modifying at least one cluster representative included in the particular version of the model; in response to determining that a termination criterion of the particular clustering methodology has been met, store, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and cause a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

2. The system as recited in claim 1, wherein determining that the termination criterion has been met comprises one or more of: (a) receiving an indication via a programmatic interface from a client of the machine learning service, or (b) determining, after a particular iteration of the one or more iterations, a relative convergence cost error metric associated with the particular iteration.

3. The system as recited in claim 1, wherein the one or more computing devices are configured to: select, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service, wherein the number of execution platforms selected is based at least in part on one or more of: (a) an estimate of a computation workload associated with the particular iteration, or (b) a utilization metric of the pool of execution platforms.

4. The system as recited in claim 1, wherein the particular clustering methodology comprises a use of a version of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

5. The system as recited in claim 1, wherein the one or more computing devices are configured to: provide an indication, to a client via a programmatic interface, of (a) a first metric of discriminative utility associated with the first attribute, and (b) a second metric of discriminative utility associated with the second attribute.

6. A method, comprising: performing, by one or more computing devices: determining that a particular observation record of a data set includes a heterogeneous collection of attributes, including (a) a first attribute of a first attribute type of a set of attribute types and (b) a second attribute of a second attribute type of the set of attribute types, wherein the data set comprises a plurality of observation records including the particular observation record; selecting a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein at least one distance metric of the first and second distance metrics is to be used to determine an aggregate distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; performing, using one or more resources of a network-accessible machine learning service, one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective aggregate distance of the particular observation record from cluster representatives of a particular version of a model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the updated version comprises modifying at least one cluster representative included in the particular version of the model; in response to detecting that a termination criterion of the particular clustering methodology has been met, storing, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and causing a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

7. The method as recited in claim 6, wherein said detecting that the termination criterion has been met comprises determining, after a particular iteration of the plurality of iterations has been completed, that an estimate of a relative convergence cost error metric corresponding to the particular iteration has reached a threshold value.

8. The method as recited in claim 7, wherein the estimate of the relative convergence cost error metric is based at least in part on one or more of: (a) the total number of iterations which have been completed, (b) a fraction of observation records of the data set whose cluster assignment changed during the particular iteration, or (c) a relative change in a cost function computed during the particular iteration.

9. The method as recited in claim 6, further comprising performing, by the one or more computing devices: selecting, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service.

10. The method as recited in claim 6, wherein the particular clustering methodology comprises a use of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

11. The method as recited in claim 6, wherein the observation records of th…
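Claims 7 and 8 tie termination to an estimate of a relative convergence cost error metric, which may draw on the iteration count, the fraction of records whose cluster assignment changed, and the relative change in a cost function. The sketch below is one hedged reading of those inputs, not the claimed method: it treats each input as an independent stop signal rather than combining them into a single estimate, and the function names and threshold values are hypothetical.

```python
def fraction_reassigned(previous, current):
    """Share of records whose cluster assignment changed in this iteration."""
    if not current:
        return 0.0
    changed = sum(1 for p, c in zip(previous, current) if p != c)
    return changed / len(current)

def relative_cost_change(previous_cost, current_cost):
    """Relative change in the cost function between two consecutive iterations."""
    if previous_cost == 0:
        return 0.0
    return abs(previous_cost - current_cost) / abs(previous_cost)

def should_stop(iteration, previous, current, previous_cost, current_cost,
                max_iterations=50, min_reassigned=0.01, min_cost_change=1e-3):
    """Stop when any signal crosses its (hypothetical) threshold."""
    return (
        iteration >= max_iterations
        or fraction_reassigned(previous, current) <= min_reassigned
        or relative_cost_change(previous_cost, current_cost) <= min_cost_change
    )

# Hypothetical usage: a quarter of the records moved and cost fell by 10%, so keep iterating.
keep_going = not should_stop(3, [0, 0, 1, 1], [0, 1, 1, 1], 10.0, 9.0)
```

Claim 2 also allows a client to end the run through a programmatic interface; a check for such a request could be added as one more condition in `should_stop`.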

Assignees

Amazon Tech Inc

Inventors

Classifications

G06N20/00 · Primary
Machine learning · CPC title
G06F16/285 · Primary
Clustering or classification · CPC title

Patent family

Related publications grouped by family.

View patent family 67220544

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10354201B1 cover? A number of attributes of different attribute types, to be used to assign observation records of a data set to clusters, are identified. Attribute-type-specific distance metrics for the attributes, which can be combined to obtain a normalized aggregated distance of an observation record from a cluster representative, are selected. One or more iterations of a selected clustering methodology are …
Who is the assignee on this patent? Amazon Tech Inc
What technology area does this patent fall under? The primary CPC classification is G06N20/00 (Machine learning). Mapped technology areas include Physics.
When was this patent published? It was published on Jul 16, 2019 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Persona based data mining system

Probabilistic matrix factorization system based on personas

Item attribute based data mining system

Publishing RDF quads as relational views

Opinion aggregation system

Method and system for presenting RDF data as a set of relational views
