Persona based data mining system
US-10157351-B1 · Dec 18, 2018 · US
US10354201B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10354201-B1 |
| Application number | US-201614990171-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 7, 2016 |
| Priority date | Jan 7, 2016 |
| Publication date | Jul 16, 2019 |
| Grant date | Jul 16, 2019 |
Official abstract text for this publication.
A number of attributes of different attribute types, to be used to assign observation records of a data set to clusters, are identified. Attribute-type-specific distance metrics for the attributes, which can be combined to obtain a normalized aggregated distance of an observation record from a cluster representative, are selected. One or more iterations of a selected clustering methodology are implemented on the data set using resources of a machine learning service until targeted termination criteria are met. A given iteration includes assigning the observations to clusters of a current version of a clustering model based on the aggregated distances from the cluster representatives of the current version, and updating the cluster representatives to generate a new version of the clustering model.
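The aggregated distance described in the abstract can be illustrated with a minimal sketch. The abstract text does not name the per-type metrics or the combination rule, so everything specific here is an assumption made for illustration: a range-scaled absolute difference for numeric attributes, a 0/1 mismatch for categorical attributes, a Jaccard distance over word sets for text attributes, and an unweighted mean as the normalized aggregate. The names `schema`, `ranges`, and all function names are hypothetical.

```python
def numeric_distance(a: float, b: float, value_range: float) -> float:
    """Absolute difference scaled into [0, 1] by the attribute's value range (assumed metric)."""
    if value_range == 0:
        return 0.0
    return min(abs(a - b) / value_range, 1.0)


def categorical_distance(a: str, b: str) -> float:
    """0 when the categories match, 1 otherwise (assumed metric)."""
    return 0.0 if a == b else 1.0


def text_distance(a: str, b: str) -> float:
    """Jaccard distance between lowercase word sets (assumed metric)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def aggregated_distance(record: dict, representative: dict,
                        schema: dict, ranges: dict) -> float:
    """Combine per-attribute distances, each in [0, 1], into their unweighted mean."""
    distances = []
    for name, attr_type in schema.items():
        x, r = record[name], representative[name]
        if attr_type == "numeric":
            distances.append(numeric_distance(x, r, ranges[name]))
        elif attr_type == "categorical":
            distances.append(categorical_distance(x, r))
        elif attr_type == "text":
            distances.append(text_distance(x, r))
        else:
            raise ValueError(f"unsupported attribute type: {attr_type}")
    return sum(distances) / len(distances)


# Hypothetical schema and data, used only to exercise the sketch.
schema = {"age": "numeric", "plan": "categorical", "bio": "text"}
ranges = {"age": 50.0}
record = {"age": 30, "plan": "pro", "bio": "likes hiking and tea"}
representative = {"age": 40, "plan": "free", "bio": "likes tea"}
d = aggregated_distance(record, representative, schema, ranges)
```

Because each per-attribute distance is bounded by 1, the mean is also bounded by 1, which keeps attributes of different types on a comparable scale. A weighted mean would be an equally plausible choice under the same assumptions.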
Opening claim text (preview).
What is claimed is:

1. A system, comprising: one or more computing devices of a machine learning service implemented at a provider network; wherein the one or more computing devices are configured to: identify a data source from which a plurality of observation records of a data set are to be obtained, wherein a particular observation record of the plurality of observation records comprises (a) a first attribute of a first attribute type of a set of attribute types, wherein members of the set include text attributes, numeric attributes and categorical attributes, and (b) a second attribute of a second attribute type of the set of attribute types; select a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein the first and second distance metrics are to be used collectively to determine a multi-attribute distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; determine, using a subset of observation records of the data set, an initial version of a model of the data set, wherein the initial version of the model comprises a respective initial cluster representative associated with individual ones of the plurality of clusters, and wherein the subset of observation records excludes at least one observation record of the data set; perform one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective multi-attribute distance of the particular observation record from individual cluster representatives of a particular version of the model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the other version comprises modifying at least one cluster representative included in the particular version of the model; in response to determining that a termination criterion of the particular clustering methodology has been met, store, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and cause a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

2. The system as recited in claim 1, wherein determining that the termination criterion has been met comprises one or more of: (a) receiving an indication via a programmatic interface from a client of the machine learning service, or (b) determining, after a particular iteration of the one or more iterations, a relative convergence cost error metric associated with the particular iteration.

3. The system as recited in claim 1, wherein the one or more computing devices are configured to: select, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service, wherein the number of execution platforms selected is based at least in part on one or more of: (a) an estimate of a computation workload associated with the particular iteration, or (b) a utilization metric of the pool of execution platforms.

4. The system as recited in claim 1, wherein the particular clustering methodology comprises a use of a version of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

5. The system as recited in claim 1, wherein the one or more computing devices are configured to: provide an indication, to a client via a programmatic interface, of (a) a first metric of discriminative utility associated with the first attribute, and (b) a second metric of discriminative utility associated with the second attribute.

6. A method, comprising: performing, by one or more computing devices: determining that a particular observation record of a data set includes a heterogeneous collection of attributes, including (a) a first attribute of a first attribute type of a set of attribute types and (b) a second attribute of a second attribute type of the set of attribute types, wherein the data set comprises a plurality of observation records including the particular observation record; selecting a first distance metric associated with the first attribute type, and a second distance metric associated with the second attribute type, wherein at least one distance metric of the first and second distance metrics is to be used to determine an aggregate distance of the particular observation record from a respective cluster representative of individual clusters of a plurality of clusters to which individual ones of the observation records of the data set are to be assigned using a particular clustering methodology; performing, using one or more resources of a network-accessible machine learning service, one or more iterations of the particular clustering methodology, wherein an individual iteration of the plurality of iterations comprises: assigning, based at least in part on a respective aggregate distance of the particular observation record from cluster representatives of a particular version of a model of the data set, the particular observation record to a particular cluster of the plurality of clusters; and generating an updated version of the model of the data set, wherein said generating the updated version comprises modifying at least one cluster representative included in the particular version of the model; in response to detecting that a termination criterion of the particular clustering methodology has been met, storing, with respect to one or more observation records of the data set, a respective indication of assignment of the observation record to a particular cluster of the plurality of clusters; and causing a user interface to display clustering results of the data set, wherein the user interface is configured to permit browsing of summary information of the individual clusters.

7. The method as recited in claim 6, wherein said detecting that the termination criterion has been met comprises determining, after a particular iteration of the plurality of iterations has been completed, that an estimate of a relative convergence cost error metric corresponding to the particular iteration has reached a threshold value.

8. The method as recited in claim 7, wherein the estimate of the relative convergence cost error metric is based at least in part on one or more of: (a) the total number of iterations which have been completed, (b) a fraction of observation records of the data set whose cluster assignment changed during the particular iteration, or (c) a relative change in a cost function computed during the particular iteration.

9. The method as recited in claim 6, further comprising performing, by the one or more computing devices: selecting, for a particular iteration of the one or more iterations, one or more execution platforms from a pool of execution platforms of the machine learning service.

10. The method as recited in claim 6, wherein the particular clustering methodology comprises a use of one or more of: (a) a K-means algorithm, (b) a K-medians algorithm, (c) a K-harmonic-means algorithm, or (d) a MeanShift algorithm.

11. The method as recited in claim 6, wherein the observation records of th
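The assign-and-update iteration and the termination signals recited in claims 1 and 6 to 8 can be sketched roughly as follows. This is a simplified illustration, not the claimed implementation: each record is assumed to be a `(numeric, categorical)` tuple, the update rule (mean for the numeric attribute, most common value for the categorical one) is a swapped-in choice in the spirit of K-means, and the function names, thresholds, and the random-half subset used for initialization are all hypothetical.

```python
import random
from collections import Counter


def distance(record, rep, numeric_range):
    """Mean of a range-scaled numeric distance and a 0/1 categorical mismatch (assumed metrics)."""
    num = abs(record[0] - rep[0]) / numeric_range if numeric_range else 0.0
    cat = 0.0 if record[1] == rep[1] else 1.0
    return (min(num, 1.0) + cat) / 2.0


def update_representative(members, old_rep):
    """New representative: numeric mean and most common category of the cluster's members."""
    if not members:
        return old_rep  # an empty cluster keeps its previous representative
    mean = sum(m[0] for m in members) / len(members)
    mode = Counter(m[1] for m in members).most_common(1)[0][0]
    return (mean, mode)


def cluster(records, k, max_iterations=50, reassigned_threshold=0.01,
            cost_change_threshold=1e-4, seed=0):
    rng = random.Random(seed)
    values = [r[0] for r in records]
    numeric_range = max(values) - min(values)
    # Initial model built from a subset of the records (assumption: a random half).
    subset = rng.sample(records, max(k, len(records) // 2))
    reps = rng.sample(subset, k)
    assignments = [None] * len(records)
    previous_cost = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each record goes to its nearest representative.
        new_assignments, cost = [], 0.0
        for r in records:
            dists = [distance(r, rep, numeric_range) for rep in reps]
            best = min(range(k), key=dists.__getitem__)
            new_assignments.append(best)
            cost += dists[best]
        changed = sum(a != b for a, b in zip(assignments, new_assignments))
        assignments = new_assignments
        # Update step: produce the next version of the model.
        reps = [update_representative(
                    [r for r, a in zip(records, assignments) if a == c], reps[c])
                for c in range(k)]
        # Termination signals: fraction of reassigned records and relative cost change.
        fraction_changed = changed / len(records)
        relative_change = (abs(previous_cost - cost) / previous_cost
                           if previous_cost else float("inf"))
        previous_cost = cost
        if (fraction_changed <= reassigned_threshold
                or relative_change <= cost_change_threshold):
            break
    return assignments, reps, iteration


# Hypothetical data: two well-separated groups.
records = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (9.0, "b"), (9.5, "b"), (10.0, "b")]
assignments, reps, iterations = cluster(records, k=2)
```

The `max_iterations` cap plays the role of the iteration-count signal, while the two thresholds correspond to the reassigned-fraction and relative-cost-change signals. How the patent combines these into a single metric is not stated in the claim text shown here, so the simple "either threshold" rule above is only one possibility.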
Machine learning · CPC title
Clustering or classification · CPC title