Concurrent binning of machine learning data
US-2015379428-A1 · Dec 31, 2015 · US
US2016132787A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016132787-A1 |
| Application number | US-201514598628-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 16, 2015 |
| Priority date | Nov 11, 2014 |
| Publication date | May 12, 2016 |
| Grant date | — |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A system is provided for multi-methodology, multi-user, self-optimizing Machine Learning as a Service for that automates and optimizes the model training process. The system uses a large-scale distributed architecture and is compatible with cloud services. The system uses a hybrid optimization technique to select between multiple machine learning approaches for a given dataset. The system can also use datasets to transferring knowledge of how one modeling methodology has previously worked over to a new problem.
Opening claim text (preview).
What is claimed is: 1 . A system to automate selection and training of machine learning models across multiple modeling methodologies, the system comprising: a model methodology repository configured to store one or more model methodology implementations, each of the model methodology implementations associated with a modeling methodology; a dataset repository configured to store datasets; a data hub configured to store data run records and performance records; a dataset upload interface (UI) configured to receive a dataset, store the received dataset within the dataset repository, to generate a data run record comprising the location of received dataset within the dataset repository, and to store the generated data run record to the data hub; and a processing cluster comprising a plurality of worker nodes, each of the worker nodes configured to select a data run record from the data hub, to select a dataset from the dataset repository, to select a modeling methodology from the model methodology repository; to generate a parameterization within with the model methodology, to generate a model having the selected modeling methodology and generated parameterization, to train the generated model on the selected dataset, to evaluate the performance of the trained model on the selected dataset, to generate a performance record, and to store the generated performance record to the data hub. 2 . The system of claim 1 wherein each of the data run records comprising a dataset location identifying one of the stored datasets within the dataset repository, wherein the each of the worker nodes is configured to select a dataset from the dataset repository based upon the dataset location identified by the data run record. 3 . The system of claim 2 wherein each of the performance records is associated with a data run record and a modeling methodology, each of the performance records comprising a parameterization within the associated modeling methodology and performance data indicating the performance of the model parameterization on the associated dataset, wherein each of the worker nodes is configured to and to generate a performance record comprising the evaluated performance and associated with the selected data run, the selected modeling methodology, and the generated parameterization. 4 . The system of claim 2 wherein the dataset UI is further configured to receive one or more parameters and to store the one of more parameters with a data run record. 5 . The system of claim 4 wherein the parameters include a wall time budget, a performance threshold, number of models to evaluate, or a performance metric. 6 . The system of claim 5 wherein at least one of the worker nodes is configured to correlate the performance of models on a first dataset to the performance of models on a second dataset. 7 . The system of claim 5 wherein at least one of the worker nodes is configured to use a Bandit strategy to optimize a model for a dataset. 8 . The system of claim 7 wherein the parameters include a Bandit strategy memory type, a Bandit strategy reward type, or a Bandit strategy grouping type. 9 . The system of claim 7 wherein at least one of the worker nodes is configured to use a Gaussian Process (GP) model to select a model for a dataset, wherein the selected model maximizes an acquisition function. 10 . The system of claim 9 wherein the parameters include the acquisition function. 11 . The system of claim 1 further comprising a trained model repository, wherein at least one of the worker nodes is configured to store a trained model within the trained model repository. 12 . A method for machine learning comprising: (a) generating a plurality modeling possibilities across a plurality of modeling methodologies; (b) receiving a first dataset; (c) selecting a first plurality of models from the modeling possibilities; (d) evaluating a performance of each one of the first plurality of models on the first dataset; (e) receiving a second dataset; (f) selecting a second plurality of models from the modeling possibilities; (g) evaluating a performance of each one of the second plurality of models on the second dataset; (h) receiving a third dataset; (i) selecting a third plurality of models from the modeling possibilities; (j) evaluating a performance of each one of the third plurality of models on the third dataset; (k) generating a first performance vector comprising the performance of each one of the first plurality of models on the first dataset; (l) generating a second performance vector comprising the performance of each one of the second plurality of models on the second dataset; (m) generating a third performance vector comprising the performance of each one of the third plurality of models on the third dataset; (n) selecting from the first and second datasets, the most similar dataset based upon comparing a similarity between the first and third performance vectors and a similarity between the second and third performance vectors; (o) among the models trained for the most similar dataset, select the one with the highest performance on the most similar dataset; (p) evaluating a performance of the selected model on the third dataset; (q) add the performance of the selected model on the third dataset to the third performance vector; and (r) returning a model from the third performance vector having a highest performance of models in the third performance vector. 13 . The method of claim 12 wherein the steps (n)-(r) are repeated until the model having the highest performance from the third performance vector has a performance greater than or equal to a predetermined performance threshold. 14 . The method of claim 12 wherein the steps (n)-(r) are repeated until a predetermined wall time budget is exceeded. 15 . The method of claim 12 wherein the steps (n)-(r) are repeated until performance of a predetermined number of models is evaluated. 16 . The method of claim 12 wherein evaluating the performance of each one of the first plurality of models on the first dataset comprises storing a plurality of performances records to a database, wherein generate a first performance vector comprising the performance of each one of the first plurality of models on the first dataset comprises retrieving the first plurality of performance records from the database, wherein each of the plurality of performance records is associated with the first dataset and one of the first plurality of models, wherein each of the plurality of performance records comprises performance data indicating the performance of the associated model on the first dataset. 17 . The method of claim 12 further comprising: estimating the performance of one or more of the modeling possibilities not in the third plurality of models on the third dataset using collaborative filtering or matrix factorization techniques; and adding the estimated performances to the third performance vector. 18 . The method of claim 12 wherein generating a plurality modeling possibilities across a plurality of modeling methodologies comprises: enumerating a plurality of hyperpartitions across a plurality of modeling methodologies; and for optimizable model parameters and hyperparameters, choose a feasible step size to derive a plurality of modeling possibilities. 19 . A method for machine learning comprising: (a) receiving a dataset; (b) enumerating a plurality of hyperpartitions across a plurality of modeling methodologies; (c) generating a pluralit
Physics · mapped topic
using kernel methods, e.g. support vector machines [SVM] · CPC title
Machine learning · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.