Distributed, multi-model, self-learning platform for machine learning

US2016132787A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2016132787-A1
Application number: US-201514598628-A
Country: US
Kind code: A1
Filing date: Jan 16, 2015
Priority date: Nov 11, 2014
Publication date: May 12, 2016
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system is provided for multi-methodology, multi-user, self-optimizing Machine Learning as a Service for that automates and optimizes the model training process. The system uses a large-scale distributed architecture and is compatible with cloud services. The system uses a hybrid optimization technique to select between multiple machine learning approaches for a given dataset. The system can also use datasets to transferring knowledge of how one modeling methodology has previously worked over to a new problem.
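
As a rough illustration of what selecting between multiple machine learning approaches for a dataset can look like, the sketch below treats each modeling methodology as an arm of a multi-armed bandit. The claims further down mention a Bandit strategy, but the text on this page does not say which one, so the UCB1 rule used here is an assumption. The names `ucb1_select`, `run_bandit`, and `evaluate` are hypothetical; `evaluate` stands in for training and scoring one model of the chosen methodology.

```python
import math


def ucb1_select(counts, totals, c=2.0):
    """Pick the methodology with the highest UCB1 score.

    counts[m] is how many models of methodology m have been evaluated;
    totals[m] is the sum of their scores. Untried methodologies go first.
    UCB1 is an assumed strategy, not one named in the abstract.
    """
    for arm, n in counts.items():
        if n == 0:
            return arm
    total_pulls = sum(counts.values())

    def score(arm):
        mean = totals[arm] / counts[arm]
        bonus = math.sqrt(c * math.log(total_pulls) / counts[arm])
        return mean + bonus

    return max(counts, key=score)


def run_bandit(evaluate, methodologies, budget):
    """Spend `budget` evaluations across methodologies.

    Returns ((best_methodology, best_score), counts).
    """
    counts = {m: 0 for m in methodologies}
    totals = {m: 0.0 for m in methodologies}
    best = (None, float("-inf"))
    for _ in range(budget):
        arm = ucb1_select(counts, totals)
        reward = evaluate(arm)  # hypothetical: train one model, return its score
        counts[arm] += 1
        totals[arm] += reward
        if reward > best[1]:
            best = (arm, reward)
    return best, counts
```

Calling `run_bandit(evaluate, ["svm", "tree", "knn"], budget=200)` spends most of the budget on whichever methodology keeps scoring well while still sampling the others occasionally. The budget plays a role similar to the "number of models to evaluate" parameter in claim 5.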

First claim

Opening claim text (preview).

What is claimed is:

1. A system to automate selection and training of machine learning models across multiple modeling methodologies, the system comprising: a model methodology repository configured to store one or more model methodology implementations, each of the model methodology implementations associated with a modeling methodology; a dataset repository configured to store datasets; a data hub configured to store data run records and performance records; a dataset upload interface (UI) configured to receive a dataset, store the received dataset within the dataset repository, to generate a data run record comprising the location of received dataset within the dataset repository, and to store the generated data run record to the data hub; and a processing cluster comprising a plurality of worker nodes, each of the worker nodes configured to select a data run record from the data hub, to select a dataset from the dataset repository, to select a modeling methodology from the model methodology repository; to generate a parameterization within with the model methodology, to generate a model having the selected modeling methodology and generated parameterization, to train the generated model on the selected dataset, to evaluate the performance of the trained model on the selected dataset, to generate a performance record, and to store the generated performance record to the data hub.

2. The system of claim 1 wherein each of the data run records comprising a dataset location identifying one of the stored datasets within the dataset repository, wherein the each of the worker nodes is configured to select a dataset from the dataset repository based upon the dataset location identified by the data run record.

3. The system of claim 2 wherein each of the performance records is associated with a data run record and a modeling methodology, each of the performance records comprising a parameterization within the associated modeling methodology and performance data indicating the performance of the model parameterization on the associated dataset, wherein each of the worker nodes is configured to and to generate a performance record comprising the evaluated performance and associated with the selected data run, the selected modeling methodology, and the generated parameterization.

4. The system of claim 2 wherein the dataset UI is further configured to receive one or more parameters and to store the one of more parameters with a data run record.

5. The system of claim 4 wherein the parameters include a wall time budget, a performance threshold, number of models to evaluate, or a performance metric.

6. The system of claim 5 wherein at least one of the worker nodes is configured to correlate the performance of models on a first dataset to the performance of models on a second dataset.

7. The system of claim 5 wherein at least one of the worker nodes is configured to use a Bandit strategy to optimize a model for a dataset.

8. The system of claim 7 wherein the parameters include a Bandit strategy memory type, a Bandit strategy reward type, or a Bandit strategy grouping type.

9. The system of claim 7 wherein at least one of the worker nodes is configured to use a Gaussian Process (GP) model to select a model for a dataset, wherein the selected model maximizes an acquisition function.

10. The system of claim 9 wherein the parameters include the acquisition function.

11. The system of claim 1 further comprising a trained model repository, wherein at least one of the worker nodes is configured to store a trained model within the trained model repository.

12. A method for machine learning comprising:
    (a) generating a plurality modeling possibilities across a plurality of modeling methodologies;
    (b) receiving a first dataset;
    (c) selecting a first plurality of models from the modeling possibilities;
    (d) evaluating a performance of each one of the first plurality of models on the first dataset;
    (e) receiving a second dataset;
    (f) selecting a second plurality of models from the modeling possibilities;
    (g) evaluating a performance of each one of the second plurality of models on the second dataset;
    (h) receiving a third dataset;
    (i) selecting a third plurality of models from the modeling possibilities;
    (j) evaluating a performance of each one of the third plurality of models on the third dataset;
    (k) generating a first performance vector comprising the performance of each one of the first plurality of models on the first dataset;
    (l) generating a second performance vector comprising the performance of each one of the second plurality of models on the second dataset;
    (m) generating a third performance vector comprising the performance of each one of the third plurality of models on the third dataset;
    (n) selecting from the first and second datasets, the most similar dataset based upon comparing a similarity between the first and third performance vectors and a similarity between the second and third performance vectors;
    (o) among the models trained for the most similar dataset, select the one with the highest performance on the most similar dataset;
    (p) evaluating a performance of the selected model on the third dataset;
    (q) add the performance of the selected model on the third dataset to the third performance vector; and
    (r) returning a model from the third performance vector having a highest performance of models in the third performance vector.

13. The method of claim 12 wherein the steps (n)-(r) are repeated until the model having the highest performance from the third performance vector has a performance greater than or equal to a predetermined performance threshold.

14. The method of claim 12 wherein the steps (n)-(r) are repeated until a predetermined wall time budget is exceeded.

15. The method of claim 12 wherein the steps (n)-(r) are repeated until performance of a predetermined number of models is evaluated.

16. The method of claim 12 wherein evaluating the performance of each one of the first plurality of models on the first dataset comprises storing a plurality of performances records to a database, wherein generate a first performance vector comprising the performance of each one of the first plurality of models on the first dataset comprises retrieving the first plurality of performance records from the database, wherein each of the plurality of performance records is associated with the first dataset and one of the first plurality of models, wherein each of the plurality of performance records comprises performance data indicating the performance of the associated model on the first dataset.

17. The method of claim 12 further comprising: estimating the performance of one or more of the modeling possibilities not in the third plurality of models on the third dataset using collaborative filtering or matrix factorization techniques; and adding the estimated performances to the third performance vector.

18. The method of claim 12 wherein generating a plurality modeling possibilities across a plurality of modeling methodologies comprises: enumerating a plurality of hyperpartitions across a plurality of modeling methodologies; and for optimizable model parameters and hyperparameters, choose a feasible step size to derive a plurality of modeling possibilities.

19. A method for machine learning comprising: (a) receiving a dataset; (b) enumerating a plurality of hyperpartitions across a plurality of modeling methodologies; (c) generating a pluralit
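
To make steps (k) through (r) of claim 12 more concrete, the sketch below stores each performance vector as a mapping from a model identifier to its score on one dataset. The claim does not say how similarity between vectors is measured, so the negative mean absolute difference over models evaluated on both datasets is an assumption. Skipping models already present in the third vector is also an assumption, added so that repeating the steps (as in claims 13 to 15) makes progress. The names `similarity`, `transfer_step`, `recommend`, and `evaluate_on_target` are hypothetical.

```python
def similarity(vec_a, vec_b):
    """Negative mean absolute difference over models present in both vectors.

    Assumed measure; higher means more similar. Returns -inf when the two
    vectors share no models.
    """
    shared = vec_a.keys() & vec_b.keys()
    if not shared:
        return float("-inf")
    return -sum(abs(vec_a[m] - vec_b[m]) for m in shared) / len(shared)


def transfer_step(known_vectors, target_vector, evaluate_on_target):
    """One pass over steps (n) to (q). Returns the model added, or None."""
    # (n) pick the earlier dataset whose vector is closest to the target's
    nearest = max(
        known_vectors,
        key=lambda name: similarity(known_vectors[name], target_vector),
    )
    # (o) best model on that dataset, skipping ones already tried on the target
    candidates = {
        m: p for m, p in known_vectors[nearest].items() if m not in target_vector
    }
    if not candidates:
        return None
    model = max(candidates, key=candidates.get)
    # (p), (q) evaluate on the target dataset and record the result
    target_vector[model] = evaluate_on_target(model)
    return model


def recommend(known_vectors, target_vector, evaluate_on_target, max_steps=1):
    """Repeat the transfer step up to max_steps times, then apply step (r)."""
    for _ in range(max_steps):
        if transfer_step(known_vectors, target_vector, evaluate_on_target) is None:
            break
    # (r) return the best model seen so far on the target dataset
    return max(target_vector, key=target_vector.get)
```

Here `known_vectors` would hold the first and second performance vectors keyed by dataset name, and `target_vector` the third. The `max_steps` argument is a simple stand-in for the stopping rules in claims 13 to 15 (performance threshold, wall time budget, or number of models evaluated).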

Assignees

Inventors

Classifications

  • G06N99/005 · Primary

    Physics · mapped topic

  • G06N20/10 · Primary

    using kernel methods, e.g. support vector machines [SVM] · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016132787A1 cover?
A system is provided for multi-methodology, multi-user, self-optimizing Machine Learning as a Service for that automates and optimizes the model training process. The system uses a large-scale distributed architecture and is compatible with cloud services. The system uses a hybrid optimization technique to select between multiple machine learning approaches for a given dataset. The system can a…
Who is the assignee on this patent?
Drevo Will D, Veeramachaneni Kalyan K, O'Reilly Una-May, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 12, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).