Automated machine learning using nearest neighbor recommender systems

US11941541B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11941541-B2
Application number: US-202016988809-A
Country: US
Kind code: B2
Filing date: Aug 10, 2020
Priority date: Aug 10, 2020
Publication date: Mar 26, 2024
Grant date: Mar 26, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, computer program products and/or systems are provided that perform the following operations: obtaining a performance matrix representing accuracies obtained by executing a plurality of pipelines on a plurality of training data sets, wherein a pipeline comprises a series of operations performed on a data set; selecting a defined number of top pipelines as potential pipelines for a testing data set based, at least in part, on a similarity between the testing data set and each of the plurality of training data sets represented in the performance matrix; storing results from executing each of the potential pipelines as a new data set; determining a pipeline accuracy for each of the potential pipelines when executed against the testing data set; and providing a recommended pipeline for use with the testing data set based, at least in part, on the pipeline accuracy for each potential pipeline.
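
The selection step in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation. It assumes that the testing data set is represented by the accuracies of the pipelines already executed on it, that similarity is the inverse of the mean absolute accuracy difference over shared entries (the abstract names no metric), and that candidates are ordered by average rank over the nearest columns. The names `similarity`, `column_ranks`, `recommend`, and `PERF` are hypothetical.

```python
from statistics import mean


def similarity(observed, column):
    """Compare the testing data set with one training column.

    `observed` maps pipeline index -> accuracy on the testing data set.
    Only pipelines present in both are compared (None marks a missing entry).
    Assumption: inverse mean absolute difference; the source names no metric.
    """
    shared = [(acc, column[i]) for i, acc in observed.items()
              if column[i] is not None]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + mean(abs(a - b) for a, b in shared))


def column_ranks(column):
    """Rank pipelines within one column; rank 1 is the highest accuracy."""
    order = sorted(
        range(len(column)),
        key=lambda i: column[i] if column[i] is not None else float("-inf"),
        reverse=True,
    )
    ranks = [0] * len(column)
    for position, pipeline in enumerate(order, start=1):
        ranks[pipeline] = position
    return ranks


def recommend(matrix, observed, k_neighbors, n_pipelines):
    """Return up to `n_pipelines` unexecuted pipelines with the best average
    rank over the `k_neighbors` most similar training data sets.

    matrix[i][j] is the accuracy of pipeline i on training data set j.
    """
    n_pipes, n_sets = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_pipes)] for j in range(n_sets)]
    sims = [similarity(observed, col) for col in columns]
    nearest = sorted(range(n_sets), key=lambda j: sims[j],
                     reverse=True)[:k_neighbors]
    rank_tables = [column_ranks(columns[j]) for j in nearest]
    avg_rank = [mean(t[i] for t in rank_tables) for i in range(n_pipes)]
    candidates = [i for i in range(n_pipes) if i not in observed]
    candidates.sort(key=lambda i: avg_rank[i])
    return candidates[:n_pipelines]


# Made-up example: 4 pipelines (rows) x 3 training data sets (columns).
PERF = [
    [0.90, 0.60, 0.88],
    [0.70, 0.95, 0.72],
    [0.85, 0.65, 0.91],
    [0.60, 0.80, 0.58],
]

# Pipeline 0 has already scored 0.89 on the testing data set.
print(recommend(PERF, {0: 0.89}, k_neighbors=2, n_pipelines=2))  # [2, 1]
```

In this made-up example the first and third training data sets are the nearest neighbors, so pipeline 2, which ranks well on both, is proposed first. Executing the proposed pipelines and adding their accuracies to `observed` would give the next round more entries to compare.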

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: obtaining a performance matrix representing accuracies obtained by executing a plurality of machine-learning pipelines on a plurality of training data sets, wherein a machine-learning pipeline comprises a series of operations performed on a data set; selecting a defined number of top machine-learning pipelines as potential machine-learning pipelines for a testing data set based, at least in part, on computing a similarity between the testing data set and each of the plurality of training data sets represented in the performance matrix; determining a pipeline accuracy for each of the potential machine-learning pipelines when executed against the testing data set; providing a recommended machine-learning pipeline for use with the testing data set based, at least in part, on the pipeline accuracy for each potential machine-learning pipeline; and performing initialization for pipeline recommendation prior to the computing of similarities between the testing data set and each of the plurality of training data sets represented in the performance matrix, the initialization comprising: obtaining data set metafeatures for each of the plurality of training data sets in the performance matrix; obtaining data set metafeatures for the testing data set; determining a plurality of similar training data sets that are similar to the testing data set based in part on the data set metafeatures for each of the plurality of training data sets and the data set metafeatures for the testing data set; selecting a defined number of most similar training data sets; selecting one or more columns from the performance matrix that correspond to the selected most similar training data sets; determining an initialization stage average rank for each machine-learning pipeline based on the selected columns of the performance matrix; selecting a plurality of top machine-learning pipelines as initialization stage machine-learning pipelines based on the initialization stage average rank for each machine-learning pipeline; storing initialization stage results from executing each of the initialization stage machine-learning pipelines as an initialization data set; and providing the initialization data set as the testing data set; storing results from executing each of the potential machine-learning pipelines as a new data set.

2. The computer-implemented method of claim 1 wherein selecting the defined number of top machine-learning pipelines as potential machine-learning pipelines for the testing data set further comprises: selecting a defined number of columns of the performance matrix based on the similarity between the testing data set and each of the plurality of training data sets; and selecting the defined number of top machine-learning pipelines as potential machine-learning pipelines for the testing data set based on the selected columns of the performance matrix.

3. The computer-implemented method of claim 2 further comprising: determining an average rank for each machine-learning pipeline based on the selected columns of the performance matrix and the accuracies represented in the performance matrix; and selecting the defined number of top machine-learning pipelines based on the average rank for each machine-learning pipeline.

4. The computer-implemented method of claim 2, wherein when computing the similarity between the testing data set and each of the plurality of training data sets, only entries that are present in both the training data set and the testing data set being compared for similarity are used.

5. The computer-implemented method of claim 1 further comprising: determining a similarity-weighted mean and variance of pipeline accuracy for each machine-learning pipeline; and selecting the defined number of top machine-learning pipelines based on an expected, improvement criteria associated with the mean and variance.

6. The computer-implemented method of claim 1, wherein the data set metafeatures for each of the plurality of training data sets and the metafeatures for the testing data set comprise: a number of missing values in a data set; a number of categorical features; a number of real-valued features; and quantile distributions of the data set or individual features.

7. The computer-implemented method of claim 1, wherein each (i, j) entry in the performance matrix represents an accuracy obtained by executing a pipeline i and a training data set j.

8. The computer-implemented method of claim 1 further comprising performing an iterative series of pipeline accuracy determinations, for a defined number of iterations, in response to the determining of the pipeline accuracy for each of the potential machine-learning pipelines when executed against the testing data set and prior to providing the recommended machine-learning pipeline, the performing of the iterative series of pipeline accuracy determinations comprising: computing a similarity between the stored results from executing each of the potential machine-learning pipelines and each of the plurality of training data sets represented in the performance matrix; selecting a defined number of columns of the performance matrix based on the similarity between the stored results and each of the plurality of training data sets; determining an average rank for each machine-learning pipeline based on the selected columns of the performance matrix; selecting a defined number of top-ranked machine-learning pipelines that are unexecuted as potential machine-learning pipelines; storing iteration result sets from executing each of the potential machine-learning pipelines; determining the pipeline accuracy for each of the potential machine-learning pipelines when executed against the stored iteration result sets; and determining a new recommended machine-learning pipeline based on the pipeline accuracy for each potential machine-learning pipeline.

9. A computer program product comprising a computer readable storage medium having stored thereon: program instructions programmed to obtain a performance matrix representing accuracies obtained by executing a plurality of machine-learning pipelines on a plurality of training data sets, wherein a machine-learning pipeline comprises a series of operations performed on a data set; program instructions programmed to select a defined number of top machine-learning pipelines as potential machine-learning pipelines for a testing data set based on computing a similarity between the testing data set and each of the plurality of training data sets represented in the performance matrix; program instructions programmed to store results from executing each of the potential machine-learning pipelines as a new data set; program instructions programmed to determine a pipeline accuracy for each of the potential machine-learning pipelines when executed against the testing data set; program instructions programmed to provide a recommended machine-learning pipeline for use with the testing data set based, at least in part, on the pipeline accuracy for each potential machine-learning pipeline; and program instructions programmed to perform initialization for pipeline recommendation prior to the computing of similarities between the testing data set and each of the plurality of training data sets represented in the performance matrix, the initialization comprising: obtaining data set metafeatures for each of the plurality of training data sets in the performance matrix; obtaining data set metafeatures for the testing data set; determining a plurality of similar training data sets that are similar to the testing data set based in part on the data set metafeatures for each of
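
Claim 5 selects pipelines from a similarity-weighted mean and variance of accuracy, using an expected improvement criterion. The sketch below shows one way this could look. It is an assumption-laden illustration: the claim does not give a formula, so the code uses the Gaussian expected-improvement form common in Bayesian optimization, and the function names, the weights, and the `best_so_far` parameter are hypothetical.

```python
import math


def weighted_mean_var(accuracies, weights):
    """Similarity-weighted mean and variance of one pipeline's accuracies
    across training data sets (the weights are the similarities)."""
    total = sum(weights)
    mu = sum(w * a for w, a in zip(weights, accuracies)) / total
    var = sum(w * (a - mu) ** 2 for w, a in zip(weights, accuracies)) / total
    return mu, var


def expected_improvement(mu, var, best_so_far):
    """Expected improvement over `best_so_far` when maximizing accuracy.

    Assumption: accuracy is treated as Gaussian with the given mean and
    variance; the source does not specify the distribution or formula.
    """
    sigma = math.sqrt(var)
    if sigma == 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_so_far) * cdf + sigma * pdf


def top_by_expected_improvement(matrix, sims, best_so_far, n):
    """Return the indices of the `n` pipelines with the highest expected
    improvement. matrix[i][j] is the accuracy of pipeline i on training
    data set j; sims[j] is that data set's similarity to the testing set."""
    scored = []
    for i, row in enumerate(matrix):
        mu, var = weighted_mean_var(row, sims)
        scored.append((expected_improvement(mu, var, best_so_far), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]
```

Under this form, a pipeline whose accuracy varies widely across similar data sets can score above a steadier pipeline with the same mean, which is what lets the criterion favor exploration as well as known good performers.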

Assignees

IBM

Inventors

Classifications

  • G06N5/04 (Primary)

    Inference or reasoning models · CPC title

  • Matrix or vector computation {, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization (matrix transposition G06F7/78)} · CPC title

  • by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation · CPC title

  • based on specific statistical tests · CPC title

  • Machine learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11941541B2 cover?
Methods, computer program products and/or systems are provided that perform the following operations: obtaining a performance matrix representing accuracies obtained by executing a plurality of pipelines on a plurality of training data sets, wherein a pipeline comprises a series of operations performed on a data set; selecting a defined number of top pipelines as potential pipelines for a testi…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N5/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 26, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).