Scalable generation of multidimensional features for machine learning

US11295229B1 · US · B1

Patent metadata
Publication number: US-11295229-B1
Application number: US-201615132959-A
Country: US
Kind code: B1
Filing date: Apr 19, 2016
Priority date: Apr 19, 2016
Publication date: Apr 5, 2022
Grant date: Apr 5, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An approximate count of a subset of records of a data set is obtained using one or more transformation functions. The subset comprises records which contain a first value of one input variable, a second value of another input variable, and a particular value of a target variable. Using the approximate count, an approximate correlation metric for a multidimensional feature and the target variable is obtained. Based on the correlation metric, the multidimensional feature is included in a candidate feature set to be used to train a machine learning model.
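
The counting step in the abstract can be illustrated with a minimal sketch. It assumes min-wise hashing over sets of record ids, which is the transformation named in the first claim; the salted BLAKE2b hash, the helper names (`h`, `signature`, `build_signatures`, `approx_count`), the signature length of 256, and the synthetic records are illustrative assumptions, not details taken from the patent.

```python
import hashlib
from collections import defaultdict


def h(i, record_id):
    """i-th hash of a record id (salted BLAKE2b, an illustrative choice)."""
    digest = hashlib.blake2b(
        str(record_id).encode(), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def signature(record_ids, k):
    """Min-hash signature: the minimum of each of k hashes over the set."""
    return [min(h(i, r) for r in record_ids) for i in range(k)]


def build_signatures(records, k):
    """Group record ids by (variable index, value, target) and sign each group."""
    groups = defaultdict(set)
    for rid, (*inputs, target) in enumerate(records):
        for var, value in enumerate(inputs):
            groups[(var, value, target)].add(rid)
    # Keep the exact group size next to each signature; it is cheap to count.
    return {key: (signature(ids, k), len(ids)) for key, ids in groups.items()}


def approx_count(sigs, a, b, target, var_a=0, var_b=1):
    """Estimate how many records have x[var_a]==a, x[var_b]==b and y==target."""
    sig_a, n_a = sigs[(var_a, a, target)]
    sig_b, n_b = sigs[(var_b, b, target)]
    # The fraction of matching components estimates the Jaccard similarity J.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    jaccard = matches / len(sig_a)
    # |A & B| = J * (|A| + |B|) / (1 + J), which follows from J = |A & B| / |A | B|.
    return jaccard * (n_a + n_b) / (1 + jaccard)


# Synthetic records (x1, x2, y): exactly 300 have x1 == "a", x2 == "b", y == 1.
records = [
    ("a" if i < 600 else "other", "b" if i >= 300 else "other", 1)
    for i in range(900)
]
sigs = build_signatures(records, k=256)
estimate = approx_count(sigs, "a", "b", 1)
exact = sum(1 for rec in records if rec == ("a", "b", 1))
print(f"approximate={estimate:.0f} exact={exact}")
```

Only single-variable signatures are computed here; the count for the two-variable combination is derived from how many signature components match. The resulting count would then feed whichever correlation metric is used, which this sketch leaves out.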

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising:
  one or more computing devices of a machine learning service of a provider network;
  wherein the one or more computing devices are configured to:
    generate a set of multidimensional features from a plurality of input variables in a first data set of observation records, wherein the first data set is to be used to train a linear machine learning model to predict a target variable in the first data set based on a linear combination of values of individual selected input variables, and the set of multidimensional features includes a first multidimensional feature derived from a first input variable and a second input variable in the plurality of input variables in the first data set;
    select a first set of execution platforms to execute an analysis to identify a first candidate feature set of the first data set for training the linear machine learning model, wherein an amount of execution platforms selected for the analysis is reduced based on use of an approximation technique to approximate correlation metrics of the multidimensional features;
    execute the analysis using the set of execution platforms and the approximation technique to approximate a correlation metric between the first multidimensional feature and the target variable, including to:
      generate, using min-wise hashing, a set of multidimensional signatures using one or more transformation functions applied to the first data set of observation records, wherein individual ones of the multidimensional signatures correspond to respective subsets of the observation records in the first data set having respective combinations of values for one of the input variables and the target variable;
      determine an approximate count of a population of observation records of the first data set having a same combination of values, including (a) a first value for the first input variable, (b) a second value for the second input variable, and (c) a third value of the target variable, wherein the approximate count is determined based at least in part on a number of matching signature components in a first multidimensional signature generated for observation records having the first value for the first input variable and a second multidimensional signature generated for observation records having the second value for the second input variable; and
      generate, based at least in part on the approximate count of the population, an approximate value of the correlation metric between the first multidimensional feature and the target variable;
    include, based at least in part on determining that the approximate value of the correlation metric meets a first threshold criterion, the first multidimensional feature in the first candidate feature set for training the linear machine learning model;
    determine, with respect to the first multidimensional feature included in the first candidate feature set, an exact value of the correlation metric between the first multidimensional feature and the target variable;
    based at least in part on determining that the exact value of the correlation metric meets a second threshold criterion, initiate training of the linear machine learning model using a final feature set which includes the first multidimensional feature; and
    after the training of the linear machine learning model using the final feature set:
      determine that a prediction accuracy of the linear machine learning model does not meet a prediction accuracy criterion;
      perform a second analysis of the first data set using the approximation technique to identify a second candidate feature set that is different from the first candidate feature set and includes one or more multidimensional features but does not include the first multidimensional feature; and
      retrain the linear machine learning model using the second candidate feature set.

2. The system as recited in claim 1, wherein the one or more computing devices are configured to:
  determine that, with respect to a second data set, a preparation of a third candidate feature set is to include a calculation of a second approximate correlation metric between (a) a particular multidimensional feature of the second data set and (b) a target variable of the second data set, wherein the particular multidimensional feature is derived from a combination of at least three input variables of the second data set; and
  obtain the second approximate correlation metric based at least in part on an approximate co-occurrence count obtained using min-wise hashing for a quadratic feature of the second data set, wherein the quadratic feature is derived at least in part from (a) a first input variable of the at least three input variables and (b) a second input variable of the at least three input variables.

3. The system as recited in claim 1, wherein the one or more computing devices are configured to:
  determine that, with respect to a second data set, a second set of execution platforms are to be employed to compute multidimensional signatures of observation records of the second data set for min-wise hashing;
  subdivide the second data set into a plurality of partitions;
  initiate a determination of a first multidimensional signature corresponding to at least a portion of a first partition of the plurality of partitions at a first execution platform of the second set of execution platforms; and
  initiate a determination of a second multidimensional signature corresponding to at least a portion of a second partition of the plurality of partitions at a second execution platform of the second set of execution platforms.

4. The system as recited in claim 1, wherein the one or more computing devices are configured to:
  identify, with respect to a second data set, a plurality of features for which respective approximate correlation metrics are to be determined with respect to a target variable of the second data set;
  initiate a first determination of a first approximate correlation metric of a first feature of the plurality of features at a first execution platform of a second set of execution platforms, wherein the first determination is based at least in part on a multidimensional signature obtained using min-wise hashing from at least a first portion of the second data set; and
  initiate a second determination of a second approximate correlation metric of a second feature of the plurality of features at a second execution platform of the second set of execution platforms.

5. The system as recited in claim 1, wherein the one or more computing devices are configured to:
  select a particular number of hash functions to be used for the min-wise hashing based at least on part on one or more of: (a) an error threshold associated with the approximate count or (b) an indication of resource capacity available on the first set of execution platforms for the min-wise hashing.

6. A method, comprising:
  performing, by one or more computing devices:
    generating a set of higher-order features from a plurality of input variables in a first data set of observation records, wherein the first data set is to be used to train a linear machine learning model to predict a target variable in the first data set based on a linear combination of values of individual selected input variables, and the set of higher-order features includes a first higher-order feature derived from a first input variable and a second input variable in the plurality of input variables in the first data set;
    selecting a set of execution platforms to execute an analysis to identify a first candidate feature set of the first data set for training the linear machine learning model, wherein an amount of execution platforms selected for the analysis is reduced based on use of an approximation technique to approximate correlation metrics of the
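
Claims 3 and 5 describe computing signatures on separate execution platforms and choosing the number of hash functions from an error threshold or available capacity. The sketch below shows one way these two ideas could fit together. It assumes the standard min-hash properties (the signature of a union is the element-wise minimum of the parts, and the Jaccard estimate's standard error is at most 1/(2*sqrt(k))); the sizing rule, the `max_hashes` cap, and all function names are hypothetical and are not taken from the patent text.

```python
import hashlib
import math


def h(i, record_id):
    """i-th hash of a record id (salted BLAKE2b, an illustrative choice)."""
    digest = hashlib.blake2b(
        str(record_id).encode(), digest_size=8, salt=i.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def signature(record_ids, k):
    """Min-hash signature: the minimum of each of k hashes over the set."""
    return [min(h(i, r) for r in record_ids) for i in range(k)]


def hashes_for_error(max_std_error, max_hashes=None):
    """Pick a signature length k from an error threshold and a capacity cap.

    Assumed rule: sqrt(J * (1 - J) / k) <= 1 / (2 * sqrt(k)), so choosing
    k >= 1 / (4 * eps^2) keeps the worst-case standard error below eps.
    """
    k = math.ceil(1.0 / (4.0 * max_std_error ** 2))
    return k if max_hashes is None else min(k, max_hashes)


def merge(sig_x, sig_y):
    """Signature of the union of two record sets: element-wise minimum."""
    return [min(x, y) for x, y in zip(sig_x, sig_y)]


# The capacity cap (hypothetical) is lower than the error-based k, so it wins.
k = hashes_for_error(0.05, max_hashes=64)

# Record ids of one (value, target) group, split into two partitions that
# stand in for two execution platforms.
ids = list(range(1000))
partitions = [ids[:500], ids[500:]]
partial = [signature(p, k) for p in partitions]
merged = merge(partial[0], partial[1])
print(merged == signature(ids, k))
```

Because the merge is exact, partitioning changes where the work happens but not the resulting signature; only the choice of k affects the accuracy of the approximate count.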

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 5, 2022 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).