Missing value imputation for predictive models

US9443194B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9443194-B2
Application number: US-201213445796-A
Country: US
Kind code: B2
Filing date: Apr 12, 2012
Priority date: Feb 23, 2012
Publication date: Sep 13, 2016
Grant date: Sep 13, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided are techniques for imputing a missing value for each of one or more predictor variables. Data is received from one or more data sources. For each of the one or more predictor variables, an imputation model is built based on information of a target variable; a type of imputation model to construct is determined based on the one or more data sources, a measurement level of the predictor variable, and a measurement level of the target variable; and the determined type of imputation model is constructed using basic statistics of the predictor variable and the target variable. The missing value is imputed for each of the one or more predictor variables using the data from the one or more data sources and one or more built imputation models to generate a completed data set.
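
As an illustration of the model-type decision described in the abstract, the sketch below maps the measurement levels of a predictor and the target to the four model names that appear in claims 3 to 6. It is a minimal sketch, not the patented implementation: the names `MODEL_TYPES` and `choose_imputation_model` are hypothetical, and the sketch ignores the data-source criterion that the abstract also mentions.

```python
from typing import Literal

Level = Literal["continuous", "categorical"]

# (predictor level, target level) -> model type named in claims 3 to 6.
# The dictionary itself is an illustrative assumption, not text from the patent.
MODEL_TYPES = {
    ("continuous", "continuous"): "piecewise linear regression",
    ("continuous", "categorical"): "robust conditional mean",
    ("categorical", "continuous"): "minimum z-score category",
    ("categorical", "categorical"): "conditional mode",
}


def choose_imputation_model(predictor_level: Level, target_level: Level) -> str:
    """Return the imputation model type for a predictor/target level pair."""
    key = (predictor_level, target_level)
    if key not in MODEL_TYPES:
        raise ValueError(f"unsupported measurement levels: {key}")
    return MODEL_TYPES[key]
```

For example, `choose_imputation_model("categorical", "continuous")` returns `"minimum z-score category"`, matching the pairing stated in claim 5.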

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for imputing a missing value for each of one or more predictor variables, comprising: with each mapper from a set of mappers that receives data from a different data source, building an imputation model based on information of a target variable and that data source; with a reducer, randomly extracting validation samples from each different data source to create a global validation sample; with each mapper, scoring the imputation model built at that mapper based on the global validation sample and the imputation model built at each other mapper; with the reducer, selecting a top number of imputation models based on the scoring of each imputation model to form an ensemble model; and with each mapper, determining a type of imputation model to construct; imputing the missing value for each of the one or more predictor variables using the data from each different data source, each ensemble model, and the determined type of imputation model to generate a completed data set; and performing prediction, discovery, and interpretation of relationships between the target variable and the one or more predictor variables using the completed data set.

2. The method of claim 1, wherein a measurement level comprises one of continuous and categorical.

3. The method of claim 1, further comprising: determining that a predictor variable is continuous and the target variable is continuous; sorting records into data bins based on the predictor variable values; collecting, for each of the data bins, statistics comprising a number of records, a mean of the predictor variable, a mean of the target variable, a variance of the target variable, and a covariance of the predictor variable and the target variable; determining that the type of the imputation model is a piecewise linear regression imputation model; and building the piecewise linear regression imputation model using the collected statistics.

4. The method of claim 1, further comprising: determining that a predictor variable is continuous and the target variable is categorical; sorting records into data bins based on the predictor variable values; collecting, for each category of the target variable and each of the data bins, statistics comprising a number of records and a mean of the predictor variable; determining that the type of the imputation model is a robust conditional mean imputation model; and building the robust conditional mean imputation model using the collected statistics.

5. The method of claim 1, further comprising: determining that a predictor variable is categorical and the target variable is continuous; collecting, for each category of the predictor variable, statistics comprising a mean of the target variable and a variance of the target variable; determining that the type of the imputation model is a minimum z-score category imputation model; and building the minimum z-score category imputation model using the collected statistics.

6. The method of claim 1, further comprising: determining that a predictor variable is categorical and the target variable is categorical; collecting, for each category combination of the predictor variable and the target variable, statistics comprising a number of records; determining that the type of the imputation model is a conditional mode imputation model; and building the conditional mode imputation model using the collected statistics.

7. The method of claim 1, wherein software is provided as a service in a cloud environment.

8. A computer program product for imputing a missing value for each of one or more predictor variables, the computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therein, the computer readable program code, when executed by a processor of a computer, configured to perform: with each mapper from a set of mappers that receives data from a different data source, building an imputation model based on information of a target variable and that data source; with a reducer, randomly extracting validation samples from each different data source to create a global validation sample; with each mapper, scoring the imputation model built at that mapper based on the global validation sample and the imputation model built at each other mapper; with the reducer, selecting a top number of imputation models based on the scoring of each imputation model to form an ensemble model; and with each mapper, determining a type of imputation model to construct; imputing the missing value for each of the one or more predictor variables using the data from each different data source, each ensemble model, and the determined type of imputation model to generate a completed data set; and performing prediction, discovery, and interpretation of relationships between the target variable and the one or more predictor variables using the completed data set.

9. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is continuous and the target variable is continuous; sorting records into data bins based on the predictor variable values; collecting, for each of the data bins, statistics comprising a number of records, a mean of the predictor variable, a mean of the target variable, a variance of the target variable, and a covariance of the predictor variable and the target variable; determining that the type of the imputation model is a piecewise linear regression imputation model; and building the piecewise linear regression imputation model using the collected statistics.

10. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is continuous and the target variable is categorical; sorting records into data bins based on the predictor variable values; collecting, for each category of the target variable and each of the data bins, statistics comprising a number of records and a mean of the predictor variable; determining that the type of the imputation model is a robust conditional mean imputation model; and building the robust conditional mean imputation model using the collected statistics.

11. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is categorical and the target variable is continuous; collecting, for each category of the predictor variable, statistics comprising a mean of the target variable and a variance of the target variable; determining that the type of the imputation model is a minimum z-score category imputation model; and building the minimum z-score category imputation model using the collected statistics.

12. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is categorical and the target variable is categorical; collecting, for each category combination of the predictor variable and the target variable, statistics comprising a number of records; determining that the type of the imputation model is a conditional mode imputation model; and building the conditional mode imputation model using the collected statistics.

13. The computer program product of claim 8, wherein software is provid
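
Claims 6 and 12 describe a conditional mode imputation model built from record counts for each combination of predictor category and target category. The claims do not spell out how the counts are turned into an imputed value, so the sketch below assumes one plausible reading: a missing predictor value is replaced by the predictor category seen most often among records with the same target category. The function names and the use of `None` to mark a missing value are hypothetical.

```python
from collections import Counter, defaultdict


def build_conditional_mode_model(records):
    """Count (predictor, target) combinations and keep the modal predictor
    category per target category.

    records: iterable of (predictor, target) pairs; predictor is None if missing.
    """
    counts = defaultdict(Counter)
    for predictor, target in records:
        if predictor is not None:
            counts[target][predictor] += 1
    # most_common(1) returns the highest-count category; ties resolve to the
    # category that was counted first.
    return {target: c.most_common(1)[0][0] for target, c in counts.items()}


def impute(records, model):
    """Fill missing predictor values from the model.

    A record whose target category was never seen with an observed predictor
    stays missing (None), since the model has no count to draw on.
    """
    return [
        (model.get(target) if predictor is None else predictor, target)
        for predictor, target in records
    ]


records = [
    ("red", "yes"), ("red", "yes"), ("blue", "yes"),
    ("blue", "no"), (None, "yes"), (None, "no"),
]
model = build_conditional_mode_model(records)
completed = impute(records, model)
```

Only counts are needed to build the model, which fits the claim language about collecting "a number of records" per category combination; the raw records do not have to be kept once the counts exist.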

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9443194B2 cover?
Provided are techniques for imputing a missing value for each of one or more predictor variables. Data is received from one or more data sources. For each of the one or more predictor variables, an imputation model is built based on information of a target variable; a type of imputation model to construct is determined based on the one or more data sources, a measurement level of the predictor …
Who is the assignee on this patent?
Chu Yea J, Han Sier, Shyr Jing-Yun, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06N5/025. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 13, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).