US9443194B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9443194-B2 |
| Application number | US-201213445796-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 12, 2012 |
| Priority date | Feb 23, 2012 |
| Publication date | Sep 13, 2016 |
| Grant date | Sep 13, 2016 |
Official abstract text for this publication.
Provided are techniques for imputing a missing value for each of one or more predictor variables. Data is received from one or more data sources. For each of the one or more predictor variables, an imputation model is built based on information of a target variable; a type of imputation model to construct is determined based on the one or more data sources, a measurement level of the predictor variable, and a measurement level of the target variable; and the determined type of imputation model is constructed using basic statistics of the predictor variable and the target variable. The missing value is imputed for each of the one or more predictor variables using the data from the one or more data sources and one or more built imputation models to generate a completed data set.
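The abstract says the imputation model type depends on the measurement levels of the predictor and the target, and dependent claims 3 to 6 name one model type for each pairing of continuous and categorical. The lookup below is a minimal sketch of that pairing only. The names `Level`, `MODEL_TYPES`, and `select_model_type` are hypothetical, and the sketch ignores the data sources, which the abstract also lists as an input to the decision.

```python
from enum import Enum


class Level(Enum):
    """Measurement levels named in claim 2."""
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


# (predictor level, target level) -> model type, following claims 3 to 6.
MODEL_TYPES = {
    (Level.CONTINUOUS, Level.CONTINUOUS): "piecewise linear regression",
    (Level.CONTINUOUS, Level.CATEGORICAL): "robust conditional mean",
    (Level.CATEGORICAL, Level.CONTINUOUS): "minimum z-score category",
    (Level.CATEGORICAL, Level.CATEGORICAL): "conditional mode",
}


def select_model_type(predictor_level: Level, target_level: Level) -> str:
    """Return the imputation model type for one predictor/target pairing."""
    return MODEL_TYPES[(predictor_level, target_level)]


if __name__ == "__main__":
    print(select_model_type(Level.CATEGORICAL, Level.CONTINUOUS))
```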
Opening claim text (preview).
The invention claimed is:
1. A method for imputing a missing value for each of one or more predictor variables, comprising: with each mapper from a set of mappers that receives data from a different data source, building an imputation model based on information of a target variable and that data source; with a reducer, randomly extracting validation samples from each different data source to create a global validation sample; with each mapper, scoring the imputation model built at that mapper based on the global validation sample and the imputation model built at each other mapper; with the reducer, selecting a top number of imputation models based on the scoring of each imputation model to form an ensemble model; and with each mapper, determining a type of imputation model to construct; imputing the missing value for each of the one or more predictor variables using the data from each different data source, each ensemble model, and the determined type of imputation model to generate a completed data set; and performing prediction, discovery, and interpretation of relationships between the target variable and the one or more predictor variables using the completed data set.
2. The method of claim 1, wherein a measurement level comprises one of continuous and categorical.
3. The method of claim 1, further comprising: determining that a predictor variable is continuous and the target variable is continuous; sorting records into data bins based on the predictor variable values; collecting, for each of the data bins, statistics comprising a number of records, a mean of the predictor variable, a mean of the target variable, a variance of the target variable, and a covariance of the predictor variable and the target variable; determining that the type of the imputation model is a piecewise linear regression imputation model; and building the piecewise linear regression imputation model using the collected statistics.
4. The method of claim 1, further comprising: determining that a predictor variable is continuous and the target variable is categorical; sorting records into data bins based on the predictor variable values; collecting, for each category of the target variable and each of the data bins, statistics comprising a number of records and a mean of the predictor variable; determining that the type of the imputation model is a robust conditional mean imputation model; and building the robust conditional mean imputation model using the collected statistics.
5. The method of claim 1, further comprising: determining that a predictor variable is categorical and the target variable is continuous; collecting, for each category of the predictor variable, statistics comprising a mean of the target variable and a variance of the target variable; determining that the type of the imputation model is a minimum z-score category imputation model; and building the minimum z-score category imputation model using the collected statistics.
6. The method of claim 1, further comprising: determining that a predictor variable is categorical and the target variable is categorical; collecting, for each category combination of the predictor variable and the target variable, statistics comprising a number of records; determining that the type of the imputation model is a conditional mode imputation model; and building the conditional mode imputation model using the collected statistics.
7. The method of claim 1, wherein software is provided as a service in a cloud environment.
8. A computer program product for imputing a missing value for each of one or more predictor variables, the computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therein, the computer readable program code, when executed by a processor of a computer, configured to perform: with each mapper from a set of mappers that receives data from a different data source, building an imputation model based on information of a target variable and that data source; with a reducer, randomly extracting validation samples from each different data source to create a global validation sample; with each mapper, scoring the imputation model built at that mapper based on the global validation sample and the imputation model built at each other mapper; with the reducer, selecting a top number of imputation models based on the scoring of each imputation model to form an ensemble model; and with each mapper, determining a type of imputation model to construct; imputing the missing value for each of the one or more predictor variables using the data from each different data source, each ensemble model, and the determined type of imputation model to generate a completed data set; and performing prediction, discovery, and interpretation of relationships between the target variable and the one or more predictor variables using the completed data set.
9. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is continuous and the target variable is continuous; sorting records into data bins based on the predictor variable values; collecting, for each of the data bins, statistics comprising a number of records, a mean of the predictor variable, a mean of the target variable, a variance of the target variable, and a covariance of the predictor variable and the target variable; determining that the type of the imputation model is a piecewise linear regression imputation model; and building the piecewise linear regression imputation model using the collected statistics.
10. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is continuous and the target variable is categorical; sorting records into data bins based on the predictor variable values; collecting, for each category of the target variable and each of the data bins, statistics comprising a number of records and a mean of the predictor variable; determining that the type of the imputation model is a robust conditional mean imputation model; and building the robust conditional mean imputation model using the collected statistics.
11. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is categorical and the target variable is continuous; collecting, for each category of the predictor variable, statistics comprising a mean of the target variable and a variance of the target variable; determining that the type of the imputation model is a minimum z-score category imputation model; and building the minimum z-score category imputation model using the collected statistics.
12. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform: determining that a predictor variable is categorical and the target variable is categorical; collecting, for each category combination of the predictor variable and the target variable, statistics comprising a number of records; determining that the type of the imputation model is a conditional mode imputation model; and building the conditional mode imputation model using the collected statistics.
13. The computer program product of claim 8, wherein software is provid
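The two categorical-predictor models in the claims are built from simple per-category statistics, which makes them easy to illustrate. The Python sketch below is one possible reading, not the patented implementation: it assumes the conditional mode model imputes the most frequent predictor category among records sharing the target category, and that the minimum z-score model imputes the predictor category whose target mean is closest to the record's target value in standard-deviation units. The claims list the statistics but not these decision rules, so the rules, all function names, and the sample data are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean, pvariance


def build_conditional_mode(records):
    """Count records per (predictor, target) category pair and keep,
    for each target category, the most frequent predictor category.
    Records whose predictor is None (missing) are skipped."""
    counts = defaultdict(Counter)
    for predictor, target in records:
        if predictor is not None:
            counts[target][predictor] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}


def build_min_zscore(records):
    """Collect the mean and variance of the continuous target
    for each category of the predictor."""
    by_category = defaultdict(list)
    for predictor, target in records:
        if predictor is not None:
            by_category[predictor].append(target)
    return {p: (mean(ys), pvariance(ys)) for p, ys in by_category.items()}


def impute_min_zscore(model, target_value):
    """Assumed rule: pick the category with the smallest |z| for this target."""
    def zscore(stats):
        mu, var = stats
        if var == 0:
            return 0.0 if target_value == mu else float("inf")
        return abs(target_value - mu) / var ** 0.5
    return min(model, key=lambda category: zscore(model[category]))


if __name__ == "__main__":
    # Hypothetical data: (predictor category, target); None marks a missing predictor.
    colour_vs_label = [("red", "yes"), ("red", "yes"), ("blue", "yes"),
                       ("blue", "no"), (None, "yes")]
    mode_model = build_conditional_mode(colour_vs_label)
    print(mode_model["yes"])

    group_vs_score = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0), (None, 11.0)]
    z_model = build_min_zscore(group_vs_score)
    print(impute_min_zscore(z_model, 11.0))
```

Both builders need only counts, means, and variances, so per-source partial results could in principle be computed by separate mappers; how the patent combines them across sources is not shown here.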