Missing value imputation for predictive models
US-9443194-B2 · Sep 13, 2016 · US
US11455284B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11455284-B2 |
| Application number | US-201916564910-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 9, 2019 |
| Priority date | Sep 16, 2016 |
| Publication date | Sep 27, 2022 |
| Grant date | Sep 27, 2022 |
Official abstract text for this publication.
Described is an approach that provides an adaptive solution to missing data for machine learning systems. A gradient solution is provided that is attentive to imputation needs at each of several missingness levels. This multilevel approach treats data missingness at any of multiple severity levels while utilizing, as much as possible, the actual observed data.
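The multilevel idea in the abstract can be illustrated with a minimal sketch: measure how much data is missing, then pick a technique for that level. The four techniques are the ones named in the claims. The numeric cut-offs, the assumption that each later technique corresponds to a higher missingness level, and the names `THRESHOLDS`, `overall_missingness`, and `select_technique` are hypothetical; the source gives no thresholds.

```python
from enum import Enum


class Technique(Enum):
    EM = "expectation maximization on collected data only"
    EM_EXTERNAL = "expectation maximization with data from an external source"
    MODEL_PREDICTION = "predicted values from a predictive model"
    SIMULATION = "simulated training data"


# Hypothetical cut-offs (fraction of missing cells); not taken from the patent.
THRESHOLDS = [
    (0.10, Technique.EM),
    (0.30, Technique.EM_EXTERNAL),
    (0.60, Technique.MODEL_PREDICTION),
]


def overall_missingness(rows):
    """Return the fraction of cells that are None across all rows."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row if value is None)
    return missing / total if total else 0.0


def select_technique(level):
    """Pick the first technique whose upper bound covers the given level."""
    for upper_bound, technique in THRESHOLDS:
        if level <= upper_bound:
            return technique
    return Technique.SIMULATION


rows = [[1.0, None, 3.0], [4.0, 5.0, 6.0]]
level = overall_missingness(rows)
print(level, select_technique(level).name)
```

With one missing cell out of six, the sample data falls into the second band of this sketch, so the external-data variant is chosen. A real system would also weigh per-signal missingness, as the claims describe.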
Opening claim text (preview).
What is claimed is:
1. A method for imputing data for a learning system, comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is determined at least by identifying a percentage of the data that corresponds to the signal pattern; adaptively selecting a technique to correct the data for missing data, wherein the data is selectively corrected by a first technique for a relatively lower level of the missing data or selectively corrected by a second technique for a relatively greater level of the missing data, wherein detection of a first missingness level is addressed by the first technique by imputing the missing data using an iterative imputation technique to generate training data without using new data from an external data source, and detection of a second missingness level that is greater than the first missingness level is addressed by the second technique to generate the training data by adding the new data from the external data source; and performing model training with the training data.
2. The method of claim 1, wherein the one or more levels of missingness for the data comprise a second factor corresponding to an overall degree of missingness for the data.
3. The method of claim 1, wherein the technique is selected from a plurality of techniques that comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at the first missingness level, a second imputation technique that performs the expectation maximization with the new data from the external data source at the second missingness level, a third imputation technique that generates the training data using predicted values from a predictive model at a third missingness level, or a fourth imputation technique that performs simulation to generate the training data at a fourth missingness level.
4. The method of claim 1, wherein expectation maximization is selected as the selected technique based upon both an overall level of missing data and individual levels of missing data for signals.
5. The method of claim 4, wherein the external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system.
6. The method of claim 1, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.
7. The method of claim 1, wherein the model training generates a predictive model that is employed for health monitoring of a database system.
8. The method of claim 1, wherein the plurality of signal patterns in the data comprise at least a first pattern and a second pattern, wherein the first pattern corresponds to a first permutation of missing and not missing signals and a second pattern corresponds to a second permutation of missing and not missing signals.
9. A system for imputing data for a machine learning system, comprising: a processor; and a memory for holding programmable code, wherein the programmable code includes instructions for collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is determined at least by identifying a percentage of the data that corresponds to the signal pattern; adaptively selecting a technique to correct the data for missing data, wherein the data is selectively corrected by a first technique for a relatively lower level of the missing data or selectively corrected by a second technique for a relatively greater level of the missing data, wherein detection of a first missingness level is addressed by the first technique by imputing the missing data using an iterative imputation technique to generate training data without using new data from an external data source, and detection of a second missingness level that is greater than the first missingness level is addressed by the second technique to generate the training data by adding the new data from the external data source; and performing model training with the training data.
10. The system of claim 9, wherein the one or more levels of missingness for the data comprise a second factor corresponding to an overall degree of missingness for the data.
11. The system of claim 9, wherein the technique is selected from a plurality of techniques that comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at the first missingness level, a second imputation technique that performs the expectation maximization with the new data from the external data source at the second missingness level, a third imputation technique that generates the training data using predicted values from a predictive model at a third missingness level, or a fourth imputation technique that performs simulation to generate the training data at a fourth missingness level.
12. The system of claim 9, wherein expectation maximization is selected as the selected technique based upon both an overall level of missing data and individual levels of missing data for signals.
13. The system of claim 12, wherein the external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system.
14. The system of claim 9, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.
15. The system of claim 9, wherein the model training generates a predictive model that is employed for health monitoring of a database system.
16. A non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, executes a method comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is deter
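The per-pattern degree of missingness recited in claim 1 can be sketched as follows. Each row is reduced to a pattern of missing and not-missing flags, and the degree for a pattern is the percentage of rows that show it. Encoding "missing" as 1 and "not missing" as 0, treating `None` as the missing marker, and the function name `pattern_degrees` are assumptions made for illustration only.

```python
from collections import Counter


def pattern_degrees(rows):
    """Map each missing/not-missing pattern to the percentage of rows showing it.

    A pattern is a tuple holding 1 where a signal is missing (None)
    and 0 where it was observed. The 1/0 encoding is an assumption.
    """
    if not rows:
        return {}
    counts = Counter(
        tuple(1 if value is None else 0 for value in row) for row in rows
    )
    return {pattern: 100.0 * n / len(rows) for pattern, n in counts.items()}


# Two signals per row; None marks a missing reading.
samples = [
    [0.7, 12.0],
    [None, 11.5],
    [None, 13.1],
    [0.9, None],
]
for pattern, percent in sorted(pattern_degrees(samples).items()):
    print(pattern, percent)
```

In this sample, the pattern `(1, 0)` (first signal missing, second observed) covers half of the rows, which is the kind of per-pattern figure the claim uses as its first factor.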
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Machine learning · CPC title
characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability (for optimising operational conditions of wireless networks H04W24/02) · CPC title