Method and system for adaptively imputing sparse and missing data for predictive models

US11455284B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11455284-B2
Application number  US-201916564910-A
Country             US
Kind code           B2
Filing date         Sep 9, 2019
Priority date       Sep 16, 2016
Publication date    Sep 27, 2022
Grant date          Sep 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described is an approach that provides an adaptive solution to missing data for machine learning systems. A gradient solution is provided that is attentive to imputation needs at each of several missingness levels. This multilevel approach treats data missingness at any of multiple severity levels while utilizing, as much as possible, the actual observed data.
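
As an illustration of the multilevel idea in the abstract, the following Python sketch maps an overall missingness fraction to one of four technique tiers. The tier names follow the wording of claim 3 on this page; the numeric cut-offs, the function names, and the use of `None` to mark a missing value are assumptions made for this example only, since the page gives no thresholds. The claimed method also weighs per-pattern missingness, which this sketch leaves out.

```python
from typing import Optional, Sequence

# Hypothetical cut-offs (upper bound of overall missing fraction, technique).
# The patent page does not state numeric thresholds; these are placeholders.
TIERS = (
    (0.10, "expectation_maximization"),
    (0.30, "expectation_maximization_with_external_data"),
    (0.60, "predictive_model_values"),
    (1.00, "simulation"),
)


def overall_missingness(rows: Sequence[Sequence[Optional[float]]]) -> float:
    """Return the fraction of cells that are missing (None) across all rows."""
    total = sum(len(row) for row in rows)
    if total == 0:
        raise ValueError("no data collected")
    missing = sum(1 for row in rows for value in row if value is None)
    return missing / total


def select_technique(level: float) -> str:
    """Pick the first tier whose upper bound covers the missingness level."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("missingness level must be between 0 and 1")
    for upper_bound, technique in TIERS:
        if level <= upper_bound:
            return technique
    return TIERS[-1][1]  # unreachable given the 1.00 bound; kept for safety


if __name__ == "__main__":
    collected = [[1.0, None], [2.0, 3.0]]
    level = overall_missingness(collected)
    print(level, select_technique(level))
```

Keeping the tiers in one ordered table means the lowest tier that fits is always chosen, which matches the stated goal of relying on the observed data as much as possible before falling back to external or simulated data.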

First claim

Opening claim text (preview).

What is claimed is:

1. A method for imputing data for a learning system, comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is determined at least by identifying a percentage of the data that corresponds to the signal pattern; adaptively selecting a technique to correct the data for missing data, wherein the data is selectively corrected by a first technique for a relatively lower level of the missing data or selectively corrected by a second technique for a relatively greater level of the missing data, wherein detection of a first missingness level is addressed by the first technique by imputing the missing data using an iterative imputation technique to generate training data without using new data from an external data source, and detection of a second missingness level that is greater than the first missingness level is addressed by the second technique to generate the training data by adding the new data from the external data source; and performing model training with the training data.

2. The method of claim 1, wherein the one or more levels of missingness for the data comprise a second factor corresponding to an overall degree of missingness for the data.

3. The method of claim 1, wherein the technique is selected from a plurality of techniques that comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at the first missingness level, a second imputation technique that performs the expectation maximization with the new data from the external data source at the second missingness level, a third imputation technique that generates the training data using predicted values from a predictive model at a third missingness level, or a fourth imputation technique that performs simulation to generate the training data at a fourth missingness level.

4. The method of claim 1, wherein expectation maximization is selected as the selected technique based upon both an overall level of missing data and individual levels of missing data for signals.

5. The method of claim 4, wherein the external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system.

6. The method of claim 1, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.

7. The method of claim 1, wherein the model training generates a predictive model that is employed for health monitoring of a database system.

8. The method of claim 1, wherein the plurality of signal patterns in the data comprise at least a first pattern and a second pattern, wherein the first pattern corresponds to a first permutation of missing and not missing signals and a second pattern corresponds to a second permutation of missing and not missing signals.

9. A system for imputing data for a machine learning system, comprising: a processor; and a memory for holding programmable code, wherein the programmable code includes instructions for collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is determined at least by identifying a percentage of the data that corresponds to the signal pattern; adaptively selecting a technique to correct the data for missing data, wherein the data is selectively corrected by a first technique for a relatively lower level of the missing data or selectively corrected by a second technique for a relatively greater level of the missing data, wherein detection of a first missingness level is addressed by the first technique by imputing the missing data using an iterative imputation technique to generate training data without using new data from an external data source, and detection of a second missingness level that is greater than the first missingness level is addressed by the second technique to generate the training data by adding the new data from the external data source; and performing model training with the training data.

10. The system of claim 9, wherein the one or more levels of missingness for the data comprise a second factor corresponding to an overall degree of missingness for the data.

11. The system of claim 9, wherein the technique is selected from a plurality of techniques that comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at the first missingness level, a second imputation technique that performs the expectation maximization with the new data from the external data source at the second missingness level, a third imputation technique that generates the training data using predicted values from a predictive model at a third missingness level, or a fourth imputation technique that performs simulation to generate the training data at a fourth missingness level.

12. The system of claim 9, wherein expectation maximization is selected as the selected technique based upon both an overall level of missing data and individual levels of missing data for signals.

13. The system of claim 12, wherein the external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system.

14. The system of claim 9, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.

15. The system of claim 9, wherein the model training generates a predictive model that is employed for health monitoring of a database system.

16. A non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, executes a method comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system, wherein the one or more levels of missingness for the data at least comprise a first factor corresponding to a respective degree of missingness for each of a plurality of signal patterns in the data, wherein a signal pattern is comprised of at least one or more of a first value or one or more of a second value, the first value indicating that the data is missing and the second value indicating that the data is not missing, and a degree of missingness for a signal pattern is deter
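
The per-pattern "degree of missingness" recited in claim 1 can be pictured with a short Python sketch. It treats each collected row as a set of signals, encodes a missing signal as 1 and an observed signal as 0, and reports the percentage of rows that share each pattern. The 1/0 encoding, the use of `None` for a missing reading, and the function names are assumptions chosen for illustration; the claim only requires a first value for missing and a second value for not missing.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]
Pattern = Tuple[int, ...]


def pattern_of(row: Row) -> Pattern:
    """Encode a row as its signal pattern: 1 = missing, 0 = observed."""
    return tuple(1 if value is None else 0 for value in row)


def pattern_degrees(rows: Sequence[Row]) -> Dict[Pattern, float]:
    """Return, for each signal pattern, the percentage of rows that match it."""
    if not rows:
        raise ValueError("no data collected")
    counts = Counter(pattern_of(row) for row in rows)
    return {pattern: 100.0 * count / len(rows) for pattern, count in counts.items()}


if __name__ == "__main__":
    # Two signals per row; None marks a reading that was not collected.
    collected = [(1.0, 2.0), (None, 2.0), (None, 2.0), (1.0, None)]
    for pattern, percent in sorted(pattern_degrees(collected).items()):
        print(pattern, percent)
```

In this toy input, half of the rows lack the first signal only, so the pattern `(1, 0)` has a degree of 50 percent. A selection step could compare such per-pattern percentages, together with the overall level mentioned in claims 2 and 10, against thresholds of its own choosing.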

Assignees

Oracle Int Corp

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Machine learning · CPC title

  • characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability (for optimising operational conditions of wireless networks H04W24/02) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11455284B2 cover?
Described is an approach that provides an adaptive solution to missing data for machine learning systems. A gradient solution is provided that is attentive to imputation needs at each of several missingness levels. This multilevel approach treats data missingness at any of multiple severity levels while utilizing, as much as possible, the actual observed data.
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 27, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).