Method and system for adaptively imputing sparse and missing data for predictive models

US10409789B2 · US · B2

Patent metadata
Publication number: US-10409789-B2
Application number: US-201715707500-A
Country: US
Kind code: B2
Filing date: Sep 18, 2017
Priority date: Sep 16, 2016
Publication date: Sep 10, 2019
Grant date: Sep 10, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described is an approach that provides an adaptive solution to missing data for machine learning systems. A gradient solution is provided that is attentive to imputation needs at each of several missingness levels. This multilevel approach treats data missingness at any of multiple severity levels while utilizing, as much as possible, the actual observed data.
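As an illustration of what measuring "missingness levels" could look like, the following minimal sketch computes an overall missing fraction and a per-signal missing fraction, then maps the worst of them to a severity level. The function names, the `None`-as-missing convention, and the cut-off values are assumptions made for this example; the patent does not specify them.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]  # None marks a missing reading (assumed convention)


def overall_missingness(rows: List[Row], signals: List[str]) -> float:
    """Fraction of all cells (rows x signals) that are missing."""
    total = len(rows) * len(signals)
    if total == 0:
        return 0.0
    missing = sum(1 for r in rows for s in signals if r.get(s) is None)
    return missing / total


def signal_missingness(rows: List[Row], signals: List[str]) -> Dict[str, float]:
    """Fraction of missing readings for each individual signal."""
    n = len(rows)
    return {
        s: (sum(1 for r in rows if r.get(s) is None) / n if n else 0.0)
        for s in signals
    }


def severity_level(overall: float, per_signal: Dict[str, float],
                   thresholds=(0.1, 0.3, 0.6)) -> int:
    """Map the measurements to a level 1..4 using hypothetical cut-offs."""
    worst = max([overall, *per_signal.values()])
    for level, cutoff in enumerate(thresholds, start=1):
        if worst <= cutoff:
            return level
    return len(thresholds) + 1
```

Using the worst of the two measurements is only one possible policy; a real system might weigh the overall and per-signal factors separately.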

First claim

Opening claim text (preview).

What is claimed is:

1. A method for imputing data for a learning system, comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system; selecting, from among a plurality of imputation techniques, a selected imputation technique based at least in part upon the one or more levels of missingness for the data, wherein expectation maximization (EM) is selected as the selected imputation technique if it is determined that both an overall level of missing data and individual levels of missing data for signals are at one or more designated thresholds, and an external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system; imputing missing data using the selected imputation technique to generate training data; and performing model training with the training data.

2. The method of claim 1, wherein the one or more levels of missingness for the data comprise a first factor corresponding to an overall degree of missingness for the data, a second factor corresponding to one or more degrees of missingness for individual signals within a dataset, and a third factor corresponding to missingness degrees for different signal patterns in the data.

3. The method of claim 1, wherein the plurality of imputation techniques comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at a first level of missingness, a second imputation technique that performs the expectation maximization with external data at a second level of missingness, a third imputation technique that generates the training data using predicted values from a predictive model at a third level of missingness, or a fourth imputation technique that performs simulation to generate the training data at a fourth level of missingness.

4. The method of claim 1, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.

5. The method of claim 1, wherein the model training generates a predictive model that is employed for health monitoring of a database system.

6. A system for imputing data for a machine learning system, comprising: a processor; a memory for holding programmable code; and wherein the programmable code includes instructions for collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system; selecting, from among a plurality of imputation techniques, a selected imputation technique based at least in part upon the one or more levels of missingness for the data, wherein expectation maximization (EM) is selected as the selected imputation technique if it is determined that both an overall level of missing data and individual levels of missing data for signals are at one or more designated thresholds, and an external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system; imputing missing data using the selected imputation technique to generate training data; and performing model training with the training data.

7. The system of claim 6, wherein the one or more levels of missingness for the data comprise a first factor corresponding to an overall degree of missingness for the data, a second factor corresponding to one or more degrees of missingness for individual signals within a dataset, and a third factor corresponding to missingness degrees for different signal patterns in the data.

8. The system of claim 6, wherein the plurality of imputation techniques comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at a first level of missingness, a second imputation technique that performs the expectation maximization with external data at a second level of missingness, a third imputation technique that generates the training data using predicted values from a predictive model at a third level of missingness, or a fourth imputation technique that performs simulation to generate the training data at a fourth level of missingness.

9. The system of claim 6, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.

10. The system of claim 6, wherein the model training generates a predictive model that is employed for health monitoring of a database system.

11. A computer program product embodied on a non-transitory computer readable medium, the non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, executes a method comprising: collecting data from a monitored target system; determining one or more levels of missingness for the data collected from the monitored target system; selecting, from among a plurality of imputation techniques, a selected imputation technique based at least in part upon the one or more levels of missingness for the data, wherein expectation maximization (EM) is selected as the selected imputation technique if it is determined that both an overall level of missing data and individual levels of missing data for signals are at one or more designated thresholds, and an external data source is accessed to generate an EM seed for the expectation maximization when insufficient seed data exists within the data collected from the monitored target system; imputing missing data using the selected imputation technique to generate training data; and performing model training with the training data.

12. The computer program product of claim 11, wherein the one or more levels of missingness for the data comprise a first factor corresponding to an overall degree of missingness for the data, a second factor corresponding to one or more degrees of missingness for individual signals within a dataset, and a third factor corresponding to missingness degrees for different signal patterns in the data.

13. The computer program product of claim 11, wherein the plurality of imputation techniques comprises some or all of a first imputation technique that performs expectation maximization to impute the missing data at a first level of missingness, a second imputation technique that performs the expectation maximization with external data at a second level of missingness, a third imputation technique that generates the training data using predicted values from a predictive model at a third level of missingness, or a fourth imputation technique that performs simulation to generate the training data at a fourth level of missingness.

14. The computer program product of claim 11, wherein a second imputation technique is selected to impute the missing data when a first imputation technique does not successfully generate the missing data.

15. The computer program product of claim 11, wherein the model training generates a predictive model that is employed for health monitoring of a database system.
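The claims describe seeding expectation maximization from an external source when the collected data is too sparse, and falling back to another technique when one fails. The sketch below illustrates both ideas under strong simplifying assumptions: a single signal modelled as one Gaussian, a seed consisting of a (mean, variance) pair, and a hypothetical `min_observed` cut-off for "insufficient seed data". None of these names or choices come from the patent, and this is not the patented implementation.

```python
from statistics import fmean, pvariance
from typing import Callable, List, Optional, Sequence, Tuple

Values = List[Optional[float]]  # None marks a missing reading (assumed convention)
Seed = Tuple[float, float]      # (mean, variance) used to start EM


def choose_seed(values: Values, external: Sequence[float],
                min_observed: int = 3) -> Seed:
    """Seed from observed data, or from an external source if too sparse."""
    observed = [v for v in values if v is not None]
    source = observed if len(observed) >= min_observed else list(external)
    return fmean(source), pvariance(source)


def em_impute(values: Values, seed: Seed, iters: int = 50) -> List[float]:
    """EM for one Gaussian signal whose readings are missing at random."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("EM needs at least one observed value")
    n, m = len(values), len(values) - len(observed)
    mu, var = seed
    for _ in range(iters):
        # E-step: expected sufficient statistics of the missing readings.
        sum_x = sum(observed) + m * mu
        sum_x2 = sum(v * v for v in observed) + m * (mu * mu + var)
        # M-step: re-estimate mean and variance from the completed statistics.
        mu = sum_x / n
        var = max(sum_x2 / n - mu * mu, 0.0)
    return [mu if v is None else v for v in values]


def impute_with_fallback(values: Values,
                         techniques: Sequence[Callable[[Values], List[float]]]
                         ) -> List[float]:
    """Try each technique in order and move on when one raises ValueError."""
    for technique in techniques:
        try:
            return technique(values)
        except ValueError:
            continue
    raise RuntimeError("no imputation technique succeeded")
```

For example, `choose_seed([1.0, None, 3.0, None], external=[2.0, 2.0, 4.0])` takes its seed from the external values because only two readings are observed, and `em_impute` then fills the gaps with an estimate that converges toward the observed mean. In this one-signal setting EM adds little over mean imputation; the multi-signal case, where correlations between signals inform the estimate, is where it becomes useful.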

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

  • Ensemble learning

  • using statistical or mathematical methods

  • characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability (for optimising operational conditions of wireless networks H04W24/02)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10409789B2 cover?
Described is an approach that provides an adaptive solution to missing data for machine learning systems. A gradient solution is provided that is attentive to imputation needs at each of several missingness levels. This multilevel approach treats data missingness at any of multiple severity levels while utilizing, as much as possible, the actual observed data.
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 10, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).