Method and system for cleansing training data for predictive models

US10909095B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10909095-B2
Application number  US-201715707417-A
Country             US
Kind code           B2
Filing date         Sep 18, 2017
Priority date       Sep 16, 2016
Publication date    Feb 2, 2021
Grant date          Feb 2, 2021

Abstract

Official abstract text for this publication.

Described is an improved approach to implement selection of training data for machine learning, by presenting a designated set of specific data indicators where these data indicators correspond to metrics that end users are familiar with and are easily understood by ordinary users and DBAs within their knowledge domain. Selection of these indicators would correlate automatically to selection of a corresponding set of other metrics/signals that are less understandable to an ordinary user. Additional analysis of the selected data can then be performed to identify and correct any statistical problems with the selected training data.
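
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the metric names, the `GROUPING` table, the `CRITERIA` value ranges, and both helper functions are hypothetical and chosen only to show how a user's choice of a familiar indicator might pull in related signals that then take part in filtering.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical grouping information: a user-facing indicator mapped to
# related lower-level signals that the user never sees.
GROUPING: Dict[str, Set[str]] = {
    "cpu_utilization": {"run_queue_length"},
}

# Hypothetical filter criteria: acceptable (low, high) range per metric.
CRITERIA: Dict[str, Tuple[float, float]] = {
    "cpu_utilization": (0.0, 80.0),
    "run_queue_length": (0.0, 4.0),
}


def expand_selection(selected: Iterable[str],
                     grouping: Dict[str, Set[str]]) -> Set[str]:
    """Return the user's metrics plus any related metrics from the grouping."""
    chosen = set(selected)
    expanded = set(chosen)
    for name in chosen:
        expanded |= grouping.get(name, set())
    return expanded


def filter_training_data(records: List[dict],
                         metrics: Set[str],
                         criteria: Dict[str, Tuple[float, float]]) -> List[dict]:
    """Keep records whose values satisfy the criteria for every chosen metric."""
    kept = []
    for rec in records:
        in_range = all(
            low <= rec[m] <= high
            for m, (low, high) in criteria.items()
            if m in metrics and m in rec
        )
        if in_range:
            kept.append(rec)
    return kept


records = [
    {"timestamp": 1, "cpu_utilization": 40.0, "run_queue_length": 2.0},
    {"timestamp": 2, "cpu_utilization": 50.0, "run_queue_length": 9.0},
    {"timestamp": 3, "cpu_utilization": 95.0, "run_queue_length": 1.0},
]

metrics = expand_selection(["cpu_utilization"], GROUPING)
training_data = filter_training_data(records, metrics, CRITERIA)
```

In this sketch the second record is dropped because of `run_queue_length`, a signal the user did not select, which is the effect the abstract attributes to the automatic correlation of indicators with less familiar metrics.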

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating training data for a machine learning system, comprising: generating training data for machine learning, wherein the training data is generated at least by: collecting data pertaining to an operating state of a monitored target system; receiving a selection by a user of one or more metrics or signals corresponding to the data; determining one or more additional metrics or signals pertaining to the data based at least in part upon grouping information that correlates the one or more metrics or signals that have been selected by the user with the one or more additional metrics or signals that were not presented to the user to select into the selection; and filtering the data collected from the monitored target system into the training data based at least in part upon one or more filter criteria that correspond to both the one or more metrics or signals selected by the user and the one or more additional metrics or signals not selected by the user; and performing model training with the training data.

2. The method of claim 1, further comprising: analyzing the one or more metrics or signals and the additional metrics or signals to determine whether a potential statistical problem exists in the training data when the one or more metrics or signals are applied as one or more datapoints for the training data, wherein the grouping information is included in the one or more metrics or signals.

3. The method of claim 2, further comprising: selecting a timeframe for the one or more metrics or signals; selecting a value range for the one or more metric or signals; identifying a set of datapoints that corresponds to the timeframe and the value range, wherein the set of datapoints is analyzed to identify the potential statistical problem that is determined to exist in the training data; and correcting the potential statistical problem at least by changing the set of datapoints for the training data.

4. The method of claim 3, wherein the potential statistical problem is corrected by applying at least one of: accepting the potential statistical problem; performing another iteration for selecting at least one of the timeframe or the value range; applying a prioritization or weighting to the one or more metrics or signals when identifying the set of datapoints; or receiving user expansion of the set of datapoints.

5. The method of claim 1, wherein the one or more additional metrics or signals are correlated to the one or more metrics or signals at least by identifying a grouping field within the data collected from the monitored target system, wherein the grouping field comprises information that identifies one or more related metrics or signals.

6. The method of claim 1, wherein at least one set of datapoints for the training data is expanded by at least one of a set of preceding datapoints or a set of trailing datapoints.

7. The method of claim 1, wherein a predictive model is generated from the model training with the training data, the predictive model being applied to monitor health of a clustered database system.

8. The method of claim 1, wherein the training data is merged with data for a second target system, and a predictive model is generated for the second target system using merged data from both the monitored target system and the second target system.

9. A system for generating training data for a machine learning system, comprising: a processor; and a memory for holding programmable code, wherein the programmable code includes instructions for executing a set of acts by the processor, the set of acts comprising: generating training data for machine learning, wherein the training data is generated at least by: collecting data pertaining to an operating state of a monitored target system; receiving a selection by a user of one or more metrics or signals corresponding to the data; determining one or more additional metrics or signals pertaining to the data based at least in part upon grouping information that correlates the one or more metrics or signals that have been selected by the user with the one or more additional metrics or signals that were not presented to the user to select into selection; filtering the data collected from the monitored target system into the training data based at least in part upon one or more filter criteria that correspond to both the one or more metrics or signals selected by the user and the one or more additional metrics or signals not selected by the user; and performing model training with the training data.

10. The system of claim 9, wherein the programmable code further includes instructions for analyzing the one or more metrics or signals and the one or more additional metrics or signals to determine whether a potential statistical problem exists in the training data when the one or more metrics or signals are applied as datapoints for the training data.

11. The system of claim 10, wherein the programmable code further includes instructions for: selecting a timeframe for the one or more metrics or signals; selecting a value range for the one or more metrics or signals; identifying a set of datapoints that corresponds to the timeframe and the value range, wherein the set of datapoints is analyzed to identify the potential statistical problem that is determined to exist in the training data; and correcting the potential statistical problem at least by changing the set of datapoints for the training data.

12. The system of claim 11, wherein the potential statistical problem is corrected by applying at least one of: accepting the potential statistical problem; performing another iteration for selecting at least one of the timeframe or the value range; applying a prioritization or weighting to the one or more metrics or signals when identifying the set of datapoints; or receiving user expansion of the set of datapoints.

13. The system of claim 9, wherein the one or more additional metrics or signals are correlated to the one or more metrics or signals at least by identifying a grouping field within the data collected from the monitored target system, wherein the grouping field comprises information that identifies one or more related metrics or signals.

14. The system of claim 9, wherein at least one set of datapoints for the training data is expanded by at least one of a set of preceding datapoints or a set of trailing datapoints.

15. The system of claim 9, wherein a predictive model is generated from the model training with the training data, the predictive model being applied to monitor health of a clustered database system.

16. The system of claim 9, wherein the training data is merged with data for a second target system, and a predictive model is generated for the second target system using merged data from both the monitored target system and the second target system.

17. A computer program product embodied on a non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, causes the processor to execute a set of acts, the set of acts comprising: generating training data for machine learning, wherein the training data is generated at least by: collecting data pertaining to an operating state of a monitored target system; receiving a selection by a user of one or more metrics or signals corresponding to the data; determining one or more additional metrics or signals pertaining to the data based at least in part upon grouping information that correlates the one or more metrics or signals that have been
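
The datapoint selection in claims 3 and 6 (and their system counterparts) can be pictured with a short sketch. The function names, the `(timestamp, value)` sample layout, and the window sizes below are assumptions made for illustration; the claims do not prescribe any particular data structure or window length.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp, metric value); hypothetical layout


def select_indices(samples: List[Sample],
                   timeframe: Tuple[int, int],
                   value_range: Tuple[float, float]) -> List[int]:
    """Indices of samples inside both the timeframe and the value range."""
    t_start, t_end = timeframe
    low, high = value_range
    return [
        i for i, (t, v) in enumerate(samples)
        if t_start <= t <= t_end and low <= v <= high
    ]


def expand_indices(indices: List[int],
                   n_samples: int,
                   preceding: int = 0,
                   trailing: int = 0) -> List[int]:
    """Add up to `preceding` earlier and `trailing` later datapoints per index."""
    expanded = set()
    for i in indices:
        start = max(0, i - preceding)
        stop = min(n_samples - 1, i + trailing)
        expanded.update(range(start, stop + 1))
    return sorted(expanded)


samples = [(0, 10.0), (1, 12.0), (2, 55.0), (3, 60.0),
           (4, 20.0), (5, 15.0), (6, 11.0), (7, 9.0)]

selected = select_indices(samples, timeframe=(0, 7), value_range=(50.0, 70.0))
expanded = expand_indices(selected, len(samples), preceding=1, trailing=1)
```

Here the value range picks out the two samples at indices 2 and 3, and the expansion adds one datapoint on each side, so the resulting set also covers the lead-up to and the recovery from the selected interval.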

Assignees

Oracle Int Corp

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

  • characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability (for optimising operational conditions of wireless networks H04W24/02)

  • using logs of notifications; Post-processing of notifications

  • G06F16/215 (Primary)

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 2, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).