Indexing based on feature importance

US2022300518A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2022300518-A1
Application number  US-202117206335-A
Country             US
Kind code           A1
Filing date         Mar 19, 2021
Priority date       Mar 19, 2021
Publication date    Sep 22, 2022
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and a computer program product are used for generating an index of a scoring payload dataset. Correlation coefficients are calculated for correlations between input data values and output data values of the machine learning model, provided by the scoring payload datasets, as well as performance data values of the processes, provided by process datasets. Features whose feature values are used as input data values are ranked according to their importance using the correlation coefficients. For the features of a set of highest-ranking features, feature value sets with feature values of the respective features are selected from the scoring payload datasets, and a database index of the selected feature value sets is generated.
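The steps named in the abstract (combine, correlate, rank, select, index) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the field names (`process_id`, `output`, `performance`, `amount`, `age`, `noise`), the toy data, the use of Pearson correlation, the rule of scoring a feature by the larger of its two absolute coefficients, and the choice of SQLite as the database are all hypothetical.

```python
import math
import sqlite3

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def combine(payloads, processes):
    """Join scoring payloads and process datasets assigned to the same process."""
    perf = {p["process_id"]: p["performance"] for p in processes}
    return [dict(r, performance=perf[r["process_id"]])
            for r in payloads if r["process_id"] in perf]

def rank_features(combined, features):
    """Rank features by correlation with model output and performance.

    Assumption: a feature's score is the larger of the two absolute coefficients.
    """
    outputs = [r["output"] for r in combined]
    perfs = [r["performance"] for r in combined]
    scores = {}
    for f in features:
        values = [r[f] for r in combined]
        scores[f] = max(abs(pearson(values, outputs)), abs(pearson(values, perfs)))
    return sorted(features, key=scores.get, reverse=True)

def build_index(payloads, top_features):
    """Store only the selected feature values and index them (SQLite is an assumption)."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{f}" REAL' for f in top_features)
    conn.execute(f"CREATE TABLE payload (process_id INTEGER, {cols})")
    marks = ", ".join("?" for _ in range(len(top_features) + 1))
    conn.executemany(
        f"INSERT INTO payload VALUES ({marks})",
        [[r["process_id"]] + [r[f] for f in top_features] for r in payloads],
    )
    indexed = ", ".join(f'"{f}"' for f in top_features)
    conn.execute(f"CREATE INDEX idx_top_features ON payload ({indexed})")
    return conn

# Hypothetical toy data: four scored processes with three input features each.
payloads = [
    {"process_id": 1, "amount": 10.0, "age": 30.0, "noise": 5.0, "output": 0.1},
    {"process_id": 2, "amount": 20.0, "age": 25.0, "noise": 1.0, "output": 0.2},
    {"process_id": 3, "amount": 30.0, "age": 40.0, "noise": 4.0, "output": 0.3},
    {"process_id": 4, "amount": 40.0, "age": 35.0, "noise": 2.0, "output": 0.4},
]
processes = [{"process_id": i, "performance": p}
             for i, p in [(1, 1.0), (2, 2.0), (3, 2.5), (4, 4.0)]]

combined = combine(payloads, processes)
ranked = rank_features(combined, ["amount", "age", "noise"])
top_features = ranked[:2]          # the set of highest-ranking features
conn = build_index(payloads, top_features)
```

A later analysis could then search on the indexed columns, for example `conn.execute("SELECT process_id FROM payload WHERE amount > 25")`. Indexing only the top-ranked features keeps the index small, which appears to be the motivation behind ranking before indexing.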

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating an index of a scoring payload dataset, the method comprising: providing a set of scoring payload datasets; providing a set of process datasets, the process datasets being assigned to processes of a plurality of processes, the process datasets comprising performance data values providing performance measures for the processes to which the process datasets are assigned; combining the provided scoring payload datasets and the provided process datasets assigned to a same process to provide a set of combined datasets; calculating correlation coefficients for correlations between higher features of a first set of features and output data values; ranking the features of the first set of features using the correlation coefficients, wherein the higher the features are ranked, the larger the correlation coefficients calculated for the features are; selecting a set of highest-ranking features; identifying the features of the set of highest-ranking features feature value sets from the scoring payload datasets, the selected features value sets comprising the feature values of the scoring payload datasets assigned to the features of the set of highest-ranking features; and generating a database index of the identified feature value sets.

2. The method of claim 1, further comprising assigning the scoring payload datasets to processes of the plurality of processes, wherein: the method is executed at runtime; the machine learning model is trained to predict process results for the processes of the plurality of processes; and the scoring payload datasets comprise: first sets of first feature values provided to the machine learning model as input data values for predicting process results of the processes to which the scoring payload datasets are assigned, the first feature values being assigned to features of the first set of features; and output data values received from the machine learning model as output in response to providing the first sets of first feature values of the scoring payload datasets as input, the output data values of the scoring payload datasets describing the process results predicted for the processes to which the scoring payload datasets are assigned.

3. The method of claim 1: further comprising pre-processing a dataset chosen from the group consisting of the scoring payload datasets, the process datasets, and the combined datasets; and wherein the pre-processing comprises converting non-numerical data values comprised by the combined datasets to numerical data values.

4. The method of claim 1, further comprising splitting the set of combined datasets into batches according to a classification of the combined datasets, the batches comprising subsets of the combined datasets with combined datasets assigned to a same class, wherein the calculating of the correlation coefficients, the ranking of the features of the first set of features, the selecting of the set of highest-ranking features, the selecting of the feature value sets, and the generating of the database index are performed batchwise.

5. The method of claim 4, wherein the batchwise performance is executed for a plurality of the batches in parallel.

6. The method of claim 4, wherein the batchwise performance is executed subsequently for one batch after another.

7. The method of claim 4, wherein the first feature values are used for the classification of the combined datasets.

8. The method of claim 1, wherein the processed datasets further comprise second sets of second feature values, the second feature values being assigned to features of a second set of features characterizing the processes to which the process datasets are assigned.

9. The method of claim 8, wherein the second feature values are used for the classification of the combined datasets.

10. The method of claim 1: wherein the correlation coefficients are further calculated for correlations between the features of the first set of features and the performance data values using the combined datasets; wherein the correlation coefficients are calculated as part of a correlation matrix, the correlation matrix being calculated using the second feature values in addition to the first feature values, the output data values, and the performance data values; and further comprising extracting the correlation coefficients for the correlations between the features of the first set of features and the output data values as well as the correlation coefficients for the correlations between the features of the first set of features and the performance data values from the correlation matrix for the ranking of the features of the first set of features.

11. The method of claim 1, further comprising displaying the correlation coefficients of the selected set of highest-ranking features.

12. The method of claim 1, further comprising storing the correlation coefficients of the selected set of highest-ranking features.

13. The method of claim 1, wherein the database index further indexes the correlation coefficients of the selected set of highest-ranking features.

14. The method of claim 1, further comprising executing a data analysis with the selected feature value sets, the data analysis comprising executing one or more searches using the database index.

15. A computer program product for selecting feature value sets from a set of scoring payload datasets of a machine learning model for indexing, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the scoring payload datasets being assigned to processes of a plurality of processes, the machine learning model being trained to predict process results for the processes of the plurality of processes, the scoring payload datasets comprising first sets of first feature values provided to the machine learning model as input data values for predicting process results of the processes to which the scoring payload datasets are assigned, the first feature values being assigned to features of a first set of features, the scoring payload datasets further comprising output data values received from the machine learning model as output in response to providing the first sets of first feature values of the scoring payload datasets as input, the output data values of the scoring payload datasets describing the process results predicted for the processes to which the scoring payload datasets are assigned, the program instructions being executable by a processor of a computer system to cause the computer system to: provide the set of scoring payload datasets; provide a set of process datasets, the process datasets being assigned to the processes of the plurality of processes, the process datasets comprising performance data values providing performance measures for the processes to which the process datasets are assigned; combine provided scoring payload datasets and provided process datasets assigned to the same process to provide a set of combined datasets; calculate correlation coefficients for correlations between the features of the first set of features and the output data values as well as correlations between the features of the first set of features and the performance data values using the combined datasets; rank the features of the first set of features according to their importance using the correlation coefficients, wherein the features are ranked the higher, the larger the correlation coefficients calculated for the features are; select a set of highest-ranking fea
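The pre-processing of claim 3 and the batchwise ranking of claims 4 and 6 can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the field names (`region`, `channel`, `amount`, `output`), the toy rows, the integer coding of categories (one-hot encoding would be another option), and the use of Pearson correlation against the model output only.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def encode_categories(rows, column):
    """Convert non-numerical values in `column` to numeric codes.

    Assumption: codes are assigned in order of first appearance.
    """
    codes = {}
    encoded = []
    for row in rows:
        code = codes.setdefault(row[column], len(codes))
        encoded.append(dict(row, **{column: float(code)}))
    return encoded, codes

def split_batches(rows, class_key):
    """Group combined datasets into batches that share the same class."""
    batches = defaultdict(list)
    for row in rows:
        batches[row[class_key]].append(row)
    return dict(batches)

def top_features_per_batch(batches, features, k):
    """Rank features within each batch, one batch after another."""
    result = {}
    for name, batch in batches.items():
        outputs = [r["output"] for r in batch]
        scores = {f: abs(pearson([r[f] for r in batch], outputs)) for f in features}
        result[name] = sorted(features, key=scores.get, reverse=True)[:k]
    return result

# Hypothetical combined datasets; "region" serves as the classification.
rows = [
    {"region": "EU", "channel": "web", "amount": 1.0, "output": 0.1},
    {"region": "EU", "channel": "app", "amount": 2.0, "output": 0.2},
    {"region": "EU", "channel": "web", "amount": 3.0, "output": 0.3},
    {"region": "EU", "channel": "app", "amount": 4.0, "output": 0.4},
    {"region": "US", "channel": "web", "amount": 3.0, "output": 0.1},
    {"region": "US", "channel": "web", "amount": 1.0, "output": 0.1},
    {"region": "US", "channel": "app", "amount": 2.0, "output": 0.9},
    {"region": "US", "channel": "app", "amount": 4.0, "output": 0.9},
]

encoded, channel_codes = encode_categories(rows, "channel")
batches = split_batches(encoded, "region")
top = top_features_per_batch(batches, ["channel", "amount"], k=1)
```

With this toy data the highest-ranking feature differs between batches, which is the situation in which a per-batch index would differ from a global one. Because the batches do not depend on each other, the loop in `top_features_per_batch` could also be run in parallel, as claim 5 describes.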

Assignees

Inventors

Classifications

  • Indexing structures · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Column-oriented storage; Management thereof · CPC title

  • Tablespace storage structures; Management thereof · CPC title

  • using ranking · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022300518A1 cover?
A method and a computer program product are used for generating an index of a scoring payload dataset. Correlation coefficients for correlations between input data values and output data values of the machine learning model provided by the scoring payload datasets as well as performance data values of the processes provided by process datasets are calculated. Features of which feature values are us…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 22, 2022 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).