Data model processing in machine learning using a reduced set of features

US11429899B2 · US · B2

Patent metadata
Publication number: US-11429899-B2
Application number: US-202016863460-A
Country: US
Kind code: B2
Filing date: Apr 30, 2020
Priority date: Apr 30, 2020
Publication date: Aug 30, 2022
Grant date: Aug 30, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer system trains a predictive model. A plurality of subsets of features are selected from a dataset comprising a plurality of cases and controls and a plurality of features. Cases and controls are matched to select a plurality of case-control subsets for each subset of features, each case-control subset having similar values for the corresponding subset of features. For each case-control subset, a statistical significance of each feature of the plurality of features absent from the subset of features used to match the case-control subset is identified. A final subset of features is selected based on satisfying a statistical significance of each feature for the plurality of case-control subsets. A predictive model is trained using the final subset of features. Embodiments of the present invention further include a method and program product for training a predictive model in substantially the same manner described above.
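The final selection step described in the abstract can be pictured as a vote across case-control subsets. The following is a minimal sketch, not the patented implementation: it assumes the per-subset p-values have already been computed, and the function name `select_final_features`, the default thresholds, and the choice to divide by the number of subsets in which a feature was actually tested are all illustrative assumptions.

```python
from collections import Counter


def select_final_features(p_values_by_subset,
                          significance_threshold=0.05,
                          selection_threshold=0.5):
    """Pick features that are significant in enough case-control subsets.

    p_values_by_subset: list of dicts, one per case-control subset, mapping
    feature name -> p-value. Features used to match a subset are simply
    absent from that subset's dict (assumption of this sketch).
    """
    hits = Counter()    # subsets in which the feature met the significance threshold
    tested = Counter()  # subsets in which the feature was tested at all
    for p_values in p_values_by_subset:
        for feature, p in p_values.items():
            tested[feature] += 1
            if p < significance_threshold:
                hits[feature] += 1

    # Selection score as a fraction of the subsets where the feature was tested.
    scores = {feature: hits[feature] / tested[feature] for feature in tested}

    # Rank by score (highest first), breaking ties by name for determinism.
    ranked = sorted(scores, key=lambda feature: (-scores[feature], feature))
    final = [feature for feature in ranked if scores[feature] >= selection_threshold]
    return final, scores


# Hypothetical p-values for three case-control subsets.
example = [
    {"age": 0.01, "bmi": 0.20, "smoker": 0.03},
    {"age": 0.04, "bmi": 0.60},     # "smoker" was a matching feature here
    {"bmi": 0.30, "smoker": 0.02},  # "age" was a matching feature here
]
final, scores = select_final_features(example)
print(final, scores)
```

A predictive model would then be trained on only the columns named in `final`. Whether the selection score is normalized by all subsets or only by those in which the feature was tested is a design choice that the abstract does not settle.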

First claim

Opening claim text (preview).

The invention claimed is:
1. A computer-implemented method for training a predictive model comprising: selecting, from a dataset comprising a plurality of cases and controls and a plurality of features, a plurality of subsets of features; matching cases and controls to select a plurality of case-control subsets for each subset of features, each case-control subset having similar values for the corresponding subset of features; identifying, for each case-control subset, a statistical significance of each feature of the plurality of features absent from the subset of features used to match the case-control subset; selecting a final subset of features based on the statistical significance of each feature for the plurality of case-control subsets; and training a predictive model using the final subset of features.
2. The computer-implemented method of claim 1, further comprising: applying the predictive model to predict outcomes.
3. The computer-implemented method of claim 1, wherein selecting the final subset of features comprises: determining a selection score for each feature of the plurality of features, wherein the selection score corresponds to a number of case-control subsets in which the statistical significance of the feature satisfies a significance threshold value; and ranking the plurality of features by selection score to select the final subset of features having selection scores that satisfy a selection threshold value.
4. The computer-implemented method of claim 3, wherein the significance threshold value comprises a probability score of the feature.
5. The computer-implemented method of claim 3, wherein the selection threshold value comprises a percentage of case-control subsets in which the statistical significance of the feature satisfies the significance threshold value.
6. The computer-implemented method of claim 1, further comprising: evaluating the predictive model against a reference model to validate accuracy of the predictive model, wherein the reference model is trained using the dataset.
7. The computer-implemented method of claim 1, wherein each case-control subset is matched according to propensity score matching with a caliper value and a case-control ratio value.
8. A computer system for training a predictive model, the computer system comprising: one or more computer processors; one or more computer readable storage media; program instructions stored on the one or more computer readable storage media for execution by at least one of the one or more computer processors, the program instructions comprising instructions to: select, from a dataset comprising a plurality of cases and controls and a plurality of features, a plurality of subsets of features; match cases and controls to select a plurality of case-control subsets for each subset of features, each case-control subset having similar values for the corresponding subset of features; identify, for each case-control subset, a statistical significance of each feature of the plurality of features absent from the subset of features used to match the case-control subset; select a final subset of features based on the statistical significance of each feature for the plurality of case-control subsets; and train a predictive model using the final subset of features.
9. The computer system of claim 8, wherein the program instructions further comprise instructions to: apply the predictive model to predict outcomes.
10. The computer system of claim 8, wherein the program instructions to select the final subset of features comprise instructions to: determine a selection score for each feature of the plurality of features, wherein the selection score corresponds to a number of case-control subsets in which the statistical significance of the feature satisfies a significance threshold value; and rank the plurality of features by selection score to select the final subset of features having selection scores that satisfy a selection threshold value.
11. The computer system of claim 10, wherein the significance threshold value comprises a probability score of the feature.
12. The computer system of claim 10, wherein the selection threshold value comprises a percentage of case-control subsets in which the statistical significance of the feature satisfies the significance threshold value.
13. The computer system of claim 8, wherein the program instructions further comprise instructions to: evaluate the predictive model against a reference model to validate accuracy of the predictive model, wherein the reference model is trained using the dataset.
14. The computer system of claim 8, wherein each case-control subset is matched according to propensity score matching with a caliper value and a case-control ratio value.
15. A computer program product for training a predictive model, the computer program product comprising one or more computer readable storage media collectively having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: select, from a dataset comprising a plurality of cases and controls and a plurality of features, a plurality of subsets of features; match cases and controls to select a plurality of case-control subsets for each subset of features, each case-control subset having similar values for the corresponding subset of features; identify, for each case-control subset, a statistical significance of each feature of the plurality of features absent from the subset of features used to match the case-control subset; select a final subset of features based on the statistical significance of each feature for the plurality of case-control subsets; and train a predictive model using the final subset of features.
16. The computer program product of claim 15, wherein the program instructions further cause the computer to: apply the predictive model to predict outcomes.
17. The computer program product of claim 15, wherein the program instructions to select the final subset of features cause the computer to: determine a selection score for each feature of the plurality of features, wherein the selection score corresponds to a number of case-control subsets in which the statistical significance of the feature satisfies a significance threshold value; and rank the plurality of features by selection score to select the final subset of features having selection scores that satisfy a selection threshold value.
18. The computer program product of claim 17, wherein the significance threshold value comprises a probability score of the feature.
19. The computer program product of claim 17, wherein the selection threshold value comprises a percentage of case-control subsets in which the statistical significance of the feature satisfies the significance threshold value.
20. The computer program product of claim 15, wherein the program instructions further cause the computer to: evaluate the predictive model against a reference model to validate accuracy of the predictive model, wherein the reference model is trained using the dataset.
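Claims 7 and 14 mention propensity score matching with a caliper value and a case-control ratio value. As a rough illustration only, the sketch below assumes the propensity scores are already computed and uses greedy nearest-neighbour matching without replacement; the function name `match_with_caliper`, the record identifiers, and the decision to drop cases that cannot be fully matched are assumptions of this example, not details taken from the claims.

```python
def match_with_caliper(case_scores, control_scores, caliper=0.05, ratio=1):
    """Greedy nearest-neighbour matching on precomputed propensity scores.

    case_scores / control_scores: dict mapping record id -> propensity score.
    caliper: maximum allowed absolute score difference for a match.
    ratio: number of controls to match to each case.
    Returns a dict mapping case id -> list of matched control ids.
    """
    available = dict(control_scores)  # controls not yet used (no replacement)
    matches = {}
    for case_id, case_score in case_scores.items():
        # Controls within the caliper, closest first (ties broken by id).
        candidates = sorted(
            (abs(score - case_score), control_id)
            for control_id, score in available.items()
            if abs(score - case_score) <= caliper
        )
        chosen = [control_id for _, control_id in candidates[:ratio]]
        # Assumption: a case is kept only if it receives a full set of controls.
        if len(chosen) == ratio:
            matches[case_id] = chosen
            for control_id in chosen:
                del available[control_id]
    return matches


# Hypothetical propensity scores.
cases = {"c1": 0.30, "c2": 0.70}
controls = {"k1": 0.31, "k2": 0.33, "k3": 0.68, "k4": 0.90}

one_to_one = match_with_caliper(cases, controls, caliper=0.05, ratio=1)
one_to_two = match_with_caliper(cases, controls, caliper=0.05, ratio=2)
print(one_to_one)
print(one_to_two)
```

A tighter caliper or a higher ratio yields fewer but more closely matched case-control subsets, which is the trade-off these two parameters control.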

Assignees

IBM, Massachusetts Inst Technology

Inventors

Classifications

  • G06N20/00 (Primary)

    Machine learning · CPC title

  • Knowledge representation; Symbolic representation · CPC title


Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11429899B2 cover?
A computer system trains a predictive model. A plurality of subsets of features are selected from a dataset comprising a plurality of cases and controls and a plurality of features. Cases and controls are matched to select a plurality of case-control subsets for each subset of features, each case-control subset having similar values for the corresponding subset of features. For each case-contro…
Who is the assignee on this patent?
IBM, Massachusetts Inst Technology
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 30, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).