Intelligent scoring of missing data records

US12346784B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12346784-B2
Application number  US-202017022734-A
Country             US
Kind code           B2
Filing date         Sep 16, 2020
Priority date       Sep 16, 2020
Publication date    Jul 1, 2025
Grant date          Jul 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One or more computer processors group a plurality of predictors contained in training data into a plurality of predictor groups. The one or more computer processors create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups. The one or more computer processors create a cluster model for each created sample set in the plurality of created sample sets. The one or more computer processors generate a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets.
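The flow described in the abstract can be sketched roughly as below. This is an illustrative reading, not the patented implementation: the function names (`kmeans`, `train`, `score_record`), the tiny k-means used as the "cluster model", the assumption that training rows are complete, and the choice to average nearest-center distances over every sample set whose predictors are all present in the record are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iters=20):
    """Tiny k-means (hypothetical stand-in for the patent's cluster model)."""
    centers = [list(p) for p in points[:k]]  # simple init: first k points
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centers

def train(rows, groups, per_set, k, rng):
    """Build one sample set per predictor group and fit one model per set."""
    models = {}
    for group in groups:
        # Sample set: predictors drawn from the respective predictor group.
        sample_set = tuple(sorted(rng.sample(group, min(per_set, len(group)))))
        points = [[row[p] for p in sample_set] for row in rows]
        models[sample_set] = kmeans(points, k)
    return models

def score_record(record, models):
    """Score a record (None marks a missing value) using only usable sets."""
    usable = [s for s in models if all(record.get(p) is not None for p in s)]
    if not usable:
        return None  # no sample set avoids the record's missing predictors
    distances = []
    for sample_set in usable:
        vector = [record[p] for p in sample_set]
        distances.append(min(math.dist(vector, c) for c in models[sample_set]))
    # Assumed ensemble rule: mean distance to the nearest cluster center.
    return sum(distances) / len(distances)

rows = [
    {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2},
    {"a": 0.2, "b": 0.0, "c": 5.1, "d": 5.0},
    {"a": 9.8, "b": 10.0, "c": 1.0, "d": 0.9},
    {"a": 10.0, "b": 9.9, "c": 1.1, "d": 1.0},
]
groups = [["a", "b"], ["c", "d"]]  # predictor groups, assumed given here
models = train(rows, groups, per_set=2, k=2, rng=random.Random(0))

record = {"a": 0.1, "b": 0.05, "c": None, "d": 5.1}  # "c" is missing
score = score_record(record, models)
print(score)
```

In this toy data the record is missing `c`, so only the sample set built from `a` and `b` is usable, and the score reflects how close the record sits to one of that model's cluster centers.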

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: monitoring, by one or more computer processors, a database server for training data with one or more missing values; grouping, by one or more computer processors, a plurality of predictors contained in the training data into a plurality of predictor groups, wherein a number of predictors in the plurality of predictors associated with records with the one or more missing values is less than a square root of a total number of predictors; creating, by one or more computer processors, a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups; training, by one or more computer processors, a cluster model for each created sample set in the plurality of created sample sets; and generating, by one or more computer processors, a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprising: responsive to a created top sample set, generating, by one or more computer processors, the score utilizing an ensemble score defined by a distance between a formed vector to each respective center of each cluster associated with each sample set in the top sample set.

2. The computer-implemented method of claim 1, wherein grouping the plurality of predictors contained in the training data into the plurality of predictor groups, comprises: creating, by one or more computer processors, the plurality of predictor groups; randomly assigning, by one or more computer processors, a predictor in the plurality of predictors to each created predictor group until each predictor group in the plurality of predictor groups has at least one assigned predictor; and assigning, by one or more computer processors, each remaining predictor in the plurality of predictors into a respective predictor group by utilizing one or more correlations between each remaining predictor in the plurality of predictors and each predictor group in the plurality of predictors.

3. The computer-implemented method of claim 1, wherein generating the score for the record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprises: reducing, by one or more computer processors, the record with one or more missing values into three categories: suitable, inexact suitable, and not suitable based on one or more relationships between a record with one or more missing values, each sample set in the plurality of sample sets, and associated cluster models.

4. The computer-implemented method of claim 3, further comprising: calculating, by one or more computer processors, a cluster center vector for each cluster model associated with each created sample set in the plurality of created sample sets.

5. The computer-implemented method of claim 4, further comprising: creating, by one or more computer processors, the top sample set from the plurality of samples sets based the category of the reduced record with one or more missing values.

6. The computer-implemented method of claim 5, further comprising: ensemble scoring, by one or more computer processors, the record with one or more missing values utilizing a calculated distance between a formed vector to each calculated cluster center associated with each cluster model associated with each sample set in the top sample set, wherein the formed vector represents the record with one or more missing values.

7. The computer-implemented method of claim 6, further comprising: generating, by one or more computer processors, the score for the record with one or more missing values utilizing a correlation distance between the formed vector and each associated cluster center in the top sample set as a continuous value.

8. The computer-implemented method of claim 6, further comprising: generating, by one or more computer processors, the score for the record with one or more missing values utilizing a correlation distance between the formed vector and each associated cluster center in the top sample set as a weight in a categorical scoring process.

9. A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising: program instructions to monitor a database server for training data with one or more missing values; program instructions to group a plurality of predictors contained in the training data into a plurality of predictor groups, wherein a number of predictors in the plurality of predictors associated with records with the one or more missing values is less than a square root of a total number of predictors; program instructions to create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups; program instructions to train a cluster model for each created sample set in the plurality of created sample sets; and program instructions to generate a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprising: program instructions to responsive to a created top sample set, generate the score utilizing an ensemble score defined by a distance between a formed vector to each respective center of each cluster associated with each sample set in the top sample set.

10. The computer program product of claim 9, wherein the program instructions, to generate the score for the record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprise: program instructions to reduce the record with one or more missing values into three categories: suitable, inexact suitable, and not suitable based on one or more relationships between a record with one or more missing values, each sample set in the plurality of sample sets, and associated cluster models.

11. The computer program product of claim 10, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to calculate a cluster center vector for each cluster model associated with each created sample set in the plurality of created sample sets.

12. The computer program product of claim 11, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to create the top sample set from the plurality of samples sets based the category of the reduced record with one or more missing values.

13. The computer program product of claim 12, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to ensemble score the record with one or more missing values utilizing a calculated distance between a formed vector to each calculated cluster center associated with each cluster model associated with each sample set in the top sample set, wherein the formed vector represents the record with one or more missing values.

14. The computer program product of claim 13, wherein the program instructions, stored on the o
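The grouping step recited in claim 2 (seed each predictor group with a randomly assigned predictor, then place the remaining predictors using correlations) might look something like the sketch below. The claim does not say how the correlations are combined or whether a predictor joins the most or least correlated group, so the mean absolute Pearson correlation and the "join the most correlated group" rule are assumptions, as are the names `pearson` and `group_predictors` and the toy columns.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Treat a constant column as uncorrelated with everything.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0

def group_predictors(columns, n_groups, rng):
    """columns maps predictor name -> list of values (assumed complete)."""
    names = list(columns)
    rng.shuffle(names)
    # Step 1: randomly assign one predictor to each group as a seed.
    groups = [[name] for name in names[:n_groups]]
    # Step 2: each remaining predictor joins the group it correlates with
    # most strongly (assumed rule: mean absolute Pearson correlation).
    for name in names[n_groups:]:
        def affinity(group):
            total = sum(abs(pearson(columns[name], columns[m])) for m in group)
            return total / len(group)
        max(groups, key=affinity).append(name)
    return groups

columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],
    "b2": [10.0, 2.0, 8.0, 4.0, 6.5],
}
groups = group_predictors(columns, n_groups=2, rng=random.Random(0))
print(groups)
```

Because the seeds are random, the exact grouping depends on the random generator; what the sketch guarantees is only that every group is non-empty and every predictor lands in exactly one group.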

Assignees

Inventors

Classifications

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Clustering or classification · CPC title

  • Machine learning · CPC title

  • G06N20/20 · Primary

    Ensemble learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12346784B2 cover?
One or more computer processors group a plurality of predictors contained in training data into a plurality of predictor groups. The one or more computer processors create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups. The one or more computer pro…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).