Intelligent scoring of missing data records

US12346784B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12346784-B2
Application number	US-202017022734-A
Country	US
Kind code	B2
Filing date	Sep 16, 2020
Priority date	Sep 16, 2020
Publication date	Jul 1, 2025
Grant date	Jul 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One or more computer processors group a plurality of predictors contained in training data into a plurality of predictor groups. The one or more computer processors create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups. The one or more computer processors create a cluster model for each created sample set in the plurality of created sample sets. The one or more computer processors generate a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets.
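The abstract's four steps (group predictors, draw sample sets, fit one cluster model per sample set, score a record that has gaps) can be illustrated with a small sketch. This is not the patented implementation. The helper names `kmeans`, `build_models`, and `score` are hypothetical, the predictor groups are supplied by hand, k-means stands in for the unspecified cluster model, and the score is assumed to be the mean distance to the nearest cluster center over the sample sets whose predictors are all present in the record.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns a list of k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def build_models(rows, groups, per_set=2, k=2, seed=0):
    """Assumption: one sample set per predictor group, one cluster model per set."""
    rng = random.Random(seed)
    models = []
    for group in groups:
        cols = sorted(rng.sample(group, min(per_set, len(group))))
        points = [[row[c] for c in cols] for row in rows]
        models.append((cols, kmeans(points, k, seed=seed)))
    return models

def score(record, models):
    """Mean nearest-center distance over sample sets the record fully covers.

    Missing values are represented as None; returns None if no set is usable.
    """
    dists = []
    for cols, centers in models:
        if all(record[c] is not None for c in cols):
            vec = [record[c] for c in cols]
            dists.append(min(math.dist(vec, ctr) for ctr in centers))
    return sum(dists) / len(dists) if dists else None

# Illustrative training data: four complete records, four predictors.
rows = [
    [0.0, 0.1, 5.0, 5.1],
    [0.2, 0.0, 5.2, 4.9],
    [9.0, 9.1, 1.0, 1.2],
    [9.2, 8.9, 0.9, 1.1],
]
groups = [[0, 1], [2, 3]]  # hand-picked predictor groups (column indices)
models = build_models(rows, groups)

# Predictor 2 is missing, so only the sample set built from group [0, 1] is used.
print(score([0.1, 0.05, None, 5.0], models))
```

The point of the design is that a record with gaps is never imputed: it is scored only by the cluster models whose sample sets avoid the missing predictors.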

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: monitoring, by one or more computer processors, a database server for training data with one or more missing values; grouping, by one or more computer processors, a plurality of predictors contained in the training data into a plurality of predictor groups, wherein a number of predictors in the plurality of predictors associated with records with the one or more missing values is less than a square root of a total number of predictors; creating, by one or more computer processors, a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups; training, by one or more computer processors, a cluster model for each created sample set in the plurality of created sample sets; and generating, by one or more computer processors, a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprising: responsive to a created top sample set, generating, by one or more computer processors, the score utilizing an ensemble score defined by a distance between a formed vector to each respective center of each cluster associated with each sample set in the top sample set.

2. The computer-implemented method of claim 1, wherein grouping the plurality of predictors contained in the training data into the plurality of predictor groups, comprises: creating, by one or more computer processors, the plurality of predictor groups; randomly assigning, by one or more computer processors, a predictor in the plurality of predictors to each created predictor group until each predictor group in the plurality of predictor groups has at least one assigned predictor; and assigning, by one or more computer processors, each remaining predictor in the plurality of predictors into a respective predictor group by utilizing one or more correlations between each remaining predictor in the plurality of predictors and each predictor group in the plurality of predictors.

3. The computer-implemented method of claim 1, wherein generating the score for the record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprises: reducing, by one or more computer processors, the record with one or more missing values into three categories: suitable, inexact suitable, and not suitable based on one or more relationships between a record with one or more missing values, each sample set in the plurality of sample sets, and associated cluster models.

4. The computer-implemented method of claim 3, further comprising: calculating, by one or more computer processors, a cluster center vector for each cluster model associated with each created sample set in the plurality of created sample sets.

5. The computer-implemented method of claim 4, further comprising: creating, by one or more computer processors, the top sample set from the plurality of samples sets based the category of the reduced record with one or more missing values.

6. The computer-implemented method of claim 5, further comprising: ensemble scoring, by one or more computer processors, the record with one or more missing values utilizing a calculated distance between a formed vector to each calculated cluster center associated with each cluster model associated with each sample set in the top sample set, wherein the formed vector represents the record with one or more missing values.

7. The computer-implemented method of claim 6, further comprising: generating, by one or more computer processors, the score for the record with one or more missing values utilizing a correlation distance between the formed vector and each associated cluster center in the top sample set as a continuous value.

8. The computer-implemented method of claim 6, further comprising: generating, by one or more computer processors, the score for the record with one or more missing values utilizing a correlation distance between the formed vector and each associated cluster center in the top sample set as a weight in a categorical scoring process.

9. A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising: program instructions to monitor a database server for training data with one or more missing values; program instructions to group a plurality of predictors contained in the training data into a plurality of predictor groups, wherein a number of predictors in the plurality of predictors associated with records with the one or more missing values is less than a square root of a total number of predictors; program instructions to create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups; program instructions to train a cluster model for each created sample set in the plurality of created sample sets; and program instructions to generate a score for a record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprising: program instructions to responsive to a created top sample set, generate the score utilizing an ensemble score defined by a distance between a formed vector to each respective center of each cluster associated with each sample set in the top sample set.

10. The computer program product of claim 9, wherein the program instructions, to generate the score for the record with one or more missing values utilizing at least one created cluster model of the created cluster models and at least one created sample set of the created sample sets, comprise: program instructions to reduce the record with one or more missing values into three categories: suitable, inexact suitable, and not suitable based on one or more relationships between a record with one or more missing values, each sample set in the plurality of sample sets, and associated cluster models.

11. The computer program product of claim 10, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to calculate a cluster center vector for each cluster model associated with each created sample set in the plurality of created sample sets.

12. The computer program product of claim 11, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to create the top sample set from the plurality of samples sets based the category of the reduced record with one or more missing values.

13. The computer program product of claim 12, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to ensemble score the record with one or more missing values utilizing a calculated distance between a formed vector to each calculated cluster center associated with each cluster model associated with each sample set in the top sample set, wherein the formed vector represents the record with one or more missing values.

14. The computer program product of claim 13, wherein the program instructions, stored on the o
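The grouping procedure recited in claim 2 and the correlation distance named in claims 7 and 8 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `pearson`, `group_predictors`, and `correlation_distance` are hypothetical names, a predictor's affinity to a group is assumed to be its mean absolute Pearson correlation with the group's current members, and correlation distance is assumed to take the common form of one minus the Pearson correlation.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def group_predictors(columns, n_groups, seed=0):
    """columns maps predictor name -> list of values; returns a list of name lists."""
    rng = random.Random(seed)
    names = list(columns)
    rng.shuffle(names)
    # Step 1: randomly seed every group with one predictor so none is empty.
    groups = [[name] for name in names[:n_groups]]
    # Step 2: put each remaining predictor in the group it correlates with most
    # (assumption: mean absolute Pearson correlation with current members).
    for name in names[n_groups:]:
        def affinity(group):
            total = sum(abs(pearson(columns[name], columns[m])) for m in group)
            return total / len(group)
        max(groups, key=affinity).append(name)
    return groups

def correlation_distance(vec, center):
    """Assumed definition: 1 - Pearson correlation, so 0 means perfectly correlated."""
    return 1.0 - pearson(vec, center)

# Illustrative predictors: a and b move together, as do c and d.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [4, 1, 3, 2],
    "d": [8, 2, 6.5, 4],
}
groups = group_predictors(columns, n_groups=2)
print(groups)
print(correlation_distance([1, 2, 3], [2, 4, 6]))
```

Because the seed predictors are chosen at random, two strongly correlated predictors can end up seeding different groups, so the resulting groups vary with the seed.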

Assignees

IBM

Inventors

Classifications

G06F16/215 · Primary
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06F16/285
Clustering or classification · CPC title
G06N20/00
Machine learning · CPC title
G06N20/20 · Primary
Ensemble learning · CPC title

Patent family

Related publications grouped by family.

Patent family 80626803

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12346784B2 cover? One or more computer processors group a plurality of predictors contained in training data into a plurality of predictor groups. The one or more computer processors create a plurality of sample sets, wherein each sample set in the plurality of sample sets contains one or more predictors selected from a respective predictor group in the plurality of predictor groups. The one or more computer pro…
Who is the assignee on this patent? IBM
What technology area does this patent fall under? Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published? Publication date Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Flexible imputation of missing data

System and method for efficient generation of machine-learning models

Systems and techniques for determining the predictive value of a feature

Compatibility prediction based on object attributes

Missing value imputation for predictive models

Machine learning classifier

Parallel Processing with Cooperative Multitasking
