Smart sampling of noisy labels using artificial intelligence

US12561614B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12561614-B2
Application number: US-202318100307-A
Country: US
Kind code: B2
Filing date: Jan 23, 2023
Priority date: Jan 23, 2023
Publication date: Feb 24, 2026
Grant date: Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the disclosure relate to smart sampling of noisy labels using artificial intelligence. A computing platform may receive a dataset of primarily unlabeled data points. The computing platform may apply undersampling to the unlabeled data points to reduce imbalance. The computing platform may assign a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points. The computing platform may compute a heuristic score for each data point and rank the data points based on the heuristic score. The computing platform may subsample the dataset by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, with two different k values, to identify untrustworthy labels. The computing platform may provide or transmit a trustworthy resulting dataset to a machine learning (ML) model.
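
The first three steps in the abstract (undersampling, candidate labeling, and scoring) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the 0.8 target fraction, and the use of a rank divided by n-1 as the normalized score are assumptions made for illustration. The patent only states that undersampling may target a predetermined percentage and that the score may be a ranking normalized to a number between 0 and 1.

```python
import random

def undersample_unlabeled(labeled, unlabeled, unlabeled_fraction=0.8, seed=0):
    """Randomly keep unlabeled points so they make up roughly
    `unlabeled_fraction` of the combined dataset (hypothetical target)."""
    if not 0 < unlabeled_fraction < 1:
        raise ValueError("unlabeled_fraction must be strictly between 0 and 1")
    target = round(len(labeled) * unlabeled_fraction / (1 - unlabeled_fraction))
    target = min(target, len(unlabeled))
    return random.Random(seed).sample(unlabeled, target)

def assign_candidate_labels(labeled, unlabeled):
    """Labeled points are the positive class (1); unlabeled points get a
    candidate negative label (0) with no manual labeling."""
    return [(p, 1) for p in labeled] + [(p, 0) for p in unlabeled]

def rank_normalize(raw_scores):
    """Turn raw heuristic scores into ranks scaled to [0, 1].
    Ties are broken by input order (an assumption of this sketch)."""
    n = len(raw_scores)
    if n <= 1:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: raw_scores[i])
    normalized = [0.0] * n
    for rank, idx in enumerate(order):
        normalized[idx] = rank / (n - 1)
    return normalized

# Toy data: 10 labeled points and 1,000 unlabeled points (plain integers).
labeled = list(range(10))
unlabeled = list(range(100, 1100))
kept = undersample_unlabeled(labeled, unlabeled, unlabeled_fraction=0.8)
dataset = assign_candidate_labels(labeled, kept)
print(len(kept), len(dataset))          # 40 50
print(rank_normalize([5.0, 1.0, 3.0]))  # [1.0, 0.0, 0.5]
```

With 10 labeled points and a 0.8 target, 40 unlabeled points are kept, so the unlabeled share drops from about 99% to 80%.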

First claim

Opening claim text (preview).

We claim:

1. A smart subsampling method for constructing a trustworthy training dataset for machine learning using a k-Nearest Neighbors (k-NN) algorithm with different k values and more than one threshold selection mechanism, the method comprising: at a computing platform comprising at least one processor, and memory:
    receiving, from one or more hardware devices, a dataset of labeled and unlabeled data points, wherein the dataset comprises primarily unlabeled data points;
    applying undersampling to the unlabeled data points to reduce imbalance between the labeled data points and the unlabeled data points;
    assigning a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points, wherein the candidate label for each unlabeled data point indicates a negative class and a label for each labeled data point indicates a positive class;
    computing a heuristic score for each data point in the dataset;
    ranking the data points in the dataset based on the heuristic score computed for each data point;
    subsampling the dataset by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, where k is 1 or 5, to identify untrustworthy labels,
    wherein subsampling the unlabeled data points comprises: removing unlabeled data points having a heuristic score greater than a first threshold; using a 1-NN algorithm to identify, for each unlabeled data point, a label of its single nearest-neighboring data point; and removing unlabeled data points having a heuristic score greater than a second threshold and where the label of the single nearest-neighboring data point is of the positive class, wherein the first threshold is greater than the second threshold, and
    wherein subsampling the labeled data points comprises: removing labeled data points having a heuristic score less than a third threshold; using a 5-NN algorithm to identify, for each labeled data point, labels of its five nearest-neighboring data points; and removing labeled data points having a heuristic score less than a fourth threshold and where the labels of all five of its nearest-neighboring data points are of the negative class, wherein the third threshold is less than the fourth threshold; and
    transmitting a trustworthy resulting dataset to a machine learning (ML) model.

2. The method of claim 1, wherein the heuristic score is a ranking that is normalized to a number between 0 and 1.

3. The method of claim 1, wherein the first, second, third, and fourth thresholds are different from one another.

4. The method of claim 1, wherein the first threshold is 0.95, the second threshold is 0.6, the third threshold is 0.05, and the fourth threshold is 0.4.

5. The method of claim 1, wherein the dataset of labeled and unlabeled data points is obtained from a database of prior financial transactions associated with customer accounts.

6. The method of claim 1, wherein the ML model outputs as a visualization a summary ranking of customer accounts having financial opportunities.

7. The method of claim 1, wherein the ML model infers a likelihood that an electronic message comprises spam.

8. The method of claim 1, wherein the ML model infers a likelihood that an electronic message comprises a subject matter of interest.

9. The method of claim 1, wherein applying undersampling to the unlabeled data points comprises random subsampling to produce a dataset with a predetermined percentage of unlabeled data points.

10. The method of claim 1, wherein the heuristic score is computed based on a set of predefined rules.

11. A system configured to construct a trustworthy training dataset for machine learning from a dataset of primarily unlabeled data points, the system comprising: one or more processors; and memory storing a software package that, when executed by the one or more processors, cause the system to:
    receive a dataset of labeled and unlabeled data points, wherein the dataset comprises primarily unlabeled data points;
    apply undersampling to the unlabeled data points to reduce imbalance between labeled data points and the unlabeled data points;
    assign a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points, wherein the candidate label for each unlabeled data point indicates a negative class and a label for each labeled data point indicates a positive class;
    compute a heuristic score for each data point in the dataset;
    rank the data points in the dataset based on the heuristic score computed for each data point;
    subsample the dataset, via a software package, by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, where k is 1 or 5, to identify untrustworthy labels,
    wherein subsampling the unlabeled data points comprises: removing unlabeled data points having a heuristic score greater than a first threshold; using a 1-NN algorithm to identify, for each unlabeled data point, a label of its single nearest-neighboring data point; and removing unlabeled data points having a heuristic score greater than a second threshold and where the label of the single nearest-neighboring data point is of the positive class, wherein the first threshold is greater than the second threshold, and
    wherein the subsampling the labeled data points comprises: removing labeled data points having a heuristic score less than a third threshold; using a 5-NN algorithm to identify, for each labeled data point, labels of its five nearest-neighboring data points; and removing labeled data points having a heuristic score less than a fourth threshold and where the labels of all five of its nearest-neighboring data points are of the negative class, wherein the third threshold is less than the fourth threshold; and
    transmit a trustworthy resulting dataset to a machine learning (ML) model.

12. The system of claim 11, wherein the heuristic score is a ranking that is normalized to a number between 0 and 1.

13. The system of claim 11, wherein the first, second, third, and fourth thresholds are different from one another.

14. The system of claim 11, wherein the first threshold is 0.95, the second threshold is 0.6, the third threshold is 0.05, and the fourth threshold is 0.4.

15. The system of claim 11, wherein the dataset of labeled and unlabeled data points is obtained from a database of prior financial transactions associated with customer accounts.

16. The system of claim 11, wherein the ML model outputs as a visualization a summary ranking of customer accounts having financial opportunities.

17. The system of claim 11, wherein the ML model infers a likelihood that an electronic message comprises spam.

18. The system of claim 11, wherein applying undersampling to the unlabeled data points comprises random subsampling to produce a dataset with a predetermined percentage of unlabeled data points.

19. The system of claim 11, wherein the heuristic score is computed based on a set of predefined rules.

20. A method for executing a machine learning (ML) model, the method comprising: at a computing platform comprising at least one processor, and memory: training a machine learning (ML) model; testing the ML model; and deploying the ML model, wherein the ML model is trained by: receiving, from one or more hardware devices, a dataset of
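
The two-sided filter recited in claim 1, with the threshold values from claim 4, can be sketched as follows. This is an illustrative reading of the claim language, not the patentee's code. The function names, the brute-force Euclidean distance, and the choice to search for neighbors across the whole dataset before any point is removed are assumptions; the claims do not specify a distance metric or the order of neighbor lookup.

```python
import math

def nearest_labels(index, features, labels, k):
    """Labels of the k nearest neighbors of point `index`
    (brute-force Euclidean distance; the metric is an assumption)."""
    dists = sorted(
        (math.dist(features[index], features[j]), j)
        for j in range(len(features)) if j != index
    )
    return [labels[j] for _, j in dists[:k]]

def smart_subsample(features, labels, scores, t1=0.95, t2=0.6, t3=0.05, t4=0.4):
    """Return indices of points whose labels are treated as trustworthy.
    labels: 1 = labeled (positive), 0 = unlabeled (candidate negative).
    scores: heuristic scores normalized to [0, 1]."""
    keep = []
    for i, (label, score) in enumerate(zip(labels, scores)):
        if label == 0:
            # Unlabeled side: a high score suggests a hidden positive.
            if score > t1:
                continue
            if score > t2 and nearest_labels(i, features, labels, 1) == [1]:
                continue
        else:
            # Labeled side: a low score suggests a doubtful positive.
            if score < t3:
                continue
            if score < t4 and nearest_labels(i, features, labels, 5) == [0] * 5:
                continue
        keep.append(i)
    return keep

# Toy one-dimensional dataset (all values are made up for illustration).
features = [(0,), (10,), (11,), (20,), (21,), (40,),
            (30,), (28,), (29,), (31,), (32,), (33,)]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
scores = [0.97, 0.70, 0.90, 0.70, 0.20, 0.01,
          0.30, 0.10, 0.10, 0.10, 0.10, 0.10]
kept = smart_subsample(features, labels, scores)
print(kept)  # [2, 3, 4, 7, 8, 9, 10, 11]
```

In the toy data, point 0 is dropped by the first threshold, point 1 by the second threshold plus a positive nearest neighbor, point 5 by the third threshold, and point 6 by the fourth threshold plus five negative neighbors. Point 3 has the same score as point 1 but survives because its nearest neighbor is negative.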

Assignees

Bank Of America

Classifications

  • G06N20/00 (Primary)

    Machine learning (CPC title)

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12561614B2 cover?
Aspects of the disclosure relate to smart sampling of noisy labels using artificial intelligence. A computing platform may receive a dataset of primarily unlabeled data points. The computing platform may apply undersampling to the unlabeled data points to reduce imbalance. The computing platform may assign a candidate label to each unlabeled data point in the dataset without a human manually la…
Who is the assignee on this patent?
Bank Of America
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).