Real-time minimal vector labeling scheme for supervised machine learning
US-2022083815-A1 · Mar 17, 2022 · US
US12561614B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12561614-B2 |
| Application number | US-202318100307-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 23, 2023 |
| Priority date | Jan 23, 2023 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
Official abstract text for this publication.
Aspects of the disclosure relate to smart sampling of noisy labels using artificial intelligence. A computing platform may receive a dataset of primarily unlabeled data points. The computing platform may apply undersampling to the unlabeled data points to reduce imbalance. The computing platform may assign a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points. The computing platform may compute a heuristic score for each data point and rank the data points based on the heuristic score. The computing platform may subsample the dataset by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, with two different k values, to identify untrustworthy labels. The computing platform may provide or transmit a trustworthy resulting dataset to a machine learning (ML) model.
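The preparation steps named in the abstract (undersampling, candidate labeling, and score ranking) can be sketched roughly as below. This is a minimal illustration, not the patented implementation: the function names, the `unlabeled_fraction` parameter, and the choice to normalize ranks by dividing by `n - 1` are assumptions introduced here. The raw heuristic scores are taken as given, since the abstract does not say how they are computed.

```python
import random

POSITIVE, NEGATIVE = 1, 0  # labeled points are positive, unlabeled get a candidate negative label


def undersample_unlabeled(labeled, unlabeled, unlabeled_fraction=0.8, seed=0):
    """Randomly keep enough unlabeled points to make up `unlabeled_fraction` of the result.

    `unlabeled_fraction` is a hypothetical stand-in for a predetermined percentage.
    """
    if not 0.0 < unlabeled_fraction < 1.0:
        raise ValueError("unlabeled_fraction must be strictly between 0 and 1")
    # Solve u / (u + len(labeled)) == unlabeled_fraction for u.
    target = int(round(unlabeled_fraction * len(labeled) / (1.0 - unlabeled_fraction)))
    target = min(target, len(unlabeled))
    return random.Random(seed).sample(unlabeled, target)


def assign_candidate_labels(labeled, unlabeled):
    """Attach labels without manual review: labeled -> positive, unlabeled -> negative."""
    return [(x, POSITIVE) for x in labeled] + [(x, NEGATIVE) for x in unlabeled]


def rank_normalized_scores(raw_scores):
    """Replace each raw heuristic score by its rank, scaled to the range [0, 1].

    Ties are broken by input order (an assumption; the source does not address ties).
    """
    n = len(raw_scores)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: raw_scores[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = rank / (n - 1)
    return scores
```

With 10 labeled points and a target of 80% unlabeled, `undersample_unlabeled` keeps 40 unlabeled points, whatever the size of the original unlabeled pool.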
Opening claim text (preview).
We claim:

1. A smart subsampling method for constructing a trustworthy training dataset for machine learning using a k-Nearest Neighbors (k-NN) algorithm with different k values and more than one threshold selection mechanism, the method comprising:
at a computing platform comprising at least one processor, and memory:
receiving, from one or more hardware devices, a dataset of labeled and unlabeled data points, wherein the dataset comprises primarily unlabeled data points;
applying undersampling to the unlabeled data points to reduce imbalance between the labeled data points and the unlabeled data points;
assigning a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points, wherein the candidate label for each unlabeled data point indicates a negative class and a label for each labeled data point indicates a positive class;
computing a heuristic score for each data point in the dataset;
ranking the data points in the dataset based on the heuristic score computed for each data point;
subsampling the dataset by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, where k is 1 or 5, to identify untrustworthy labels,
wherein subsampling the unlabeled data points comprises:
removing unlabeled data points having a heuristic score greater than a first threshold;
using a 1-NN algorithm to identify, for each unlabeled data point, a label of its single nearest-neighboring data point; and
removing unlabeled data points having a heuristic score greater than a second threshold and where the label of the single nearest-neighboring data point is of the positive class, wherein the first threshold is greater than the second threshold, and
wherein subsampling the labeled data points comprises:
removing labeled data points having a heuristic score less than a third threshold;
using a 5-NN algorithm to identify, for each labeled data point, labels of its five nearest-neighboring data points; and
removing labeled data points having a heuristic score less than a fourth threshold and where the labels of all five of its nearest-neighboring data points are of the negative class, wherein the third threshold is less than the fourth threshold; and
transmitting a trustworthy resulting dataset to a machine learning (ML) model.

2. The method of claim 1, wherein the heuristic score is a ranking that is normalized to a number between 0 and 1.

3. The method of claim 1, wherein the first, second, third, and fourth thresholds are different from one another.

4. The method of claim 1, wherein the first threshold is 0.95, the second threshold is 0.6, the third threshold is 0.05, and the fourth threshold is 0.4.

5. The method of claim 1, wherein the dataset of labeled and unlabeled data points is obtained from a database of prior financial transactions associated with customer accounts.

6. The method of claim 1, wherein the ML model outputs as a visualization a summary ranking of customer accounts having financial opportunities.

7. The method of claim 1, wherein the ML model infers a likelihood that an electronic message comprises spam.

8. The method of claim 1, wherein the ML model infers a likelihood that an electronic message comprises a subject matter of interest.

9. The method of claim 1, wherein applying undersampling to the unlabeled data points comprises random subsampling to produce a dataset with a predetermined percentage of unlabeled data points.

10. The method of claim 1, wherein the heuristic score is computed based on a set of predefined rules.

11. A system configured to construct a trustworthy training dataset for machine learning from a dataset of primarily unlabeled data points, the system comprising:
one or more processors; and
memory storing a software package that, when executed by the one or more processors, cause the system to:
receive a dataset of labeled and unlabeled data points, wherein the dataset comprises primarily unlabeled data points;
apply undersampling to the unlabeled data points to reduce imbalance between labeled data points and the unlabeled data points;
assign a candidate label to each unlabeled data point in the dataset without a human manually labeling the unlabeled data points, wherein the candidate label for each unlabeled data point indicates a negative class and a label for each labeled data point indicates a positive class;
compute a heuristic score for each data point in the dataset;
rank the data points in the dataset based on the heuristic score computed for each data point;
subsample the dataset, via a software package, by comparing the heuristic score for each data point against more than one threshold and applying a k-Nearest Neighbors (k-NN) algorithm, where k is 1 or 5, to identify untrustworthy labels,
wherein subsampling the unlabeled data points comprises:
removing unlabeled data points having a heuristic score greater than a first threshold;
using a 1-NN algorithm to identify, for each unlabeled data point, a label of its single nearest-neighboring data point; and
removing unlabeled data points having a heuristic score greater than a second threshold and where the label of the single nearest-neighboring data point is of the positive class, wherein the first threshold is greater than the second threshold, and
wherein the subsampling the labeled data points comprises:
removing labeled data points having a heuristic score less than a third threshold;
using a 5-NN algorithm to identify, for each labeled data point, labels of its five nearest-neighboring data points; and
removing labeled data points having a heuristic score less than a fourth threshold and where the labels of all five of its nearest-neighboring data points are of the negative class, wherein the third threshold is less than the fourth threshold; and
transmit a trustworthy resulting dataset to a machine learning (ML) model.

12. The system of claim 11, wherein the heuristic score is a ranking that is normalized to a number between 0 and 1.

13. The system of claim 11, wherein the first, second, third, and fourth thresholds are different from one another.

14. The system of claim 11, wherein the first threshold is 0.95, the second threshold is 0.6, the third threshold is 0.05, and the fourth threshold is 0.4.

15. The system of claim 11, wherein the dataset of labeled and unlabeled data points is obtained from a database of prior financial transactions associated with customer accounts.

16. The system of claim 11, wherein the ML model outputs as a visualization a summary ranking of customer accounts having financial opportunities.

17. The system of claim 11, wherein the ML model infers a likelihood that an electronic message comprises spam.

18. The system of claim 11, wherein applying undersampling to the unlabeled data points comprises random subsampling to produce a dataset with a predetermined percentage of unlabeled data points.

19. The system of claim 11, wherein the heuristic score is computed based on a set of predefined rules.

20. A method for executing a machine learning (ML) model, the method comprising:
at a computing platform comprising at least one processor, and memory:
training a machine learning (ML) model;
testing the ML model; and
deploying the ML model, wherein the ML model is trained by:
receiving, from one or more hardware devices, a dataset of
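The two-sided subsampling recited in claim 1, using the threshold values given in claim 4, can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the function names are hypothetical, Euclidean distance and a brute-force neighbor search are assumed because the claims name no distance metric, and neighbors are looked up in the full dataset before any removals because the claims do not fix that order.

```python
import math

POSITIVE, NEGATIVE = 1, 0


def nearest_labels(index, points, labels, k):
    """Labels of the k nearest neighbors of points[index] (brute force, Euclidean distance)."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return [labels[j] for j in others[:k]]


def subsample(points, labels, scores, t1=0.95, t2=0.6, t3=0.05, t4=0.4):
    """Return the indices of data points whose labels are treated as trustworthy.

    points: coordinate tuples; labels: POSITIVE for labeled points, NEGATIVE for
    unlabeled points carrying a candidate label; scores: heuristic scores in [0, 1].
    Default thresholds are the values listed in claim 4.
    """
    keep = []
    for i, (label, score) in enumerate(zip(labels, scores)):
        if label == NEGATIVE:
            # Unlabeled side: a very high score alone is enough to drop the point.
            if score > t1:
                continue
            # A moderately high score is dropped only if the single nearest neighbor is positive.
            if score > t2 and nearest_labels(i, points, labels, 1) == [POSITIVE]:
                continue
        else:
            # Labeled side: a very low score alone is enough to drop the point.
            if score < t3:
                continue
            # A moderately low score is dropped only if all five nearest neighbors are negative.
            neighbors = nearest_labels(i, points, labels, 5)
            if score < t4 and len(neighbors) == 5 and all(n == NEGATIVE for n in neighbors):
                continue
        keep.append(i)
    return keep
```

The asymmetry between the two sides is deliberate in the claim language: one positive nearest neighbor is enough to distrust a candidate negative label, while a positive label is distrusted only when all five nearest neighbors are negative.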
Machine learning · CPC title