Training data acquisition method and device, server and storage medium
US-2021182611-A1 · Jun 17, 2021 · US
US11645515B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11645515-B2 |
| Application number | US-201916571323-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 16, 2019 |
| Priority date | Sep 16, 2019 |
| Publication date | May 9, 2023 |
| Grant date | May 9, 2023 |
Official abstract text for this publication.
Embodiments relate to a system, program product, and method for automatically determining which activation data points in a neural model have been poisoned to erroneously indicate association with a particular label or labels. A neural network is trained using potentially poisoned training data. Each of the training data points is classified using the network to retain the activations of the last hidden layer, and segment those activations by the label of corresponding training data. Clustering is applied to the retained activations of each segment, and a cluster assessment is conducted for each cluster associated with each label to distinguish clusters with potentially poisoned activations from clusters populated with legitimate activations. The assessment includes executing a set of analyses and integrating the results of the analyses into a determination as to whether a training data set is poisonous based on determining if resultant activation clusters are poisoned.
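The segment-then-cluster step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the activations are hand-written 2-D stand-ins for last-hidden-layer outputs (the training and classification steps are omitted), `two_means` and `flag_suspect_clusters` are hypothetical helper names, and the relative-size threshold is a single illustrative heuristic that the source does not specify.

```python
from collections import defaultdict


def dist2(a, b):
    """Squared Euclidean distance between two activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iters=20):
    """Tiny 2-means clustering; returns a 0/1 assignment per point.

    Deterministic init (an assumption for reproducibility): the first
    point and the point farthest from it serve as starting centers.
    """
    centers = [list(points[0]),
               list(max(points, key=lambda p: dist2(p, points[0])))]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def flag_suspect_clusters(activations, labels, size_threshold=0.35):
    """Segment activations by label, cluster each segment, and flag the
    smaller cluster when its share of the segment is below the threshold.

    The size heuristic is an illustrative assumption, not taken from the patent.
    """
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    suspects = {}
    for lab, idxs in by_label.items():
        assign = two_means([activations[i] for i in idxs])
        ones = sum(assign)
        zeros = len(assign) - ones
        minority = 1 if ones <= zeros else 0
        frac = min(ones, zeros) / len(assign)
        if 0 < frac < size_threshold:
            suspects[lab] = [i for i, a in zip(idxs, assign) if a == minority]
        else:
            suspects[lab] = []
    return suspects


# Synthetic stand-in activations: label "a" is clean; label "b" carries two
# outlying points (indices 14 and 15) that play the role of poisoned samples.
clean_a = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.0), (0.8, 0.0), (1.0, 0.0)]
clean_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
           (4.9, 5.0), (5.0, 4.9), (4.9, 4.9), (5.1, 4.9)]
poison_b = [(9.0, 9.0), (9.1, 9.0)]

activations = clean_a + clean_b + poison_b
labels = ["a"] * len(clean_a) + ["b"] * (len(clean_b) + len(poison_b))

suspects = flag_suspect_clusters(activations, labels)
print(suspects)
```

In this toy data the clean segment splits evenly and is left alone, while the small outlying cluster under label "b" is flagged. A clean segment can still split unevenly by chance, which is presumably why the abstract describes integrating several analyses rather than relying on one signal such as cluster size.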
Opening claim text (preview).
What is claimed is:

1. A computer system comprising: a processor operatively coupled to memory; and an artificial intelligence (AI) platform, in communication with the processor, having machine learning (ML) tools to process an untrusted data set, the tools comprising: a training manager configured to train a neural model with the untrusted data set; a ML manager, operatively coupled to the training manager, configured to classify each data point in the untrusted data set using the trained neural model, and to retain activations of one or more designated layers in the trained neural model; a cluster manager, operatively coupled to the ML manager, configured to apply a clustering technique on the retained activations for each label, and for each cluster to assess integrity of data in the cluster, including to analyze information from the untrusted data set and the clustered activations, the information comprising content of the data in the untrusted data set, noise distribution data with respect to the untrusted data set, and evidence of a preliminary cluster classification; and a classification manager, operatively coupled to the cluster manager, the classification manager configured to assign a poisonous classification or a legitimate classification to the assessed cluster, the assigned classification corresponding to the integrity assessment.

2. The system of claim 1, wherein the integrity assessment of the cluster data further comprises the cluster manager configured to select a preliminary topic assignment or a topic assignment based on the analysis of the analyzed information.

3. The system of claim 2, wherein the topic assignment based on the analysis further comprises the cluster manager configured to analyze topic text indicative of the poisonous classification or the legitimate classification.

4. The system of claim 1, wherein the evidence of the preliminary cluster classification further comprises the cluster manager configured to analyze one or more of: known classification data associated with the untrusted data set; and/or determined classification data associated with the clustered activations.

5. The system of claim 1, wherein the analysis of the noise distribution data further comprises the cluster manager configured to: select the noise distribution data from the group consisting of: noise data extracted through analysis of the untrusted data set and known noise distribution data provided with the untrusted data set.

6. The system of claim 1, wherein the cluster manager is configured to rank the integrity assessments of the clusters as a function of historical performance.

7. The system of claim 1, wherein the training manager is configured to retrain the neural model based on one or more of the integrity assessments.

8. A computer program product to utilize machine learning to process an untrusted data set, the computer program product comprising: a computer readable storage medium having program code embodied therewith, the program code executable by a processor to: train a neural model with the untrusted data set; classify each data point in the untrusted data set using the trained neural model; retain activations of one or more designated layers in the trained neural model; apply a clustering technique on the retained activations for each label, and for each cluster assess integrity of data in the cluster, including program code executable by the processor to analyze information from the untrusted data set and the clustered activations, the information comprising content of the data in the untrusted data set, noise distribution data with respect to the untrusted data set, and evidence of a preliminary cluster classification; responsive to the analysis, selectively determine a poisonous classification or a legitimate classification of the untrusted data set; and assign the selectively determined classification to the untrusted data set.

9. The computer program product of claim 8, wherein integrity assessment of the cluster data further comprises program code executable by the processor to select a preliminary topic assignment or a topic assignment based on the analysis of the analyzed information.

10. The computer program product of claim 9, wherein the topic assignment based on the analysis further comprises program code executable by the processor to analyze topic text indicative of the poisonous classification or the legitimate classification.

11. The computer program product of claim 8, wherein the evidence of the preliminary cluster classification further comprises program code executable by the processor to analyze one or more of: known classification data associated with the untrusted data set; and/or determined classification data associated with the clustered activations.

12. The computer program product of claim 8, wherein analysis of the noise distribution data further comprises program code executable by the processor to: select the noise distribution data from the group consisting of: noise data extracted through analysis of the untrusted data set and known noise distribution data provided with the untrusted data set.

13. The computer program product of claim 8, further comprising program code executable by the processor to rank the integrity assessments of the clusters as a function of historical performance.

14. A method comprising: receiving, by a neural network, an untrusted data set, each data point of the untrusted data set having a label; training a neural model using the untrusted data set; classifying each data point in the untrusted data set using the trained neural model, and retaining activations of one or more designated layers in the trained neural model; applying a clustering technique on the retained activations for each label; assessing integrity of data in the untrusted data set, including analyzing information from the untrusted data set and the clustered activations, the information comprising content of the data in the untrusted data set, noise distribution data with respect to the untrusted data set, and evidence of a preliminary cluster classification; responsive to the analysis, selectively determining a poisonous classification or a legitimate classification of the untrusted data set; and assigning the selectively determined classification to the untrusted data set.

15. The method of claim 14, wherein the cluster data includes a preliminary topic assignment or a topic assignment based on the analysis of the analyzed information.

16. The method of claim 15, wherein the topic assignment based on the analysis includes topic text indicative of the poisonous classification or the legitimate classification.

17. The method of claim 14, wherein the evidence of the preliminary cluster classification includes one or more of: known classification data associated with the untrusted data set; and/or determined classification data associated with the clustered activations.

18. The method of claim 14, wherein: the noise distribution data is selected from the group consisting of: noise data extracted through analysis of the untrusted data set and known noise distribution data provided with the untrusted data set.

19. The method of claim 14, wherein the assessing integrity of data in the untrusted data set comprises conducting a plurality of integrity assessments, and wherein the method further comprises ranking the integrity assessments of the clusters as a function of historical performance.

20. The method of claim 14,
Supervised learning · CPC title
Feedforward networks · CPC title
using clustering, e.g. of similar faces in social networks · CPC title
Feature selection, e.g. selecting representative features from a multi-dimensional feature space · CPC title
Validation; Performance evaluation · CPC title