Systems and Methods for Increasing Robustness of Machine-Learned Models and Other Software Systems against Adversarial Attacks
US-2020201993-A1 · Jun 25, 2020 · US
US12019747B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12019747-B2 |
| Application number | US-202017068853-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 13, 2020 |
| Priority date | Oct 13, 2020 |
| Publication date | Jun 25, 2024 |
| Grant date | Jun 25, 2024 |
Official abstract text for this publication.
One or more computer processors determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method. The one or more computer processors generate a plurality of interpolated adversarial images ranging between a pair of images utilizing the adversarial training method, wherein each image in the pair of images is from a different class. The one or more computer processors detect a backdoor associated with the untrusted model utilizing the generated plurality of interpolated adversarial images. The one or more computer processors harden the untrusted model by training the untrusted model with the generated plurality of interpolated adversarial images.
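The interpolation and perturbation steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy logistic model, its weights, the 4-pixel "images", and the helper names `interpolate`, `input_gradient`, and `fgsm_perturb` are all hypothetical, and a single signed-gradient step (in the style of the fast gradient sign method) stands in for whatever adversarial training method is actually used. The only properties it aims to show are that the images lie on a straight line between two inputs of different classes and that each perturbation stays within an L-infinity bound `epsilon`.

```python
import math

def interpolate(first, second, steps):
    """Images linearly interpolated from `first` to `second`, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [(1 - k / (steps - 1)) * a + (k / (steps - 1)) * b
         for a, b in zip(first, second)]
        for k in range(steps)
    ]

def input_gradient(weights, image, label):
    """Gradient of the logistic loss with respect to the pixels (toy linear model)."""
    logit = sum(w * p for w, p in zip(weights, image))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return [(prob - label) * w for w in weights]

def fgsm_perturb(weights, image, label, epsilon):
    """One signed-gradient step; each pixel moves by at most epsilon, then is clipped to [0, 1]."""
    grad = input_gradient(weights, image, label)
    sign = lambda g: (g > 0) - (g < 0)
    return [min(1.0, max(0.0, p + epsilon * sign(g))) for p, g in zip(image, grad)]

# Hypothetical 4-pixel "images" standing in for two inputs of different classes.
first_image = [0.0, 0.2, 0.4, 0.6]
second_image = [1.0, 0.8, 0.6, 0.4]
toy_weights = [0.5, -0.25, 0.75, -1.0]  # assumed weights, not from the patent

interpolated = interpolate(first_image, second_image, steps=5)
adversarial = [
    fgsm_perturb(toy_weights, img, label=0, epsilon=0.05) for img in interpolated
]
```

Because pixel values are clipped to [0, 1] after the step, a perturbed pixel can move by less than `epsilon` but never by more.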
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: determining, by one or more computer processors, a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source; generating, by one or more computer processors, a plurality of adversarial images that are linearly interpolated between a first image and a second image, wherein a classification of the first image is different than a classification of the second image; adding, by one or more computer processors, a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value; analyzing, by one or more computer processors, one or more gradients associated with the plurality of adversarial images and one or more gradients associated with the first image and the second image; responsive to the one or more gradients associated with the plurality of adversarial images are different than the one or more gradients associated with the first image and the second image, detecting, by one or more computer processors, a backdoor associated with the untrusted model utilizing the generated plurality of adversarial images; hardening, by one or more computer processors, the untrusted model by training the untrusted model with the generated plurality of adversarial images; and displaying, by one or more computer processors, the one or more gradients associated with the plurality of adversarial images to a user.
2. The computer-implemented method of claim 1, wherein generating the plurality of adversarial images that are linearly interpolated between the first image and the second image, comprises: iteratively performing, by one or more computer processors, one or more perturbations for each class contained in a testing set towards a specified class into a subset of adversarial images.
3. The computer-implemented method of claim 1, further comprising: monitoring, by one or more computer processors, the untrusted model utilizing human-in-the-loop training methods.
4. The computer-implemented method of claim 3, further comprising: periodically displaying, by one or more computer processors, one or more gradients associated with the untrusted model.
5. The computer-implemented method of claim 1, further comprising: filtering, by one or more computer processors, one or more subsequent inputs that contain the detected backdoor.
6. The computer-implemented method of claim 1, wherein the hardened model is deployed for inference.
7. The computer-implemented method of claim 1, further comprising: receiving, by one or more computer processors, the untrusted model, associated pre-trained weights, a clean testing set, and the adversarial training method, wherein the clean testing set contains a plurality of images with associated labels.
8. A computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising: program instructions to determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source; program instructions to generate a plurality of adversarial images that are linearly interpolated between a first image and a second image wherein the first image and second image, wherein a classification of the first image is different than a classification of the second image; program instructions to add a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value; program instructions to analyze one or more gradients associated with the plurality of adversarial images and one or more gradients associated with the first image and the second image; program instructions to, responsive to the one or more gradients associated with the plurality of adversarial images are different than the one or more gradients associated with the first image and the second image, detect a backdoor associated with the untrusted model utilizing the generated plurality of adversarial images program instructions to harden the untrusted model by training the untrusted model with the generated plurality of adversarial images; and displaying, by one or more computer processors, the one or more gradients associated with the plurality of adversarial images to a user.
9. The computer program product of claim 8, wherein the program instructions, to generate the plurality of interpolated adversarial images that are linearly interpolated between the first image and the second image, comprise: program instructions to iteratively perform one or more perturbations for each class contained in a testing set towards a specified class into a subset of interpolated adversarial images.
10. The computer program product of claim 8, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to monitor the untrusted model utilizing human-in-the-loop training methods.
11. The computer program product of claim 10, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to periodically display one or more gradients associated with the untrusted model.
12. The computer program product of claim 8, wherein the hardened model is deployed for inference.
13. A computer system comprising: one or more computer processors; one or more computer readable storage media; and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the stored program instructions comprising: program instructions to determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source; program instructions to generate a plurality of adversarial images that are linearly interpolated between a first image and a second image wherein the first image and second image, wherein a classification of the first image is different than a classification of the second image; program instructions to add a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value; program instructions to analyze one or more gradients associated with the plurality of ad
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Adversarial learning · CPC title
Machine learning · CPC title
Inference or reasoning models · CPC title
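The claims recite detecting a backdoor when gradients for the interpolated adversarial images differ from gradients for the two endpoint images, but the text shown here does not say how "different" is measured. The Python sketch below is one possible reading under stated assumptions: it uses cosine similarity and a hypothetical threshold `min_similarity`, and the helper names `cosine_similarity` and `flag_divergent_gradients` and all gradient vectors are made up for illustration. It is not the patent's detection rule.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def flag_divergent_gradients(endpoint_grads, interpolated_grads, min_similarity=0.5):
    """Indices of interpolated images whose gradient resembles neither endpoint gradient.

    The cosine measure and the 0.5 threshold are assumptions for illustration.
    """
    flagged = []
    for index, grad in enumerate(interpolated_grads):
        best = max(cosine_similarity(grad, ref) for ref in endpoint_grads)
        if best < min_similarity:
            flagged.append(index)
    return flagged

# Hypothetical input gradients for the first and second (endpoint) images.
endpoint_grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical gradients for three interpolated adversarial images.
interpolated_grads = [
    [0.9, 0.1, 0.0],   # close to the first endpoint's gradient
    [0.1, 0.8, 0.1],   # close to the second endpoint's gradient
    [0.0, 0.1, -1.0],  # points in an unrelated direction
]

suspicious = flag_divergent_gradients(endpoint_grads, interpolated_grads)
print(suspicious)
```

In this toy data only the third gradient falls below the threshold against both endpoints, so it is the one flagged for closer inspection, for example by the human reviewer mentioned in claims 3 and 4.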