Adversarial interpolation backdoor detection

US12019747B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12019747-B2
Application number  US-202017068853-A
Country             US
Kind code           B2
Filing date         Oct 13, 2020
Priority date       Oct 13, 2020
Publication date    Jun 25, 2024
Grant date          Jun 25, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One or more computer processors determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method. The one or more computer processors generate a plurality of interpolated adversarial images ranging between a pair of images utilizing the adversarial training method, wherein each image in the pair of images is from a different class. The one or more computer processors detect a backdoor associated with the untrusted model utilizing the generated plurality of interpolated adversarial images. The one or more computer processors harden the untrusted model by training the untrusted model with the generated plurality of interpolated adversarial images.
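
The interpolation step in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names (`interpolate_images`, `project_linf`, `perturb`), the choice of an L-infinity bound, the value of `epsilon`, and the use of a random direction in place of a perturbation produced by the adversarial training method are all assumptions made for illustration.

```python
import numpy as np

def interpolate_images(first, second, steps):
    """Return `steps` images linearly interpolated from `first` to `second`."""
    # alphas run from 0 (first image) to 1 (second image), endpoints included
    alphas = np.linspace(0.0, 1.0, steps).reshape(-1, *([1] * first.ndim))
    return (1.0 - alphas) * first + alphas * second

def project_linf(perturbation, epsilon):
    """Clip a perturbation so that its L-infinity norm is at most `epsilon`."""
    return np.clip(perturbation, -epsilon, epsilon)

def perturb(images, direction, epsilon):
    """Add an epsilon-bounded perturbation and keep pixel values in [0, 1]."""
    delta = project_linf(direction, epsilon)
    return np.clip(images + delta, 0.0, 1.0)

# Hypothetical stand-ins: two tiny "images" assumed to belong to different classes.
rng = np.random.default_rng(0)
first = np.zeros((4, 4))
second = np.ones((4, 4))

path = interpolate_images(first, second, steps=5)
# A random direction is used here only as a placeholder; an adversarial
# training method would normally derive it from the model's loss gradient.
noisy = perturb(path, rng.normal(size=path.shape), epsilon=0.03)
```

In this sketch `path[0]` equals the first image, `path[-1]` equals the second, and every perturbed image stays within `epsilon` of its interpolated source.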

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising:
    determining, by one or more computer processors, a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source;
    generating, by one or more computer processors, a plurality of adversarial images that are linearly interpolated between a first image and a second image, wherein a classification of the first image is different than a classification of the second image;
    adding, by one or more computer processors, a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value;
    analyzing, by one or more computer processors, one or more gradients associated with the plurality of adversarial images and one or more gradients associated with the first image and the second image;
    responsive to the one or more gradients associated with the plurality of adversarial images are different than the one or more gradients associated with the first image and the second image, detecting, by one or more computer processors, a backdoor associated with the untrusted model utilizing the generated plurality of adversarial images;
    hardening, by one or more computer processors, the untrusted model by training the untrusted model with the generated plurality of adversarial images; and
    displaying, by one or more computer processors, the one or more gradients associated with the plurality of adversarial images to a user.

2. The computer-implemented method of claim 1, wherein generating the plurality of adversarial images that are linearly interpolated between the first image and the second image, comprises: iteratively performing, by one or more computer processors, one or more perturbations for each class contained in a testing set towards a specified class into a subset of adversarial images.

3. The computer-implemented method of claim 1, further comprising: monitoring, by one or more computer processors, the untrusted model utilizing human-in-the-loop training methods.

4. The computer-implemented method of claim 3, further comprising: periodically displaying, by one or more computer processors, one or more gradients associated with the untrusted model.

5. The computer-implemented method of claim 1, further comprising: filtering, by one or more computer processors, one or more subsequent inputs that contain the detected backdoor.

6. The computer-implemented method of claim 1, wherein the hardened model is deployed for inference.

7. The computer-implemented method of claim 1, further comprising: receiving, by one or more computer processors, the untrusted model, associated pre-trained weights, a clean testing set, and the adversarial training method, wherein the clean testing set contains a plurality of images with associated labels.

8. A computer program product comprising:
    one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the stored program instructions comprising:
    program instructions to determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source;
    program instructions to generate a plurality of adversarial images that are linearly interpolated between a first image and a second image wherein the first image and second image, wherein a classification of the first image is different than a classification of the second image;
    program instructions to add a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value;
    program instructions to analyze one or more gradients associated with the plurality of adversarial images and one or more gradients associated with the first image and the second image;
    program instructions to, responsive to the one or more gradients associated with the plurality of adversarial images are different than the one or more gradients associated with the first image and the second image, detect a backdoor associated with the untrusted model utilizing the generated plurality of adversarial images;
    program instructions to harden the untrusted model by training the untrusted model with the generated plurality of adversarial images; and
    displaying, by one or more computer processors, the one or more gradients associated with the plurality of adversarial images to a user.

9. The computer program product of claim 8, wherein the program instructions, to generate the plurality of interpolated adversarial images that are linearly interpolated between the first image and the second image, comprise: program instructions to iteratively perform one or more perturbations for each class contained in a testing set towards a specified class into a subset of interpolated adversarial images.

10. The computer program product of claim 8, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to monitor the untrusted model utilizing human-in-the-loop training methods.

11. The computer program product of claim 10, wherein the program instructions, stored on the one or more computer readable storage media, further comprise: program instructions to periodically display one or more gradients associated with the untrusted model.

12. The computer program product of claim 8, wherein the hardened model is deployed for inference.

13. A computer system comprising:
    one or more computer processors;
    one or more computer readable storage media; and
    program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the stored program instructions comprising:
    program instructions to determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method, wherein the norm value maximizes a loss function of the untrusted model on an input while keeping a size of a perturbation smaller than a specified epsilon and the tolerance value is a measure of robustness to a plurality of adversarial attacks of increasing strength, wherein the untrusted model is a neural network retrieved from an unverified source;
    program instructions to generate a plurality of adversarial images that are linearly interpolated between a first image and a second image wherein the first image and second image, wherein a classification of the first image is different than a classification of the second image;
    program instructions to add a perturbation to each generated adversarial image in the plurality of generated adversarial images, wherein the perturbation is adjusted utilizing the determined norm value and the tolerance value;
    program instructions to analyze one or more gradients associated with the plurality of ad
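
The gradient-analysis step recited in the claims can be sketched as follows. This is an illustrative example under stated assumptions, not the claimed method: a small linear softmax classifier stands in for the untrusted neural network, cosine similarity is an assumed way to decide whether gradients are "different", and the names `input_gradient`, `flag_divergent`, and `threshold` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def input_gradient(weights, image, label):
    """Gradient of the cross-entropy loss w.r.t. the input of a linear softmax model."""
    probs = softmax(weights @ image.ravel())
    probs[label] -= 1.0          # dL/dlogits = softmax - one_hot
    return (weights.T @ probs).reshape(image.shape)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.ravel() @ b.ravel() / denom) if denom > 0 else 0.0

def flag_divergent(weights, first, second, label, steps=5, threshold=0.0):
    """Indices of interpolated images whose gradient resembles neither endpoint gradient."""
    g_first = input_gradient(weights, first, label)
    g_second = input_gradient(weights, second, label)
    flagged = []
    for i, alpha in enumerate(np.linspace(0.0, 1.0, steps)):
        image = (1.0 - alpha) * first + alpha * second
        g = input_gradient(weights, image, label)
        # Assumed criterion: flag when similarity to both endpoints is below threshold.
        if max(cosine(g, g_first), cosine(g, g_second)) < threshold:
            flagged.append(i)
    return flagged

# Hypothetical data: random weights and two random 4x4 "images".
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 16))
first = rng.random((4, 4))
second = rng.random((4, 4))
flagged = flag_divergent(weights, first, second, label=0)
```

Flagged indices mark interpolation points that would merit inspection. How such a signal is turned into a backdoor decision, and how the gradients are displayed to a user, is left to the full description.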

Assignees

IBM

Inventors

Classifications

  • Supervised learning

  • Convolutional networks [CNN, ConvNet]

  • Adversarial learning

  • Machine learning

  • Inference or reasoning models

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12019747B2 cover?
One or more computer processors determine a tolerance value and a norm value associated with an untrusted model and an adversarial training method. The one or more computer processors generate a plurality of interpolated adversarial images ranging between a pair of images utilizing the adversarial training method, wherein each image in the pair of images is from a different class. The one or m…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 25, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).