Layered cybersecurity using spurious data samples

US2024283822A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2024283822-A1
Application number: US-202318170492-A
Country: US
Kind code: A1
Filing date: Feb 16, 2023
Priority date: Feb 16, 2023
Publication date: Aug 22, 2024
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In some aspects, a computing system may iterate between adding spurious data to the dataset and training a model on the dataset. If the model's performance has not dropped by more than a threshold amount, then additional spurious data may be added to the dataset until the desired amount of performance decrease has been achieved. The computing system may determine the amount of impact each feature has on a model's output. The computing system may generate a spurious data sample by modifying values of features that are more impactful than other features. The computing system may repeatedly modify the spurious data that is stored in a dataset. If a cybersecurity incident occurs (e.g., the dataset is stolen or leaked), the system may identify when the cybersecurity incident took place based on the spurious data that is stored in the dataset.
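
The iteration described in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the nearest-centroid classifier, the label-flipping generator `make_spurious`, the function `poison_until_degraded`, and the parameters `target_drop` and `max_rounds` are all assumptions chosen to keep the example small and deterministic.

```python
def train_centroids(samples):
    """Fit a toy nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the stored label."""
    correct = sum(predict(centroids, f) == y for f, y in samples)
    return correct / len(samples)


def make_spurious(originals):
    """Hypothetical generator: copy every class-1 sample and mislabel it as class 0."""
    return [(features, 0) for features, label in originals if label == 1]


def poison_until_degraded(originals, target_drop, max_rounds=20):
    """Alternate between adding spurious rows and retraining until the
    accuracy on the original samples has fallen by at least target_drop."""
    baseline = accuracy(train_centroids(originals), originals)
    dataset = list(originals)
    current, rounds = baseline, 0
    while baseline - current < target_drop and rounds < max_rounds:
        dataset.extend(make_spurious(originals))  # add another batch
        current = accuracy(train_centroids(dataset), originals)  # retrain, re-measure
        rounds += 1
    return dataset, baseline, current, rounds


# Toy data (assumed): class 0 at x = 0..9, class 1 at x = 10.5..19.5.
originals = [((float(x),), 0) for x in range(10)] + \
            [((x + 0.5,), 1) for x in range(10, 20)]
dataset, baseline, degraded, rounds = poison_until_degraded(originals, target_drop=0.12)
print(baseline, degraded, rounds, len(dataset))  # 1.0 0.85 2 40
```

In this toy run the first batch lowers accuracy by only 0.10, which is below the assumed target of 0.12, so a second batch is added. The `max_rounds` guard is an addition of this sketch so the loop always terminates.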

First claim

Opening claim text (preview).

What is claimed is:

1. A system for providing an additional layer of data security to prevent malicious actors from using data by modifying a dataset to include spurious data, the system comprising: one or more processors; and a non-transitory, computer readable medium having instructions recorded thereon that, when executed by the one or more processors, cause operations comprising: obtaining a first dataset comprising a set of original data samples, wherein each data sample comprises a label indicating a correct classification; generating a key that indicates a location within the first dataset where spurious data should be stored; generating, based on the set of original data samples, a first set of spurious data samples for the first dataset, wherein the first set of spurious data samples, when used to train a first machine learning model, cause the first machine learning model to generate incorrect output for more than a threshold number of data samples of the set of original data samples; based on the key, adding the first set of spurious data samples to the first dataset; training a machine learning model based on the first dataset; based on a performance metric of the machine learning model satisfying a threshold, generating a second set of spurious data samples; adding the second set of spurious data samples to the first dataset; and based on determining a request to use the first dataset is not associated with a malicious computing device, removing the first set of spurious data samples and the second set of spurious data from the first dataset.

2. A method comprising: obtaining a first dataset comprising a set of original data samples; generating a key that indicates a location within the first dataset where spurious data should be stored; generating, based on the set of original data samples, a first set of spurious data samples for the first dataset; based on the key, adding the first set of spurious data samples to the first dataset; based on determining that the first set of spurious data samples fails to modify performance of a machine learning model, adding a second set of spurious data samples to the first dataset; and based on determining a request to use the first dataset is not associated with a malicious computing device, removing the first set of spurious data samples and the second set of spurious data samples from the first dataset.

3. The method of claim 2, wherein adding the second set of spurious data samples to the first dataset comprises: training a machine learning model based on the first dataset; based on a performance metric of the machine learning model satisfying a threshold, generating a second set of spurious data, wherein the performance metric comprises accuracy, logarithmic loss, F1 score, precision, recall, or mean squared error; and based on the performance metric of the machine learning model satisfying the threshold, adding the second set of spurious data to the first dataset.

4. The method of claim 2, wherein generating the first set of spurious data samples comprises: generating, based on a first data sample of the set of original data samples, an explanation indicating a feature that is more influential than other features of the first data sample for output generated by the machine learning model, the output corresponding to the first data sample; and generating a spurious data sample of the first set of spurious data samples by: generating a copy of the first data sample; and modifying a value of the copy of the first data sample, the value corresponding to the feature.

5. The method of claim 2, wherein the first set of spurious data samples, when used to train the machine learning model, cause the machine learning model to generate incorrect output for more than a threshold number of data samples of the set of original data samples.

6. The method of claim 2, wherein generating the first set of spurious data samples comprises: determining a modification to a value of a first data sample of the set of original data samples, wherein the modification causes the machine learning model to output an incorrect class of the first data sample; and generating a spurious data sample comprising a label corresponding to the first data sample and a result of the modification.

7. The method of claim 2, wherein removing the first set of spurious data samples from the first dataset comprises: determining a computing device that has experienced more than a threshold amount of cyber security attacks within a time period; and based on the computing device having experienced more than the threshold amount of cyber security attacks within the time period, removing the first set of spurious data samples from the first dataset after the computing device has completed preprocessing the first dataset.

8. The method of claim 2, wherein removing the first set of spurious data samples from the first dataset comprises: determining that the first dataset is to be used to train a machine learning model; and based on determining that the first dataset is to be used to train the machine learning model, removing the first set of spurious data samples from the first dataset.

9. The method of claim 2, wherein generating the second set of spurious data samples comprises: comparing output of a first machine learning model with output of a second machine learning model; and based on the output of the first machine learning model satisfying a similarity threshold to the output of the second machine learning model, generating the second set of spurious data samples.

10. The method of claim 2, further comprising steps for generating the first set of spurious data samples.

11. The method of claim 2, wherein the key indicates a plurality of rows within the first dataset where spurious data samples should be placed.

12. A non-transitory, computer-readable medium comprising instructions that when executed by one or more processors, cause operations comprising: obtaining a first dataset comprising a set of original data samples; generating a key that indicates a location within the first dataset where spurious data should be stored; generating, based on the set of original data samples, a first set of spurious data samples for the first dataset; based on the key, adding the first set of spurious data samples to the first dataset; and based on determining a request to use the first dataset is not associated with a malicious computing device, removing the first set of spurious data samples from the first dataset.

13. The medium of claim 12, wherein adding the first set of spurious data samples to the first dataset comprises: training a machine learning model based on the first dataset; based on a performance metric of the machine learning model satisfying a threshold, generating a second set of spurious data, wherein the performance metric comprises accuracy, logarithmic loss, F1 score, precision, recall, or mean squared error; and based on the performance metric of the machine learning model satisfying the threshold, adding the first set of spurious data samples to the first dataset.

14. The medium of claim 12, wherein generating the first set of spurious data samples comprises: generating, based on a first data sample of the set of original data samples, an explanation indicating a feature that is more influential than other features of the first data sample for output generated by a machine learning model, the output corresponding to the first data sample; and generating a spurious data sample of the first set of spurious d
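
The key-based placement and removal recited in the independent claims can be sketched as follows. This is a rough illustration under stated assumptions, not the claimed implementation: the key is modeled as a sorted list of row indices (in the spirit of claim 11), and the names `generate_key`, `add_spurious`, `remove_spurious`, `serve_dataset`, and the boolean `requester_is_trusted` are hypothetical. How a request is judged to be non-malicious is outside the scope of the sketch.

```python
import random


def generate_key(n_original, n_spurious, seed):
    """Choose which row indices of the combined dataset will hold spurious rows."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_original + n_spurious), n_spurious))


def add_spurious(originals, spurious, key):
    """Interleave rows so spurious samples land exactly at the indices in the key."""
    positions = set(key)
    orig_iter, spur_iter = iter(originals), iter(spurious)
    return [next(spur_iter) if i in positions else next(orig_iter)
            for i in range(len(originals) + len(spurious))]


def remove_spurious(dataset, key):
    """Drop the rows named by the key, recovering the original samples in order."""
    positions = set(key)
    return [row for i, row in enumerate(dataset) if i not in positions]


def serve_dataset(dataset, key, requester_is_trusted):
    """Return clean rows to trusted requesters; everyone else sees the mixed data."""
    return remove_spurious(dataset, key) if requester_is_trusted else list(dataset)


# Toy rows (assumed): six original rows and three spurious rows.
originals = [("original", i) for i in range(6)]
spurious = [("spurious", i) for i in range(3)]
key = generate_key(len(originals), len(spurious), seed=42)
protected = add_spurious(originals, spurious, key)

print(serve_dataset(protected, key, requester_is_trusted=True) == originals)  # True
```

Because only the key holder knows which rows are spurious, a copy of `protected` obtained without the key still contains the misleading samples.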

Assignees

Capital One Services LLC

Inventors

Classifications

  • using deception as countermeasure, e.g. honeypots, honeynets, decoys or entrapment · CPC title

  • using machine learning or artificial intelligence · CPC title

  • Event detection, e.g. attack signature detection · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024283822A1 cover?
In some aspects, a computing system may iterate between adding spurious data to the dataset and training a model on the dataset. If the model's performance has not dropped by more than a threshold amount, then additional spurious data may be added to the dataset until the desired amount of performance decrease has been achieved. The computing system may determine the amount of impact each featu…
Who is the assignee on this patent?
Capital One Services LLC
What technology area does this patent fall under?
Primary CPC classification H04L63/1491. Mapped technology areas include Electricity.
When was this patent published?
Publication date Aug 22, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).