Confidence level threshold selection assistance for a data loss prevention system using machine learning

US9691027B1 · US · B1

Patent metadata
Publication number: US-9691027-B1
Application number: US-201113324987-A
Country: US
Kind code: B1
Filing date: Dec 13, 2011
Priority date: Dec 14, 2010
Publication date: Jun 27, 2017
Grant date: Jun 27, 2017

Abstract

Official abstract text for this publication.

Machine-learning based detection (MLD) profiles can be used to identify sensitive information in documents. The MLD profile can be used to generate a confidence value for the document that expresses the degree of confidence with which the MLD profile can classify the document as sensitive or not. In one embodiment, a data loss prevention system provides or suggests a confidence level threshold to a user of the data loss prevention system by providing a confidence level threshold for the MLD profile to the user, the confidence level threshold to be used as the boundary between sensitive data and non-sensitive data. In one embodiment the provided confidence level threshold is determined by scanning a random data set using the MLD profile.
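The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_score` is a hypothetical keyword-ratio stand-in for a trained MLD profile, and the function names are invented for this example. The rule of taking the highest confidence value found in the random data set follows the dependent claims listed further down this page; the abstract itself only says that the threshold is determined by scanning a random data set.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for a trained MLD profile: maps a document to a
# confidence value, where a higher value means "more likely sensitive".
Scorer = Callable[[str], float]


def suggest_threshold(score: Scorer, random_docs: Iterable[str]) -> float:
    """Scan a random data set and return the highest confidence value seen."""
    values = [score(doc) for doc in random_docs]
    if not values:
        raise ValueError("random data set is empty")
    return max(values)


def is_sensitive(score: Scorer, doc: str, threshold: float) -> bool:
    """Classify as sensitive only when the confidence is above the threshold."""
    return score(doc) > threshold


def toy_score(doc: str) -> float:
    """Toy scorer (assumption): fraction of words found in a keyword set."""
    keywords = {"salary", "ssn", "confidential"}
    words = doc.lower().split()
    return sum(w in keywords for w in words) / len(words) if words else 0.0


# The suggested value would be shown to the user as a default that the user
# may still change before it is used as the sensitive/non-sensitive boundary.
random_docs = [
    "weather report for today",
    "confidential is a common word here",
    "football scores",
]
default_threshold = suggest_threshold(toy_score, random_docs)
```

Because random documents are assumed to be non-sensitive, the highest score among them marks a level that ordinary content can reach by chance, so only documents scoring strictly above it are flagged.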

First claim

Opening claim text (preview).

We claim:

1. A method comprising: training, by a processing device, a machine learning-based detection (MLD) profile, wherein the MLD profile is used to classify new data as sensitive data or as non-sensitive data; and providing a confidence level threshold for the MLD profile to a user, wherein the confidence level threshold is used as a boundary between the sensitive data and the non-sensitive data, wherein providing the confidence level threshold comprises setting a default value of a confidence level threshold user interface element to the confidence level threshold, and wherein the confidence level threshold user interface element allows the user to change the confidence level threshold, wherein the MLD profile is used to classify new data by using the MLD profile to assign a confidence value to the new data and classifying the new data as the sensitive data in response to determining that the confidence value of the new data is above the confidence level threshold, wherein the confidence value is from a range of confidence values, and wherein a higher value in the range of confidence values indicates a higher likelihood of being the sensitive data than a lower value in the range of confidence values.

2. The method of claim 1, further comprising scanning a random data set using the MLD profile to determine the confidence level threshold.

3. The method of claim 2, wherein scanning the random data set using the MLD profile comprises determining a confidence value for each random datum in the random data set using the MLD profile.

4. The method of claim 3, wherein each random datum in the random data set comprises a document from a random document set.

5. The method of claim 3, wherein each random datum in the random data set comprises a web page selected randomly from a subset of public web pages.

6. The method of claim 3, wherein scanning the random data set using the MLD profile further comprises selecting the confidence value of the random datum with the highest confidence value as the confidence level threshold.

7. The method of claim 3, wherein the method further comprises testing the confidence level threshold using a test data set that is different from the random data set to determine whether the confidence level threshold has a threshold gap between a distribution of confidence values for ones of the test data set classified as the sensitive data and a distribution of confidence values for ones of the test data set classified as the non-sensitive data.

8. The method of claim 7, further comprising selecting a second random data set in response to the determination that the confidence level threshold does not have the threshold gap between the distribution of confidence values for the ones of the test data set classified as the sensitive data and the distribution of confidence values for the ones of the test data set classified as the non-sensitive data.

9. The method of claim 1, wherein training comprises extracting features using a user-selectable feature extraction algorithm.

10. A non-transitory computer-readable storage medium having instructions stored therein that, when executed by a processing device, cause the processing device to perform operations comprising: training, by the processing device, a machine learning-based detection (MLD) profile, wherein the MLD profile is used to classify new data as sensitive data or as non-sensitive data; scanning a random data set using the MLD profile to determine a confidence level threshold; and providing the confidence level threshold for the MLD profile to a user, wherein the confidence level threshold is used as a boundary between the sensitive data and the non-sensitive data, wherein the MLD profile is used to classify new data by using the MLD profile to assign a confidence value to the new data and classifying the new data as the sensitive data in response to determining that the confidence value of the new data is above the confidence level threshold, wherein the confidence value is from a range of confidence values, and wherein a higher value in the range of confidence values indicates a higher likelihood of being the sensitive data than a lower value in the range of confidence values.

11. The non-transitory computer-readable storage medium of claim 10, wherein providing the confidence level threshold comprises setting a default value of a confidence level threshold user interface element to the confidence level threshold, and wherein the confidence level threshold user interface element allows the user to change the confidence level threshold.

12. The non-transitory computer-readable storage medium of claim 10, wherein scanning the random data set using the MLD profile comprises determining a confidence value for each random datum in the random data set using the MLD profile.

13. The non-transitory computer-readable storage medium of claim 12, wherein each random datum in the random data set comprises a document from a random document set.

14. The non-transitory computer-readable storage medium of claim 12, wherein scanning the random data set using the MLD profile further comprises selecting the confidence value of the random datum with the highest confidence value as the confidence level threshold.

15. The non-transitory computer-readable storage medium of claim 10, wherein training comprises extracting features using a user-selectable feature extraction algorithm.

16. A system comprising: a memory to store instructions for a machine learning manager; and a processing device to execute the instructions to: train a machine learning-based detection (MLD) profile, wherein the MLD profile is used to classify new data as sensitive data or as non-sensitive data; scan a random data set using the MLD profile to determine a confidence level threshold; and provide the confidence level threshold for the MLD profile to a user, wherein the confidence level threshold is used as a boundary between the sensitive data and the non-sensitive data, wherein the MLD profile is used to classify new data by using the MLD profile to assign a confidence value to the new data and classifying the new data as the sensitive data in response to determining that the confidence value of the new data is above the confidence level threshold, wherein the confidence value is from a range of confidence values, and wherein a higher value in the range of confidence values indicates a higher likelihood of being the sensitive data than a lower value in the range of confidence values.

17. The system of claim 16, wherein to scan the random data set the processing device is to determine a confidence value for each random datum in the random data set using the MLD profile, and wherein to determine the confidence level threshold the processing device is to select the confidence value of the random datum with the highest confidence value as the confidence level threshold.

18. The system of claim 17, wherein each random datum in the random data set comprises a document from a random document set, and wherein the processing device is further to test the confidence level threshold using a test document set that is different from the random document set to determine whether the confidence level threshold has a threshold gap between a distribution of confidence values for sensitive ones of the test document set and a distribution of confidence values for non-sensitive ones of the test document set.

19. The system of claim 16, wherein to provide the confidence level threshold the processing device is to s
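Claims 7, 8, and 18 describe testing a candidate threshold against a separate test set to see whether a "threshold gap" separates the confidence values of sensitive and non-sensitive items. The claim text shown here does not define how that gap is measured, so the sketch below rests on an assumption: the gap is the distance between the lowest confidence value above the threshold and the highest confidence value at or below it, compared against a hypothetical `min_gap` parameter. The function name and sample values are illustrative only.

```python
from typing import Sequence


def has_threshold_gap(
    test_confidences: Sequence[float], threshold: float, min_gap: float
) -> bool:
    """Check whether the two sides of the threshold are clearly separated.

    Assumption: "gap" means min(values above threshold) minus
    max(values at or below threshold), and it must be at least min_gap.
    """
    above = [c for c in test_confidences if c > threshold]
    below = [c for c in test_confidences if c <= threshold]
    if not above or not below:
        # With one side empty there is no separation to measure.
        return False
    return min(above) - max(below) >= min_gap


# Illustrative confidence values for a test set, not real measurements.
gap_ok = has_threshold_gap([0.05, 0.10, 0.15, 0.70, 0.85], threshold=0.2, min_gap=0.3)
crowded = has_threshold_gap([0.05, 0.18, 0.22, 0.85], threshold=0.2, min_gap=0.3)
```

When a check like this fails, claim 8 describes selecting a second random data set, which in this sketch would mean deriving a new candidate threshold and running the test again.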

Assignees

Inventors

Classifications

  • G06F21/604 (Primary)

    Tools and structures for managing or administering access control systems · CPC title

  • G06N5/025 (Primary)

    Extracting rules from data · CPC title

  • using filtering, e.g. reduction of information by using priority, element types, position or time · CPC title

  • using management policies (point-in-time backing up or restoration of persistent data G06F11/1446; file migration policies for HSM systems G06F16/185) · CPC title

  • Details of user interfaces specifically adapted to file systems, e.g. browsing and visualisation, 2d or 3d GUIs (query results presentation G06F16/156) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Sawant Shitalkumar S, Shrowty Vikram, Dicorpo Philip, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F21/604. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 27, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).