Detection of confidential information

US9569528B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9569528-B2
Application number  US-24550708-A
Country             US
Kind code           B2
Filing date         Oct 3, 2008
Priority date       Oct 3, 2008
Publication date    Feb 14, 2017
Grant date          Feb 14, 2017

Abstract

Among other aspects disclosed are a method and system for detecting confidential information. The method includes reading stored data and identifying strings within the stored data, where each string includes a sequence of consecutive bytes which all have values that are in a predetermined subset of possible values. For each of at least some of the strings, determining if the string includes bytes representing one or more format matches, wherein a format match includes a set of values that match a predetermined format associated with confidential information. For each format match, testing the values that match the predetermined format with a set of rules associated with the confidential information to determine whether the format match is an invalid format match that includes one or more invalid values and calculating a score for the stored data, based at least in part upon the ratio of a count of invalid format matches to a count of other format matches.
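
The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The printable-ASCII byte subset, the nine-digit format, the "starts with 000" rule, and all function names are assumptions chosen for illustration, and the abstract does not give the exact scoring function, so the sketch only computes the ratio that the score is said to be based on.

```python
import re
from typing import List, Optional, Tuple

# Assumption: the "predetermined subset of possible values" is printable ASCII.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]+")
# Assumption: a standalone run of nine digits stands in for a "predetermined format".
NINE_DIGITS = re.compile(rb"(?<!\d)\d{9}(?!\d)")


def identify_strings(data: bytes) -> List[bytes]:
    """Return maximal runs of consecutive bytes whose values are in the subset."""
    return PRINTABLE_RUN.findall(data)


def is_invalid(match: bytes) -> bool:
    """Hypothetical validity rule: values beginning with 000 are treated as invalid."""
    return match.startswith(b"000")


def count_matches(data: bytes) -> Tuple[int, int]:
    """Count (invalid format matches, other format matches) across all strings."""
    invalid = other = 0
    for string in identify_strings(data):
        for match in NINE_DIGITS.findall(string):
            if is_invalid(match):
                invalid += 1
            else:
                other += 1
    return invalid, other


def invalid_ratio(invalid: int, other: int) -> Optional[float]:
    """Ratio of invalid matches to other matches; None when it is undefined."""
    if other == 0:
        return None
    return invalid / other


sample = b"\x00\x01id 123456789\xff000123456 x 987654321\x00"
invalid, other = count_matches(sample)
print(invalid, other, invalid_ratio(invalid, other))
```

How the ratio is turned into a score and compared against a threshold is left open here, because the abstract only says the score is based at least in part on this ratio.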

First claim

What is claimed is:

1. A computer-implemented method for detecting confidential information, the method including:
    reading, by at least one processor, stored data;
    identifying, by the at least one processor, strings within the stored data, where each string includes a sequence of consecutive bytes that all have values that are in a predetermined subset of possible values;
    applying, by the at least one processor, a first set of one or more rules to identify one or more format matches based on the strings, wherein each format match includes at least a portion of one of the strings that matches a predetermined format associated with a first type of confidential information;
    for each determined format match, testing, by the at least one processor, the respective format match using a second set of one or more rules associated with the first type of confidential information to determine whether the format match is an invalid format match in which the portion of one of the strings that matches the predetermined format includes one or more invalid values that is or are invalid for the first type of confidential information;
    determining, by the at least one processor, a first count of invalid format matches;
    determining, by the at least one processor, a second count of format matches that do not include invalid values that are invalid for the first type of confidential information, in which the format matches are identified by the first set of one or more rules, and whether the format matches include invalid values is determined by the second set of one or more rules;
    applying, by the at least one processor, a third set of one or more rules to each of the identified strings to determine whether there is a format match in which at least a portion of the string matches a predetermined format associated with a second type of confidential information, and producing a second set of format matches;
    for each string associated with a format match in the second set of format matches, applying, by the at least one processor, a fourth set of one or more rules associated with the second type of confidential information to the string to determine whether the format match is an invalid format match in which the portion of the string matching the predetermined format associated with the second type of confidential information does not include a valid value for the second type of confidential information, and producing a second set of invalid format matches;
    determining, by the at least one processor, a third count of the matches in the second set of invalid format matches for the second type of confidential information;
    determining, by the at least one processor, a fourth count of matches in the second set of format matches for the second type of confidential information that do not include invalid values that are invalid for the second type of confidential information as determined according to the fourth set of one or more rules; and
    flagging, by the at least one processor, the stored data as having a probability to contain confidential information based on, at least in part, a first score calculated as a function of the first count, the second count, the third count, and the fourth count.

2. The method of claim 1, wherein the confidential information is a credit card number.

3. The method of claim 2, wherein a format match is determined to occur when the number of bytes with values representing digits detected in the string is equal to a number of digits in a standard format for credit card numbers.

4. The method of claim 3, wherein the rules associated with credit card numbers include specification of a list of valid issuer identification numbers.

5. The method of claim 3, wherein the rules associated with credit card numbers include specification of a check sum algorithm.

6. The method of claim 1, wherein the confidential information is a social security number.

7. The method of claim 6, wherein a format match is determined to occur when the number of bytes with values representing digits detected in the string is equal to nine.

8. The method of claim 7, wherein the rules associated with social security numbers include specification of a valid subset of values for the number represented by the first five digits of the social security number.

9. The method of claim 1, wherein the confidential information is a telephone number.

10. The method of claim 9, wherein a format match is determined to occur when the number of bytes with values representing digits detected in the string is equal to ten or the number of digits detected in the string is equal to eleven digits with the first digit being “1”.

11. The method of claim 10, wherein the rules associated with telephone numbers include specification of a list of valid area codes.

12. The method of claim 10, wherein the rules associated with telephone numbers include specification that the first digit after the area code must not be a one or a zero.

13. The method of claim 1, wherein the confidential information is a zip code.

14. The method of claim 13, wherein a format match is determined to occur when a sequence of bytes is detected consisting of either five bytes with values representing digits or ten bytes with values representing nine digits with a hyphen between the fifth and sixth digits.

15. The method of claim 14, wherein the rules associated with zip codes include specification of a list of valid five digit zip codes.

16. The method of claim 1, further including: for each string, determining if the string includes one or more words that match a name, wherein a word is sequence of consecutive bytes within a string that all have values representing alpha-numeric characters, and a name is a sequence of characters from a list of such sequences that are commonly used to refer to individual people; and calculating a second score for the stored data, based at least in part upon a count of names detected in the stored data.

17. The method of claim 16, wherein the list of names is divided into two subsets: first names and last names.

18. The method of claim 17, further including: for each string, determining if the string includes one or more full names, wherein full names are sequences of characters consisting of a name from the list of first names followed by space and followed by a name from the list of last names; and calculating a third score for the stored data, based at least in part upon a count of full names detected.

19. The method of claim 16, wherein each of the names in the list is associated with a frequency count and an average frequency count for the names occurring in the stored data is calculated and the second score for the stored data is calculated based at least in part upon the average frequency count.

20. The method of claim 19, wherein the average frequency count is disregarded if the number of names detected in the stored data is less than a threshold.

21. The method of claim 1, further including: for each string counting the number of words consisting of two letters, wherein a word is sequence of consecutive bytes within a string that all have values representing alpha-numeric characters.

22. The method of claim 21, further including: for each two letter word, determining if the two letter word is a valid state abbreviation; and calculating a second score for the stored data based at least in part upon a count of valid state abbreviations and a count of two letter words.
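
Some of the dependent-claim rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation. Claim 5 does not name its check sum algorithm, so the Luhn algorithm is used here as an assumed example of one. The phone and zip helpers follow the digit-count and hyphen rules of claims 10, 12, and 14. All function names are hypothetical.

```python
import re
from typing import Optional


def digits_of(text: str) -> str:
    """Keep only the ASCII digit characters of a string."""
    return "".join(ch for ch in text if "0" <= ch <= "9")


def luhn_valid(number: str) -> bool:
    """Luhn checksum (assumed stand-in for the check sum of claim 5)."""
    total = 0
    for index, ch in enumerate(reversed(number)):
        digit = int(ch)
        if index % 2 == 1:        # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def phone_format_match(text: str) -> Optional[str]:
    """Claim 10: ten digits, or eleven digits starting with 1. Returns ten digits."""
    digits = digits_of(text)
    if len(digits) == 10:
        return digits
    if len(digits) == 11 and digits[0] == "1":
        return digits[1:]
    return None


def phone_is_invalid(ten_digits: str) -> bool:
    """Claim 12: the first digit after the area code must not be a one or a zero."""
    return ten_digits[3] in "01"


# Claim 14: five digits, or nine digits with a hyphen after the fifth digit.
ZIP_FORMAT = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


def zip_format_match(text: str) -> bool:
    return ZIP_FORMAT.fullmatch(text) is not None


print(luhn_valid("4111111111111111"))                           # True
print(phone_is_invalid(phone_format_match("1-617-555-0123")))   # False
print(zip_format_match("02139-4307"))                           # True
```

Lists of valid issuer identification numbers, area codes, and zip codes (claims 4, 11, and 15) would be additional lookups applied to the same matches. They are omitted here because the page does not reproduce those lists.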

Assignees

Inventors

Classifications

  • G06F16/334 (Primary)

    Query execution (filtering based on additional data G06F16/335) · CPC title

  • involving long-term monitoring or reporting · CPC title

  • Administration; Management · CPC title

  • Physics · mapped topic

Frequently asked questions

What does patent US9569528B2 cover?
Among other aspects disclosed are a method and system for detecting confidential information. The method includes reading stored data and identifying strings within the stored data, where each string includes a sequence of consecutive bytes which all have values that are in a predetermined subset of possible values. For each of at least some of the strings, determining if the string includes by…
Who is the assignee on this patent?
Fournier David, Ab Initio Technology LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/334. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 14, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).