Identifying confidential data in a data item by comparing the data item to similar data items from alternative sources

US9489376B2 · US · B2

Patent metadata
Publication number: US-9489376-B2
Application number: US-201313732501-A
Country: US
Kind code: B2
Filing date: Jan 2, 2013
Priority date: Jan 2, 2013
Publication date: Nov 8, 2016
Grant date: Nov 8, 2016

Abstract

Official abstract text for this publication.

A method, apparatus and computer program product to identify confidential information in a document. To examine a document for inclusion of confidential information, the document is compared against documents having similar structure and content from one or more other sources. When comparing documents (of similar structure and content) from different sources, confidential information is then made to stand out by searching for terms (from the sources) that are not shared between or among them. In contrast, common words or terms that are shared across the sources are ignored as likely being non-confidential information; what remains as not shared may then be classified as confidential information and protected accordingly (e.g., by omission, redaction, substitution or the like). Using this technique, non-confidential information may be safely segmented from confidential information in a dynamic, automated manner.
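
The comparison described in the abstract can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names `terms` and `unshared_terms`, the word-level tokenization, and the sample log lines are hypothetical and do not come from the patent.

```python
import re

def terms(text):
    """Return the lowercased word tokens of a document as a set.

    Tokenizing on runs of letters and digits is an assumption made
    for this sketch; the abstract does not prescribe a tokenizer.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unshared_terms(target, others):
    """Terms that appear in `target` but in none of the comparison documents."""
    shared = set().union(*(terms(doc) for doc in others))
    return terms(target) - shared

# Hypothetical log lines of similar structure from three different sources.
target = "Error connecting host acmedb01 as user jsmith"
others = [
    "Error connecting host srv17 as user admin",
    "Error connecting host node4 as user root",
]

print(sorted(unshared_terms(target, others)))  # ['acmedb01', 'jsmith']
```

In this toy input the structural words ("error", "connecting", "host", "as", "user") are shared across sources and drop out, while the host name and user name remain as candidates for protection.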

First claim

Opening claim text (preview).

Having described our invention, what we now claim is as follows:
1. A method of identifying potential confidential information in a data item, the data item associated with a source, comprising: obtaining, from each of a set of alternative sources, a data item of a same type and format as the data item; comparing, using a hardware element, the data item to the data item(s) obtained from the set of alternative sources to identify occurrences of particular pieces of information in the data item, wherein multiple occurrences of a particular piece of information within a data item from each alternative source are treated as a single occurrence; and based on the occurrences of particular pieces of information in the data item and a given sensitivity criteria, and without knowledge that the particular pieces of information are considered by the source to be confidential, segmenting one or more pieces of information in the data item as representing the potential confidential information.
2. The method as described in claim 1 further including highlighting the one or more pieces of information.
3. The method as described in claim 2 further including taking a given action with respect to the one or more pieces of information that have been highlighted.
4. The method as described in claim 3 wherein the given action is one of: removing the piece of information, redacting the piece of information, and substituting non-confidential data for the piece of information.
5. The method as described in claim 3 further including outputting the data item without the one or more pieces of information.
6. The method as described in claim 1 wherein the data item is one of: a document, a report, a file, a log, a message, an email, and a communication.
7. The method as described in claim 1 wherein the given sensitivity criteria is a configurable threshold.
8. Apparatus, comprising: a processor; computer memory holding computer program instructions that when executed by the processor perform a method of identifying potential confidential information in a data item, the data item associated with a source, the method comprising: obtaining, from each of a set of alternative sources, a data item of a same type and format as the data item; comparing the data item to the data item(s) obtained from the set of alternative sources to identify occurrences of particular pieces of information in the data item, wherein multiple occurrences of a particular piece of information within a data item from each alternative source are treated as a single occurrence; and based on the occurrences of particular pieces of information in the data item and a given sensitivity criteria, and without knowledge that the particular pieces of information are considered by the source to be confidential, segmenting one or more pieces of information in the data item as representing the potential confidential information.
9. The apparatus as described in claim 8 wherein the method further includes highlighting the one or more pieces of information.
10. The apparatus as described in claim 9 wherein the method further includes taking a given action with respect to the one or more pieces of information that have been highlighted.
11. The apparatus as described in claim 10 wherein the given action is one of: removing the piece of information, redacting the piece of information, and substituting non-confidential data for the piece of information.
12. The apparatus as described in claim 10 wherein the method further includes outputting the data item without the one or more pieces of information.
13. The apparatus as described in claim 8 wherein the data item is one of: a document, a report, a file, a log, a message, an email, and a communication.
14. The apparatus as described in claim 8 wherein the given sensitivity criteria is a configurable threshold.
15. A computer program product in a non-transitory computer-readable storage medium in a data processing system, the computer program product holding computer program instructions which, when executed by the data processing system, perform a method of identifying potential confidential information in a data item, the data item associated with a source, the method comprising: obtaining, from each of a set of alternative sources, a data item of a same type and format as the data item; comparing the data item to the data item(s) obtained from the set of alternative sources to identify occurrences of particular pieces of information in the data item, wherein multiple occurrences of a particular piece of information within a data item from each alternative source are treated as a single occurrence; and based on the occurrences of particular pieces of information in the data item and a given sensitivity criteria, and without knowledge that the particular pieces of information are considered by the source to be confidential, segmenting one or more pieces of information in the data item as representing the potential confidential information.
16. The computer program product as described in claim 15 wherein the method further includes highlighting the one or more pieces of information.
17. The computer program product as described in claim 16 wherein the method further includes taking a given action with respect to the one or more pieces of information that have been highlighted.
18. The computer program product as described in claim 17 wherein the given action is one of: removing the piece of information, redacting the piece of information, and substituting non-confidential data for the piece of information.
19. The computer program product as described in claim 17 wherein the method further includes outputting the data item without the one or more pieces of information.
20. The computer program product as described in claim 15 wherein the data item is one of: a document, a report, a file, a log, a message, an email, and a communication.
21. The computer program product as described in claim 15 wherein the given sensitivity criteria is a configurable threshold.
22. Apparatus, comprising: a display interface; a processor; computer memory holding computer program instructions executed by the processor to identify potential confidential information in a data item, the data item associated with a source, by (i) comparing the data item to data items of a similar type and format received from a set of alternative sources to identify occurrences of particular pieces of information in the data item, and (ii) based on the occurrences of particular pieces of information in the data item, and without knowledge that the particular pieces of information are considered by the source to be confidential, identifying one or more pieces of information in the data item as representing the potential confidential information, wherein multiple occurrences of a particular piece of information within a data item from at least one particular alternative source are treated as a single occurrence, and (iii) outputting, on the display interface, a representation of the data item with the one or more pieces of information representing potential confidential information highlighted.
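
One way to read claims 1, 4 and 7 together is as a counting procedure, sketched below in Python. Several points are assumptions made for illustration and are not stated in the claims: the names `source_counts`, `segment` and `redact`, the word-level tokenization, the `[REDACTED]` mask, the sample data, and the interpretation of the "sensitivity criteria" as a threshold on how many alternative sources contain a term.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")  # assumed tokenizer; the claims do not specify one

def source_counts(alternative_items):
    """Count, for each term, how many alternative sources contain it.

    Each item is reduced to a set first, so repeats inside one item
    count once, mirroring the single-occurrence rule in claim 1.
    """
    counts = Counter()
    for item in alternative_items:
        counts.update({t.lower() for t in TOKEN.findall(item)})
    return counts

def segment(data_item, alternative_items, threshold=1):
    """Return terms of data_item seen in fewer than `threshold` alternative sources.

    Treating the sensitivity criteria as this threshold is an assumption.
    """
    counts = source_counts(alternative_items)
    return {
        t.lower()
        for t in TOKEN.findall(data_item)
        if counts[t.lower()] < threshold
    }

def redact(data_item, flagged, mask="[REDACTED]"):
    """Replace every flagged term with a mask (one of the actions in claim 4)."""
    return TOKEN.sub(
        lambda m: mask if m.group().lower() in flagged else m.group(),
        data_item,
    )

# Hypothetical data items of the same type from three alternative sources.
alternatives = [
    "login failed for bob on host web1 login failed",
    "login ok for carol on host web2",
    "login failed for dave",
]
item = "login failed for alice on host buildbox7"

flagged = segment(item, alternatives, threshold=1)
print(redact(item, flagged))  # login failed for [REDACTED] on host [REDACTED]
```

Raising `threshold` makes the sketch more aggressive: with `threshold=3`, words such as "failed" and "host" are also flagged, because only two of the three sample sources contain them.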

Assignees

IBM

Inventors

Classifications

  • G06F21/6245

    Protecting personal data, e.g. for financial or medical purposes · CPC title

  • G06F40/289 · Primary

    Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9489376B2 cover?
A method, apparatus and computer program product to identify confidential information in a document. To examine a document for inclusion of confidential information, the document is compared against documents having similar structure and content from one or more other sources. When comparing documents (of similar structure and content) from different sources, confidential information is then ma…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F21/6245. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 8, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).