Masking sensitive information in a document

US12088718B2 · US · B2

Patent metadata
Publication number: US-12088718-B2
Application number: US-202017073436-A
Country: US
Kind code: B2
Filing date: Oct 19, 2020
Priority date: Oct 19, 2020
Publication date: Sep 10, 2024
Grant date: Sep 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The exemplary embodiments disclose a method, a computer program product, and a computer system for protecting sensitive information. The exemplary embodiments may include using an inverted text index for evaluating one or more statistical measures of an index token of the inverted text index, using the one or more statistical measures for selecting a set of candidate tokens, extracting metadata from the inverted text index, associating the set of candidate tokens with respective token metadata, tokenizing at least one document resulting in one or more document tokens, comparing the one or more document tokens with the set of candidate tokens, selecting a set of document tokens to be masked, selecting at least part of the set of document tokens that comprises sensitive information according to the associated token metadata, masking the at least part of the set of document tokens, and providing one or more masked documents.
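
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the regular-expression tokenizer, the sample documents, and the rule that a token found in at most `max_doc_count` documents is a candidate are all assumptions made for illustration; the abstract only says that statistical measures are used to select candidates.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return (lowercased token, start, end) for each word-like run."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token, _, _ in tokenize(text):
            index[token].add(doc_id)
    return index


def select_candidates(index, max_doc_count):
    """Assumption: tokens seen in few documents are candidate tokens.

    Each candidate is stored with metadata taken from the index:
    a token type and the ids of the documents containing it.
    """
    candidates = {}
    for token, doc_ids in index.items():
        if len(doc_ids) <= max_doc_count:
            candidates[token] = {
                "type": "number" if token.isdigit() else "text",
                "doc_ids": sorted(doc_ids),
            }
    return candidates


def mask_document(text, candidates, sensitive_types=("number",), mask="*"):
    """Mask document tokens that match a candidate whose metadata marks it sensitive."""
    out, last = [], 0
    for token, start, end in tokenize(text):
        meta = candidates.get(token)
        if meta is not None and meta["type"] in sensitive_types:
            out.append(text[last:start])
            out.append(mask * (end - start))
            last = end
    out.append(text[last:])
    return "".join(out)


# Hypothetical sample corpus.
docs = {
    "d1": "Invoice for account 4417 is overdue",
    "d2": "Invoice for account 9920 is paid",
    "d3": "Invoice reminder sent for the account",
}
index = build_inverted_index(docs)
candidates = select_candidates(index, max_doc_count=1)
masked = mask_document(docs["d1"], candidates)
print(masked)
```

In this sketch the word "overdue" is a candidate because it occurs in a single document, but it is left unmasked because its metadata marks it as a text token. That mirrors the two-stage selection in the abstract: a statistical filter first, then a metadata-based decision.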

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method for protecting sensitive information in documents, the method comprising: providing, in a computer database, an inverted text index for a set of documents; evaluating via a processor one or more statistical measures of index tokens, respectively, of the inverted text index, the one or more statistical measures of the respective index token comprising at least one member selected from a group consisting of: one or more of a number of documents of the set of documents containing the index token, a frequency of occurrence of the index token in the set of documents, and a frequency of occurrence of a token type of the index token in the set of documents; selecting, based on the evaluation of the one or more statistical measures and via the processor, a set of candidate tokens that may contain sensitive information, the selecting the set of candidate tokens comprising comparing the one or more statistical measures with a respective predefined threshold; extracting, via the processor, metadata from the inverted text index descriptive of the candidate tokens, respectively, wherein the extracted metadata comprises at least a token type of the index tokens, respectively, and a document identifier of a respective document containing a respective index token; receiving, via the processor, a request of at least one document; tokenizing, via the processor, the requested at least one document, resulting in document tokens; comparing, via the processor, the document tokens with the set of candidate tokens; selecting, via the processor, a set of document tokens to be masked based on the comparison; selecting, via the processor, at least part of the set of document tokens that comprises sensitive information according to the extracted metadata; masking, via the processor, the at least part of the set of document tokens in the at least one document, resulting in one or more masked documents; and providing, via the processor, the one or more masked documents.

2. The method of claim 1, wherein the extracted metadata further comprises topic metadata, the topic metadata comprising at least one of a topic of the set of candidate tokens and a topic of a document containing the set of candidate tokens.

3. The method of claim 2, further comprising: determining a token category of each token of the set of candidate tokens; inputting the token categories to an information governance tool; and receiving as output the topic metadata.

4. The method of claim 1, wherein the token type comprises one or more of a text type and a number type.

5. The method of claim 1, further comprising: storing the set of candidate tokens in association with the extracted metadata in a storage system; evaluating one or more statistical measures of updated index tokens, respectively, of an updated inverted text index; selecting, based on the one or more statistical measures, an updated set of candidate tokens that may contain sensitive information; extracting updated metadata from the updated inverted text index descriptive of the updated candidate tokens, respectively, wherein the extracted updated metadata comprises at least a token type of the updated index tokens, respectively, and a document identifier of a respective document containing an updated index token; storing the updated set of candidate tokens in association with the extracted updated metadata in the storage system to form an updated storage system; and selecting, from the updated storage system, updated document tokens for masking for new document retrieval requests.

6. The method of claim 1, wherein: the selecting the at least part of the set of document tokens that comprises sensitive information according to the extracted metadata comprises running a classifier on the extracted metadata to classify the set of document tokens as sensitive or not sensitive tokens; and the selection is performed based on the classification.

7. The method of claim 1, further comprising: determining a respective domain represented by a content of the requested at least one document, wherein the set of documents represents the determined domain and excludes the requested at least one document.

8. The method of claim 1, wherein the set of documents comprises the requested at least one document.

9. The method of claim 1, wherein the requested at least one document is an unstructured document.

10. A computer program product for protecting sensitive information in documents, the computer program product comprising one or more non-transitory computer-readable storage media and program instructions stored on the one or more non-transitory computer-readable storage media capable of causing a computer to perform a method comprising: providing an inverted text index for a set of documents; evaluating one or more statistical measures of index tokens, respectively, of the inverted text index, the one or more statistical measures of the respective index token comprising at least one member selected from a group consisting of: one or more of a number of documents of the set of documents containing the index token, a frequency of occurrence of the index token in the set of documents, and a frequency of occurrence of a token type of the index token in the set of documents; selecting, based on the evaluation of the one or more statistical measures, a set of candidate tokens that may contain sensitive information, the selecting the set of candidate tokens comprising comparing the one or more statistical measures with a respective predefined threshold; extracting metadata from the inverted text index descriptive of the candidate tokens, respectively, wherein the extracted metadata comprises at least a token type of the index tokens, respectively, and a document identifier of a respective document containing a respective index token; receiving a request of at least one document; tokenizing the requested at least one document, resulting in document tokens; comparing the document tokens with the set of candidate tokens; selecting a set of document tokens to be masked based on the comparison; selecting at least part of the set of document tokens that comprises sensitive information according to the extracted metadata; masking the at least part of the set of document tokens in the at least one document, resulting in one or more masked documents; and providing the one or more masked documents.

11. The computer program product of claim 10, wherein the extracted metadata further comprises topic metadata comprising at least one of a topic of the set of candidate tokens and a topic of a document containing the set of candidate tokens.

12. The computer program product of claim 11, further comprising: determining a token category of each token of the set of candidate tokens; inputting the token categories to an information governance tool; and receiving as output the topic metadata.

13. The computer program product of claim 10, wherein the token type comprises one or more of a text type and a number type.

14. A computer system for protecting sensitive information in documents, the computer system comprising one or more computer processors, one or more computer-readable storage media, and program instructions stored on the one or more of the computer-readable storage media for execution by at least one of the one or more processors to perform a method comprising: providing an inverted text index for a set of documents; evaluating one or more statistical measures of index tokens, respectively, of the inverted text index, the one or more statistical measures of the r
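
The three statistical measures named in claim 1, and the metadata stored with each candidate, can be sketched as follows. This is only one possible reading. The `postings` layout (token to per-document occurrence counts), the function names, the "at or below the threshold" comparison, and the interpretation of "frequency of occurrence of a token type" as the total occurrences of all tokens of that type are assumptions; the claim only requires comparing each measure with a respective predefined threshold.

```python
from collections import Counter


def token_type(token):
    """Classify a token as number type or text type (cf. claim 4)."""
    return "number" if token.isdigit() else "text"


def index_statistics(postings):
    """Compute the three measures for every index token.

    postings: token -> {doc_id: occurrences in that document}
    (a hypothetical in-memory stand-in for an inverted text index).
    """
    # Total occurrences per token type across the whole document set.
    type_freq = Counter()
    for token, per_doc in postings.items():
        type_freq[token_type(token)] += sum(per_doc.values())

    stats = {}
    for token, per_doc in postings.items():
        stats[token] = {
            "doc_count": len(per_doc),               # documents containing the token
            "frequency": sum(per_doc.values()),      # occurrences of the token
            "type_frequency": type_freq[token_type(token)],
        }
    return stats


def candidates_with_metadata(postings, thresholds):
    """Keep tokens whose listed measures are all at or below their thresholds.

    Returns token -> metadata (token type and containing document ids).
    """
    stats = index_statistics(postings)
    selected = {}
    for token, measures in stats.items():
        if all(measures[name] <= limit for name, limit in thresholds.items()):
            selected[token] = {
                "type": token_type(token),
                "doc_ids": sorted(postings[token]),
            }
    return selected


# Hypothetical postings for a three-document set.
postings = {
    "patient": {"d1": 2, "d2": 1, "d3": 1},
    "smith": {"d2": 1},
    "55123": {"d3": 1},
}
stats = index_statistics(postings)
selected = candidates_with_metadata(postings, {"doc_count": 1, "frequency": 1})
print(selected)
```

Passing the thresholds as a dictionary keeps each measure paired with its own limit, which matches the claim's wording of a "respective predefined threshold" and lets a caller use any subset of the three measures.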

Assignees

Inventors

Classifications

  • Creation or modification of classes or clusters · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • Document management systems · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • by anonymising data, e.g. decorrelating personal data from the owner's identification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12088718B2 cover?
The exemplary embodiments disclose a method, a computer program product, and a computer system for protecting sensitive information. The exemplary embodiments may include using an inverted text index for evaluating one or more statistical measures of an index token of the inverted text index, using the one or more statistical measures for selecting a set of candidate tokens, extracting metadata…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 10, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).