Low-memory footprint fingerprinting and indexing for efficiently measuring document similarity and containment

US9514312B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9514312-B1
Application number  US-201414479205-A
Country             US
Kind code           B1
Filing date         Sep 5, 2014
Priority date       Sep 5, 2014
Publication date    Dec 6, 2016
Grant date          Dec 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and system for low-memory footprint fingerprinting and indexing for efficiently measuring document similarity and containment are described. A method may include extracting, by a processor, content from a set of one or more data files. The method may also determine a size of the content and apply a hash function to the content to generate multiple hashes. The method selects a constrained set of the hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size. The method stores the fixed-size fingerprint representative of the content in an endpoint index for at least partial file content matching by an endpoint device. The method may employ a statistical-based optimization to speed-up query time.
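
The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract says only that a "constrained set" of hashes is selected; keeping the numerically smallest hashes is an assumption made here for illustration. The normalization and k-gram steps are borrowed from claim 5, a plain per-k-gram BLAKE2b hash stands in for the rolling hash that claim names, and every name and constant (K, MAX_HASHES, THRESHOLD_SIZE, RATIO) is hypothetical.

```python
import hashlib
import re

# All constants below are illustrative assumptions, not values from the patent.
K = 5                # k-gram length
MAX_HASHES = 8       # size of the fixed-size fingerprint
THRESHOLD_SIZE = 64  # content size (in normalized characters) above which the size is fixed
RATIO = 8            # below the threshold: keep one hash per RATIO characters


def normalize(text: str) -> str:
    """Lowercase the text and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(content: str, k: int = K) -> list[int]:
    """Hash every k-gram of the content to a 64-bit integer."""
    hashes = []
    for i in range(len(content) - k + 1):
        digest = hashlib.blake2b(content[i:i + k].encode(), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def fingerprint(text: str) -> list[int]:
    """Return a fixed-size or size-proportional list of the smallest k-gram hashes."""
    content = normalize(text)
    hashes = sorted(set(kgram_hashes(content)))
    size = len(content)
    if size > THRESHOLD_SIZE:
        quota = MAX_HASHES              # large content: fixed-size fingerprint
    else:
        quota = max(1, size // RATIO)   # small content: quota grows with size
    return hashes[:quota]
```

Because the quota is capped at MAX_HASHES, the memory cost per protected document stays bounded no matter how large the document is, which is the "low-memory footprint" property the title refers to.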

First claim

Opening claim text (preview).

What is claimed is:

1. A method for use in data loss prevention comprising: extracting, by a processor, content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content comprises sensitive data, and wherein the set of one or more data files comprises both sensitive data and non-sensitive data; determining, by the processor, a size of the content; applying, by the processor, a hash function to the content to generate a plurality of hashes; selecting, by the processor, a constrained set of the plurality of hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size; determining, by the processor, a quantity of hashes based on the size of the content when the size of the content is equal to or less than the threshold size, wherein the quantity is proportional to the size of the content and the determined quantity is less than the quantity of the plurality of hashes; selecting, by the processor, the determined quantity of hashes from the plurality of hashes to generate a limited-size fingerprint representative of the content when the size of the content is equal to or less than the threshold size; and storing the fixed-size fingerprint or the limited-size fingerprint in an endpoint index for at least partial file content matching by an endpoint device, wherein the endpoint index comprises at least one fixed-size fingerprint or limited-size fingerprint corresponding to sensitive data, and wherein the endpoint index may be used for querying fingerprints of other data sources with the fingerprints contained in the endpoint index to determine if the other data sources contain content corresponding to the extracted content.

2. The method of claim 1, wherein the endpoint index comprises a mapping of at least some of the selected hashes in the stored fixed-size fingerprint or stored limited-size fingerprint to the set of one or more data files to be protected that also include any of the plurality of fixed-size fingerprints or the limited-sized fingerprints representative of the content.

3. The method of claim 1, further comprising: applying, by the processor, a second hash function to each of the set of one or more data files to generate an exact-file fingerprint for use with a DLP further designed to detect exact matches, wherein the hash function comprises generating exact-file signatures of the set of one or more data files; and storing the exact-file fingerprints in the endpoint index for exact-file matching by the endpoint device.

4. The method of claim 3, wherein the generating the exact-file fingerprints of the set of one or more data files comprises: determining whether the content extracted comprises text data or non-text data; applying a cryptographic hash function to the non-text data; and applying the hash function to the text data.

5. The method of claim 1, further comprising: normalizing the content to generate a plurality of alpha numeric characters; and applying a k-gram operation to the plurality of alpha numeric characters, wherein applying the hash function comprises applying a rolling hash function.

6. The method of claim 1, further comprising: extracting statistical information to model a distribution of the quantity of hashes from the fingerprints stored in the endpoint index; and selecting a query threshold based on the distribution, wherein the query threshold is selected by determining the maximum value of a subset of the smallest hashes of the hashes stored in the endpoint index.

7. A system comprising: a memory; and a processor coupled with the memory, the processor to: extract content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content comprises sensitive data and wherein the set of one or more data files comprises both sensitive data and non-sensitive data; determine a size of the content; apply a hash function to the content to generate a plurality of hashes; select a constrained set of the plurality of hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size; determine a quantity of hashes based on the size of the content when the size of the content is equal to or less than the threshold size, wherein the quantity is proportional to the size of the content and the determined quantity is less than the quantity of the plurality of hashes; select the determined quantity of hashes from the plurality of hashes to generate a limited-size fingerprint representative of the content when the size of the content is equal to or less than the threshold size; and store the fixed-size fingerprint or the limited-size fingerprint in an endpoint index for at least partial file content matching by an endpoint device, wherein the endpoint index comprises at least one fixed-size fingerprint or limited-size fingerprint corresponding to sensitive data and wherein the endpoint index may be used for querying fingerprints of other data sources with the fingerprints contained in the endpoint index to determine if the other data sources contain content corresponding to the extracted content.

8. The system of claim 7, wherein the endpoint index comprises a mapping of at least some of the selected hashes in the stored fixed-size fingerprint or stored limited-size fingerprint to a plurality of data files that also include any of the plurality of fixed-size fingerprints or the limited-sized fingerprints representative of the content.

9. The system of claim 7, wherein the processor is further to: apply a second hash function to each of the set of one or more data files to generate an exact-file fingerprint, wherein the second hash function comprises generating exact-file signatures of the set of one or more data files for use with a DLP further designed to detect exact matches; and store the exact-file fingerprints in the endpoint index for exact-file matching by the endpoint device.

10. The system of claim 9, wherein the processor is further to: determine whether the content extracted comprises text data or non-text data; apply a cryptographic hash function to the non-text data; and apply the hash function to the text data.

11. The system of claim 7, wherein the processor is further to: normalize the content to generate a plurality of alpha numeric characters; and apply a k-gram operation to the plurality of alpha numeric characters; and apply a rolling hash function as the hash function.

12. The system of claim 7, wherein the processor is further to: extract statistical information to model a distribution of the quantity of hashes from the fingerprints stored in the endpoint index; and select a query threshold based on the distribution, wherein the query threshold is selected by determining the maximum value of a subset of the smallest hashes of the hashes stored in the endpoint index.

13. A non-transitory computer readable storage medium including instructions that, when executed by a processor, cause the processor to perform operations comprising: extracting content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content c
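
The endpoint index described in the claims (a mapping from selected hashes to protected files, queried with the hashes of another data source, with a query threshold taken from the smallest stored hashes) might look like the following Python sketch. This is one possible reading of the claim language, not the patented implementation. The class name EndpointIndex, the coverage parameter, and the containment score (the fraction of a protected file's fingerprint found in the query) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class EndpointIndex:
    """Toy in-memory index: hash value -> ids of protected files whose fingerprint has it."""

    def __init__(self) -> None:
        self.postings: dict[int, set[str]] = defaultdict(set)
        self.sizes: dict[str, int] = {}

    def add(self, file_id: str, fingerprint: list[int]) -> None:
        """Store a fingerprint (a list of integer hashes) for one protected file."""
        distinct = set(fingerprint)
        self.sizes[file_id] = len(distinct)
        for h in distinct:
            self.postings[h].add(file_id)

    def query_threshold(self, coverage: float = 1.0) -> int:
        """Maximum value among the smallest `coverage` fraction of stored hashes."""
        stored = sorted(self.postings)
        if not stored:
            return -1  # empty index: nothing can match
        keep = max(1, int(len(stored) * coverage))
        return stored[keep - 1]

    def query(self, hashes: list[int], coverage: float = 1.0) -> dict[str, float]:
        """Score each protected file by the fraction of its fingerprint seen in `hashes`."""
        limit = self.query_threshold(coverage)
        hits: dict[str, int] = defaultdict(int)
        for h in set(hashes):
            if h > limit:
                # Skipped without a lookup. With coverage=1.0 no stored hash
                # is larger, so this loses nothing; lower values trade recall for speed.
                continue
            for file_id in self.postings.get(h, ()):
                hits[file_id] += 1
        return {f: n / self.sizes[f] for f, n in hits.items()}


index = EndpointIndex()
index.add("a.txt", [1, 5, 9, 12])
index.add("b.txt", [5, 20])
scores = index.query([5, 9, 12, 100])
```

In this toy run, the query shares three of the four hashes stored for a.txt and one of the two stored for b.txt, and the hash 100 is discarded by the threshold before any lookup.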

Assignees

Symantec Corp

Classifications

  • Tools and structures for managing or administering access control systems · CPC title

  • Physics · mapped topic

  • G06F21/60 · Primary

    Protecting data · CPC title

  • Physics · mapped topic

  • G06F16/152 · Primary

    using file content signatures, e.g. hash values · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9514312B1 cover?
A method and system for low-memory footprint fingerprinting and indexing for efficiently measuring document similarity and containment are described. A method may include extracting, by a processor, content from a set of one or more data files. The method may also determine a size of the content and apply a hash function to the content to generate multiple hashes. The method selects a constrain…
Who is the assignee on this patent?
Symantec Corp
What technology area does this patent fall under?
Primary CPC classification G06F21/60. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 6, 2016 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).