US9514312B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9514312-B1 |
| Application number | US-201414479205-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 5, 2014 |
| Priority date | Sep 5, 2014 |
| Publication date | Dec 6, 2016 |
| Grant date | Dec 6, 2016 |
Official abstract text for this publication.
A method and system for low-memory footprint fingerprinting and indexing for efficiently measuring document similarity and containment are described. A method may include extracting, by a processor, content from a set of one or more data files. The method may also determine a size of the content and apply a hash function to the content to generate multiple hashes. The method selects a constrained set of the hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size. The method stores the fixed-size fingerprint representative of the content in an endpoint index for at least partial file content matching by an endpoint device. The method may employ a statistical-based optimization to speed up query time.
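The fingerprinting steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the constants `K`, `FIXED_SIZE`, and `THRESHOLD`, the polynomial rolling hash, and the choice to keep the smallest hash values are all assumptions made for illustration. The abstract only says that a constrained set of hashes is selected.

```python
import re

# All constants below are illustrative assumptions, not values from the patent.
K = 5                            # k-gram length
FIXED_SIZE = 8                   # hashes kept for content above the threshold
THRESHOLD = 200                  # size threshold, in normalized characters
BASE, MOD = 257, (1 << 61) - 1   # parameters of the polynomial rolling hash


def normalize(text):
    """Lowercase the text and keep only alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def rolling_hashes(s, k=K):
    """Return the polynomial rolling hash of every k-gram of s."""
    if len(s) < k:
        return []
    high = pow(BASE, k - 1, MOD)          # weight of the outgoing character
    h = 0
    for ch in s[:k]:
        h = (h * BASE + ord(ch)) % MOD
    out = [h]
    for i in range(k, len(s)):
        # Drop the leftmost character, shift, and append the new one.
        h = ((h - ord(s[i - k]) * high) * BASE + ord(s[i])) % MOD
        out.append(h)
    return out


def fingerprint(text):
    """Fixed-size fingerprint for large content, size-proportional otherwise."""
    s = normalize(text)
    hashes = sorted(set(rolling_hashes(s)))
    if len(s) > THRESHOLD:
        quota = FIXED_SIZE
    else:
        # Quantity proportional to content size, capped at FIXED_SIZE.
        quota = max(1, len(s) * FIXED_SIZE // THRESHOLD)
    # Assumption: keep the smallest hash values (a bottom-k style selection).
    return hashes[:quota]
```

Under these assumptions, a long document always yields `FIXED_SIZE` hashes and a short one yields fewer. The sketch does not enforce the claim's condition that the selected quantity be strictly less than the total number of hashes for very small inputs.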
Opening claim text (preview).
What is claimed is:
1. A method for use in data loss prevention comprising: extracting, by a processor, content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content comprises sensitive data, and wherein the set of one or more data files comprises both sensitive data and non-sensitive data; determining, by the processor, a size of the content; applying, by the processor, a hash function to the content to generate a plurality of hashes; selecting, by the processor, a constrained set of the plurality of hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size; determining, by the processor, a quantity of hashes based on the size of the content when the size of the content is equal to or less than the threshold size, wherein the quantity is proportional to the size of the content and the determined quantity is less than the quantity of the plurality of hashes; selecting, by the processor, the determined quantity of hashes from the plurality of hashes to generate a limited-size fingerprint representative of the content when the size of the content is equal to or less than the threshold size; and storing the fixed-size fingerprint or the limited-size fingerprint in an endpoint index for at least partial file content matching by an endpoint device, wherein the endpoint index comprises at least one fixed-size fingerprint or limited-size fingerprint corresponding to sensitive data, and wherein the endpoint index may be used for querying fingerprints of other data sources with the fingerprints contained in the endpoint index to determine if the other data sources contain content corresponding to the extracted content.
2. The method of claim 1, wherein the endpoint index comprises a mapping of at least some of the selected hashes in the stored fixed-size fingerprint or stored limited-size fingerprint to the set of one or more data files to be protected that also include any of the plurality of fixed-size fingerprints or the limited-sized fingerprints representative of the content.
3. The method of claim 1, further comprising: applying, by the processor, a second hash function to each of the set of one or more data files to generate an exact-file fingerprint for use with a DLP further designed to detect exact matches, wherein the hash function comprises generating exact-file signatures of the set of one or more data files; and storing the exact-file fingerprints in the endpoint index for exact-file matching by the endpoint device.
4. The method of claim 3, wherein the generating the exact-file fingerprints of the set of one or more data files comprises: determining whether the content extracted comprises text data or non-text data; applying a cryptographic hash function to the non-text data; and applying the hash function to the text data.
5. The method of claim 1, further comprising: normalizing the content to generate a plurality of alpha numeric characters; and applying a k-gram operation to the plurality of alpha numeric characters, wherein applying the hash function comprises applying a rolling hash function.
6. The method of claim 1, further comprising: extracting statistical information to model a distribution of the quantity of hashes from the fingerprints stored in the endpoint index; and selecting a query threshold based on the distribution, wherein the query threshold is selected by determining the maximum value of a subset of the smallest hashes of the hashes stored in the endpoint index.
7. A system comprising: a memory; and a processor coupled with the memory, the processor to: extract content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content comprises sensitive data and wherein the set of one or more data files comprises both sensitive data and non-sensitive data; determine a size of the content; apply a hash function to the content to generate a plurality of hashes; select a constrained set of the plurality of hashes to generate a fixed-size fingerprint representative of the content when the size of the content is greater than a threshold size; determine a quantity of hashes based on the size of the content when the size of the content is equal to or less than the threshold size, wherein the quantity is proportional to the size of the content and the determined quantity is less than the quantity of the plurality of hashes; select the determined quantity of hashes from the plurality of hashes to generate a limited-size fingerprint representative of the content when the size of the content is equal to or less than the threshold size; and store the fixed-size fingerprint or the limited-size fingerprint in an endpoint index for at least partial file content matching by an endpoint device, wherein the endpoint index comprises at least one fixed-size fingerprint or limited-size fingerprint corresponding to sensitive data and wherein the endpoint index may be used for querying fingerprints of other data sources with the fingerprints contained in the endpoint index to determine if the other data sources contain content corresponding to the extracted content.
8. The system of claim 7, wherein the endpoint index comprises a mapping of at least some of the selected hashes in the stored fixed-size fingerprint or stored limited-size fingerprint to a plurality of data files that also include any of the plurality of fixed-size fingerprints or the limited-sized fingerprints representative of the content.
9. The system of claim 7, wherein the processor is further to: apply a second hash function to each of the set of one or more data files to generate an exact-file fingerprint, wherein the second hash function comprises generating exact-file signatures of the set of one or more data files for use with a DLP further designed to detect exact matches; and store the exact-file fingerprints in the endpoint index for exact-file matching by the endpoint device.
10. The system of claim 9, wherein the processor is further to: determine whether the content extracted comprises text data or non-text data; apply a cryptographic hash function to the non-text data; and apply the hash function to the text data.
11. The system of claim 7, wherein the processor is further to: normalize the content to generate a plurality of alpha numeric characters; and apply a k-gram operation to the plurality of alpha numeric characters; and apply a rolling hash function as the hash function.
12. The system of claim 7, wherein the processor is further to: extract statistical information to model a distribution of the quantity of hashes from the fingerprints stored in the endpoint index; and select a query threshold based on the distribution, wherein the query threshold is selected by determining the maximum value of a subset of the smallest hashes of the hashes stored in the endpoint index.
13. A non-transitory computer readable storage medium including instructions that, when executed by a processor, cause the processor to perform operations comprising: extracting content from a set of one or more data files to be protected by a data loss prevention (DLP) policy designed to detect resemblance or containment relationships between two files even when their contents are not exact matches, wherein the content c
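The endpoint index of claims 2 and 6 can be pictured as a mapping from stored hashes to the protected files whose fingerprints contain them. The sketch below is one hypothetical layout, not the structure disclosed in the patent: the class name `EndpointIndex`, the overlap score, the `min_overlap` cutoff, and the way `max_hash` is used to skip lookups are all assumptions. The claims only state that a query threshold is chosen as the maximum of a subset of the smallest stored hashes.

```python
from collections import defaultdict


class EndpointIndex:
    """Hypothetical inverted index: hash value -> ids of protected files."""

    def __init__(self):
        self.postings = defaultdict(set)   # hash -> set of file ids
        self.sizes = {}                     # file id -> fingerprint length

    def add(self, file_id, fingerprint):
        """Store one protected file's fingerprint (a list of hashes)."""
        self.sizes[file_id] = len(fingerprint)
        for h in fingerprint:
            self.postings[h].add(file_id)

    def query_threshold(self, subset_size):
        """Maximum value among the subset_size smallest stored hashes."""
        smallest = sorted(self.postings)[:subset_size]
        return smallest[-1] if smallest else None

    def query(self, hashes, min_overlap=0.5, max_hash=None):
        """Return {file_id: fraction of its fingerprint found in hashes}.

        Assumption: when max_hash is given, query hashes above it are
        skipped before lookup to reduce work.
        """
        counts = defaultdict(int)
        for h in set(hashes):
            if max_hash is not None and h > max_hash:
                continue
            for file_id in self.postings.get(h, ()):
                counts[file_id] += 1
        scores = {f: c / self.sizes[f] for f, c in counts.items()}
        return {f: s for f, s in scores.items() if s >= min_overlap}
```

A caller would add the fingerprint of each protected file once, then pass the hashes of an outgoing file to `query`; a high score suggests the outgoing file contains much of the protected content.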
Tools and structures for managing or administering access control systems · CPC title
Physics · mapped topic
Protecting data · CPC title
using file content signatures, e.g. hash values · CPC title
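The file content signatures named in the CPC class above also appear in claims 3 and 4, which add an exact-file fingerprint and treat text and non-text content differently. A minimal sketch of that dispatch follows, assuming SHA-256 as the cryptographic hash and a simple UTF-8 heuristic for detecting text; neither choice is specified in the source, and the function names are hypothetical.

```python
import hashlib


def is_text(data: bytes) -> bool:
    """Heuristic (assumed): text decodes as UTF-8 and has no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def exact_signature(data: bytes) -> str:
    """Exact-file signature; SHA-256 stands in for the cryptographic hash."""
    return hashlib.sha256(data).hexdigest()


def route(data: bytes):
    """Return ("non-text", signature) or ("text", decoded content).

    Text content would go on to a similarity fingerprinting step;
    non-text content is only matched exactly in this sketch.
    """
    if is_text(data):
        return ("text", data.decode("utf-8"))
    return ("non-text", exact_signature(data))
```

An exact-match check then reduces to a set lookup on the stored signatures, which is cheap enough for an endpoint device.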