Systems and methods for searching unstructured documents for structured data

US9971809B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9971809-B1
Application number  US-201514868334-A
Country             US
Kind code           B1
Filing date         Sep 28, 2015
Priority date       Sep 28, 2015
Publication date    May 15, 2018
Grant date          May 15, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed computer-implemented method for searching unstructured documents for structured data may include (1) receiving a request to search unstructured documents for a document that contains data (e.g., sensitive data) from a structured dataset, (2) generating a secure search index (e.g., a Bloom filter) for searching the unstructured documents for the sensitive data, (3) extracting a first token and a second token from an unstructured document, (4) generating a hashed key from the first token and the second token, (5) querying the secure search index to determine whether the second hashed key is contained in the secure search index, and (6) responding, upon determining that the second hashed key is contained in the secure search index, to the request with information about the unstructured document. Various other methods, systems, and computer-readable media are also disclosed.
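The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `BloomFilter` class, its size and hash count, the `hashed_key` normalization and separator, and the field names `ssn` and `last_name` are all assumptions made for the example. The patent text only says that a hashed key is built from two values and stored in a secure search index such as a Bloom filter.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; num_bits and num_hashes are illustrative choices."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive one bit position per hash by prefixing the key with a counter byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def hashed_key(first: str, second: str) -> bytes:
    # Assumed scheme: normalize both values, join them with a separator,
    # and hash the combined intermediate value with SHA-256.
    intermediate = f"{first.strip().lower()}\x1f{second.strip().lower()}"
    return hashlib.sha256(intermediate.encode("utf-8")).digest()


def build_index(records, first_field: str, second_field: str) -> BloomFilter:
    # Step (2): add one hashed key per record of the structured dataset.
    index = BloomFilter()
    for record in records:
        index.add(hashed_key(record[first_field], record[second_field]))
    return index


def document_matches(index: BloomFilter, tokens) -> bool:
    # Steps (3)-(5): try every ordered pair of tokens from the document
    # and query the index with the hashed key of each pair.
    return any(
        hashed_key(a, b) in index
        for i, a in enumerate(tokens)
        for j, b in enumerate(tokens)
        if i != j
    )


records = [{"ssn": "123-45-6789", "last_name": "Doe"}]
index = build_index(records, "ssn", "last_name")
print(document_matches(index, "Employee Doe has SSN 123-45-6789 on file".split()))
```

Because only hashed keys reach the index, the index can be queried without holding the raw dataset values. A Bloom filter can report false positives but never false negatives, so a hit is best treated as a candidate match. Trying every token pair is quadratic in document length; the dependent claims narrow the candidates with known patterns and a distance limit.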

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for searching unstructured documents for structured data, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising: receiving a request to search unstructured documents for a document that contains: a value from a first field of a dataset; and a value from a second field of the dataset; generating a secure search index for searching the unstructured documents by, for each record in the dataset: identifying, within the dataset, the record's value from the first field and the record's value from the second field; generating a first hashed key from the record's value from the first field and the record's value from the second field; adding the first hashed key to the secure search index; extracting a first token and a second token from an unstructured document; generating a second hashed key from the first token and the second token; querying the secure search index to determine whether the second hashed key is contained in the secure search index; responding, upon determining that the second hashed key is contained in the secure search index, to the request with information about the unstructured document.

2. The computer-implemented method of claim 1, wherein: values from the first field follow a known pattern; the request for the document specifies that the value from the first field is required to be within a specified distance from the value from the second field; extracting the first token and the second token from the unstructured document comprises: using the known pattern to identify the first token within the unstructured document; identifying the second token within the specified distance from the first token.

3. The computer-implemented method of claim 1, wherein: receiving the request to search unstructured documents for the document comprises receiving a request to search unstructured documents for a document that contains: the value from the first field of the dataset; the value from the second field of the dataset; and a value from a third field of the dataset, wherein: values from the first field follow a known pattern; values from the second field and values from the third field do not follow a known pattern; the computer-implemented method further comprises: generating an additional secure search index by, for each record in the dataset: identifying, within the dataset, the record's value from the first field and the record's value from the third field; generating a third hashed key from the record's value from the first field and the record's value from the third field; adding the third hashed key to the additional secure search index; extracting a third token from the unstructured document; generating a fourth hashed key from the first token and the third token; querying the additional secure search index to determine whether the fourth hashed key is contained in the additional secure search index; responding to the request with information about the unstructured document occurs upon determining that the fourth hashed key is contained in the additional secure search index.

4. The computer-implemented method of claim 1, wherein: receiving the request to search unstructured documents for the document comprises receiving a request to search unstructured documents for a document that contains: the value from the first field of the dataset; the value from the second field of the dataset; and a value from a third field of the dataset, wherein: values from the first field follow a first known pattern; values from the third field follow a second known pattern; values from the second field do not follow a known pattern; the computer-implemented method further comprises: generating an additional secure search index by, for each record in the dataset: identifying, within the dataset, the record's value from the second field and the record's value from the third field; generating a third hashed key from the record's value from the second field and the record's value from the third field; adding the third hashed key to the additional secure search index; extracting a third token from the unstructured document; generating a fourth hashed key from the second token and the third token; querying the additional secure search index to determine whether the fourth hashed key is contained in the additional secure search index; responding to the request with information about the unstructured document occurs upon determining that the fourth hashed key is contained in the additional secure search index.

5. The computer-implemented method of claim 1, wherein: the first hashed key is generated from the record's value from the first field, the record's value from the second field, and a cryptographic key; the second hashed key is generated from the first token, the second token, and the cryptographic key.

6. The computer-implemented method of claim 1, wherein the secure search index comprises a Bloom filter.

7. The computer-implemented method of claim 1, wherein generating the first hashed key comprises: generating an intermediate value from a combination of the record's value from the first field and the record's value from the second field; hashing the intermediate value to produce the hashed key.

8. The computer-implemented method of claim 1, wherein: the step of generating the secure search index for searching the unstructured documents is performed at a server-side computing device; the steps of extracting the first token and the second token, generating the second hashed key, and querying the secure search index are performed at a client-side computing device to which the secure search index has been distributed.

9. The computer-implemented method of claim 1, wherein: values from the first field follow a known pattern; extracting the first token from the unstructured document comprises using a regular expression based on the known pattern to identify the first token within the unstructured document.

10. The computer-implemented method of claim 1, wherein: at least the first field of the dataset comprises sensitive data; the first field of the dataset comprises at least one of: social security numbers; account numbers; credit card numbers.

11. A system for searching unstructured documents for structured data, the system comprising: a receiving module, stored in memory, that receives a request to search unstructured documents for a document that contains: a value from a first field of a dataset; and a value from a second field of the dataset; an index-generating module, stored in memory, that generates a secure search index for searching the unstructured documents by, for each record in the dataset: identifying, within the dataset, the record's value from the first field and the record's value from the second field; generating a first hashed key from the record's value from the first field and the record's value from the second field; adding the first hashed key to the secure search index; an extracting module, stored in memory, that extracts a first token and a second token from an unstructured document; a key-generating module, stored in memory, that generates a second hashed key from the first token and the second token; a querying module, stored in memory, that queries the secure search index to determine whether the second hashed key is contained in the secure search index; a responding module, stored in memory, that responds, upon determining that the second hashed key is contained in the secure search index, to the request with informa
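Claims 2, 5, 7, and 9 add a known pattern, a distance limit, an intermediate value, and a cryptographic key to the basic method. The sketch below shows one way these could fit together, under several assumptions the claims do not fix: the first field is a social security number matched by `SSN_PATTERN`, distance is counted in tokens, the keyed hash is HMAC-SHA-256, and a plain Python `set` stands in for the secure search index. All function names are hypothetical.

```python
import hashlib
import hmac
import re

# Assumed known pattern for the first field (US social security numbers).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Tokens are either pattern matches or runs of letters (an assumed tokenizer).
TOKEN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|[A-Za-z]+")


def keyed_hash(first: str, second: str, secret: bytes) -> bytes:
    # Claim 7: combine both values into an intermediate value, then hash it.
    # Claim 5: mix in a cryptographic key (HMAC is an assumed choice).
    intermediate = f"{first}\x1f{second.lower()}".encode("utf-8")
    return hmac.new(secret, intermediate, hashlib.sha256).digest()


def candidate_pairs(text: str, distance: int):
    """Yield (first_token, second_token) pairs within `distance` tokens."""
    tokens = TOKEN_PATTERN.findall(text)
    for i, token in enumerate(tokens):
        # Claim 9: a regular expression identifies the first token.
        if not SSN_PATTERN.fullmatch(token):
            continue
        # Claim 2: only tokens near the first token are candidates.
        start = max(0, i - distance)
        neighbors = tokens[start:i] + tokens[i + 1 : i + 1 + distance]
        for neighbor in neighbors:
            if not SSN_PATTERN.fullmatch(neighbor):
                yield token, neighbor


def build_keyed_index(records, secret: bytes) -> set:
    # A set stands in for the secure search index in this sketch.
    return {keyed_hash(first, second, secret) for first, second in records}


def search(text: str, index: set, secret: bytes, distance: int = 3) -> bool:
    return any(
        keyed_hash(first, second, secret) in index
        for first, second in candidate_pairs(text, distance)
    )


secret = b"demo-key"  # hypothetical key shared by index builder and searcher
index = build_keyed_index([("123-45-6789", "Doe")], secret)
text = "Payroll note: Jane Doe, SSN 123-45-6789, starts Monday."
print(search(text, index, secret, distance=2))
```

Anchoring on the patterned field means only a handful of pairs per match need hashing, instead of every pair of tokens in the document. With a keyed hash, someone holding the index but not the key cannot test guessed values against it.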

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9971809B1 cover?
The disclosed computer-implemented method for searching unstructured documents for structured data may include (1) receiving a request to search unstructured documents for a document that contains data (e.g., sensitive data) from a structured dataset, (2) generating a secure search index (e.g., a Bloom filter) for searching the unstructured documents for the sensitive data, (3) extracting a fir…
Who is the assignee on this patent?
Symantec Corp
What technology area does this patent fall under?
Primary CPC classification G06F17/30477. Mapped technology areas include Physics.
When was this patent published?
Publication date May 15, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).