Deletion of elements from a probabilistic data structure
US-2017068727-A1 · Mar 9, 2017 · US
US11138246B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11138246-B1 |
| Application number | US-201615194339-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 27, 2016 |
| Priority date | Jun 27, 2016 |
| Publication date | Oct 5, 2021 |
| Grant date | Oct 5, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Techniques for searching a corpus of textual data using probabilistic data structures are described herein. The corpus of textual data is indexed using the probabilistic data structure on a piece-by-piece basis and the pieces are combined so that the textual data can be searched. The search results are returned, indicating a likelihood that the data item is in the textual data.
Opening claim text (preview).
What is claimed is: 1. A computer-implemented method, comprising: receiving a set of log data, the set of log data including textual data; extracting a set of strings from the textual data, based at least in part on a chunk size parameter; processing a subset of strings of the set of strings by at least: normalizing the subset of strings by at least removing a first subset of characters of a set of characters representing data included in the subset of strings to generate a normalized subset of strings; and generating a trie, based at least in part on a trie depth and the chunk size parameter, indexing a first string of the normalized subset of strings by at least inserting into a first node of the trie a first character of the first string and inserting into a set of child nodes of the first node a set of substrings of the first string corresponding to the first character, the set of substrings having a substring length not more than the trie depth; performing at least one merging operation of a set of merging operations to generate a tree of tries, of which the trie is a member, until a false positive metric passes a threshold, the false positive metric indicating a likelihood that a query of the set of log data returns a result that indicates a log entry is a member of the tree of tries despite not being a member, by at least: increasing the trie depth associated with the tree of tries; and performing a union of the set of child nodes of the trie and at least one other set of child nodes of at least one other trie of a set of tries included in the tree of tries; in response to the false positive metric passing the threshold, searching for the log entry in the tree of tries; and reporting information that indicates whether the log entry is in the tree of tries. 2. The computer-implemented method of claim 1 , wherein the chunk size parameter corresponds to a single log entry. 3. The computer-implemented method of claim 1 , wherein as a result of the information indicating that the log entry is in the tree of tries, searching for the log entry in log data indexed by a trie in the tree of tries. 4. The computer-implemented method of claim 1 , wherein the tree of tries includes a plurality of trees of tries, a first tree of tries of the plurality of trees of tries including the set of tries and selected based at least in part on a time interval associated with the first tree of tries. 5. A system, comprising: one or more processors; and memory that stores computer-executable instructions that, as a result of being executed, cause the one or more processors to: extract a subset of textual data of a set of textual data based at least in part on a chunk size parameter; normalize the subset of textual data by at least modifying an alphabet representing data included in the set of textual data to generate a normalized subset of textual data; generate a first probabilistic data structure that indexes the normalized subset of textual data, the first probabilistic data structure generated based at least in part on the normalized subset of textual data and the chunk size parameter by at least inserting a first string of the normalized subset of textual data into a root node and a substring of the first string into a child node; perform at least one merging operation of a set of merging operations to combine a set of first probabilistic data structures of which the first probabilistic data structure is a member into a second probabilistic data structure that indexes the normalized subset of textual data, such that a false positive metric passes a lower bound of a likelihood that a search for a data item returns information that indicates the data item is a member of the set of textual data despite not being a member, by at least: increasing a compression rate of the second probabilistic data structure, where the compression rate is determined based at least in part on a depth parameter associated with the second probabilistic data structure, the depth parameter indicating a substring length of nodes of the second probabilistic data structure, where increasing the substring length causes the compression rate to pass the lower bound by at least reducing the likelihood that the search for the data item returns information that indicates the data item is in the set of textual data despite not being in the set of textual data; and combining the set of first probabilistic data structures by at least merging nodes of the set of first probabilistic data structures based at least in part on the depth parameter; receive a request to search for the data item in the set of textual data; and provide a response to the request based at least in part on a result of searching for the data item in the second probabilistic data structure. 6. The system of claim 5 , wherein the first probabilistic data structure is a trie. 7. The system of claim 5 , wherein the second probabilistic data structure is a tree of the first probabilistic data structure. 8. The system of claim 5 , wherein the set of textual data is a set of log data. 9. The system of claim 5 , wherein the set of textual data is a set of streaming data. 10. The system of claim 5 , wherein the memory further includes instructions that, as a result of being executed, cause the one or more processors to discard the first probabilistic data structure of the set of first probabilistic data structures after combining the first probabilistic data structure into the second probabilistic data structure. 11. The system of claim 10 , wherein the likelihood that the search for the data item returns the information that indicates the data item is within the set of textual data despite not being within the set of textual data is further based at least in part on the false positive metric of the second probabilistic data structure. 12. The system of claim 5 , wherein the second probabilistic data structure includes metadata indicating a frequency count of elements in the second probabilistic data structure, the frequency count based at least in part on a number of occurrences of elements in the set of textual data. 13. A non-transitory computer-readable storage medium storing thereon executable instructions that, as a result of being executed by one or more processors of a computer system, cause the computer system to at least: generate a normalized set of textual data by at least modifying an alphabet corresponding to a set of textual data; generate a tree of probabilistic data structures that indexes the normalized set of textual data by at least inserting into a root node a first string of the normalized set of textual data and into a child node a substring of the first string, where the tree of probabilistic data structures has a first depth determined based at least in part on a first substring length of the first string, members of the tree of probabilistic data structures comprising substrings of the normalized set of textual data of the first substring length; generate an indication that a search string is in the set of textual data by at least searching the tree of probabilistic data structures to determine the search string is in the set of textual data; determine, based at least in part on the indication that the search string is in the set of textual data, a set of locations in the set of textual data of the search string; determine a set of false positive results based at least in part on searching the set of locations in the set of textual data for the search string; reduce the set of false positive results until the set of values passes a threshold indicating a likelihood that a search of the tree of prob
using probabilistic model · CPC title
Trees · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.