What technology area does this patent fall under?

Primary CPC classification G06F16/322. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Oct 05 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Probabilistic indexing of textual data

US11138246B1 · US · B1

Patent metadata
Field	Value
Publication number	US-11138246-B1
Application number	US-201615194339-A
Country	US
Kind code	B1
Filing date	Jun 27, 2016
Priority date	Jun 27, 2016
Publication date	Oct 5, 2021
Grant date	Oct 5, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for searching a corpus of textual data using probabilistic data structures are described herein. The corpus of textual data is indexed using the probabilistic data structure on a piece-by-piece basis and the pieces are combined so that the textual data can be searched. The search results are returned, indicating a likelihood that the data item is in the textual data.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method, comprising: receiving a set of log data, the set of log data including textual data; extracting a set of strings from the textual data, based at least in part on a chunk size parameter; processing a subset of strings of the set of strings by at least: normalizing the subset of strings by at least removing a first subset of characters of a set of characters representing data included in the subset of strings to generate a normalized subset of strings; and generating a trie, based at least in part on a trie depth and the chunk size parameter, indexing a first string of the normalized subset of strings by at least inserting into a first node of the trie a first character of the first string and inserting into a set of child nodes of the first node a set of substrings of the first string corresponding to the first character, the set of substrings having a substring length not more than the trie depth; performing at least one merging operation of a set of merging operations to generate a tree of tries, of which the trie is a member, until a false positive metric passes a threshold, the false positive metric indicating a likelihood that a query of the set of log data returns a result that indicates a log entry is a member of the tree of tries despite not being a member, by at least: increasing the trie depth associated with the tree of tries; and performing a union of the set of child nodes of the trie and at least one other set of child nodes of at least one other trie of a set of tries included in the tree of tries; in response to the false positive metric passing the threshold, searching for the log entry in the tree of tries; and reporting information that indicates whether the log entry is in the tree of tries. 2. The computer-implemented method of claim 1 , wherein the chunk size parameter corresponds to a single log entry. 3. The computer-implemented method of claim 1 , wherein as a result of the information indicating that the log entry is in the tree of tries, searching for the log entry in log data indexed by a trie in the tree of tries. 4. The computer-implemented method of claim 1 , wherein the tree of tries includes a plurality of trees of tries, a first tree of tries of the plurality of trees of tries including the set of tries and selected based at least in part on a time interval associated with the first tree of tries. 5. A system, comprising: one or more processors; and memory that stores computer-executable instructions that, as a result of being executed, cause the one or more processors to: extract a subset of textual data of a set of textual data based at least in part on a chunk size parameter; normalize the subset of textual data by at least modifying an alphabet representing data included in the set of textual data to generate a normalized subset of textual data; generate a first probabilistic data structure that indexes the normalized subset of textual data, the first probabilistic data structure generated based at least in part on the normalized subset of textual data and the chunk size parameter by at least inserting a first string of the normalized subset of textual data into a root node and a substring of the first string into a child node; perform at least one merging operation of a set of merging operations to combine a set of first probabilistic data structures of which the first probabilistic data structure is a member into a second probabilistic data structure that indexes the normalized subset of textual data, such that a false positive metric passes a lower bound of a likelihood that a search for a data item returns information that indicates the data item is a member of the set of textual data despite not being a member, by at least: increasing a compression rate of the second probabilistic data structure, where the compression rate is determined based at least in part on a depth parameter associated with the second probabilistic data structure, the depth parameter indicating a substring length of nodes of the second probabilistic data structure, where increasing the substring length causes the compression rate to pass the lower bound by at least reducing the likelihood that the search for the data item returns information that indicates the data item is in the set of textual data despite not being in the set of textual data; and combining the set of first probabilistic data structures by at least merging nodes of the set of first probabilistic data structures based at least in part on the depth parameter; receive a request to search for the data item in the set of textual data; and provide a response to the request based at least in part on a result of searching for the data item in the second probabilistic data structure. 6. The system of claim 5 , wherein the first probabilistic data structure is a trie. 7. The system of claim 5 , wherein the second probabilistic data structure is a tree of the first probabilistic data structure. 8. The system of claim 5 , wherein the set of textual data is a set of log data. 9. The system of claim 5 , wherein the set of textual data is a set of streaming data. 10. The system of claim 5 , wherein the memory further includes instructions that, as a result of being executed, cause the one or more processors to discard the first probabilistic data structure of the set of first probabilistic data structures after combining the first probabilistic data structure into the second probabilistic data structure. 11. The system of claim 10 , wherein the likelihood that the search for the data item returns the information that indicates the data item is within the set of textual data despite not being within the set of textual data is further based at least in part on the false positive metric of the second probabilistic data structure. 12. The system of claim 5 , wherein the second probabilistic data structure includes metadata indicating a frequency count of elements in the second probabilistic data structure, the frequency count based at least in part on a number of occurrences of elements in the set of textual data. 13. A non-transitory computer-readable storage medium storing thereon executable instructions that, as a result of being executed by one or more processors of a computer system, cause the computer system to at least: generate a normalized set of textual data by at least modifying an alphabet corresponding to a set of textual data; generate a tree of probabilistic data structures that indexes the normalized set of textual data by at least inserting into a root node a first string of the normalized set of textual data and into a child node a substring of the first string, where the tree of probabilistic data structures has a first depth determined based at least in part on a first substring length of the first string, members of the tree of probabilistic data structures comprising substrings of the normalized set of textual data of the first substring length; generate an indication that a search string is in the set of textual data by at least searching the tree of probabilistic data structures to determine the search string is in the set of textual data; determine, based at least in part on the indication that the search string is in the set of textual data, a set of locations in the set of textual data of the search string; determine a set of false positive results based at least in part on searching the set of locations in the set of textual data for the search string; reduce the set of false positive results until the set of values passes a threshold indicating a likelihood that a search of the tree of prob

Assignees

Amazon Tech Inc

Inventors

Waugh Robert Mark

Classifications

G06F16/3346
using probabilistic model · CPC title
G06F16/322Primary
Trees · CPC title

Patent family

Related publications grouped by family.

View patent family 77923785

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11138246B1 cover?: Techniques for searching a corpus of textual data using probabilistic data structures are described herein. The corpus of textual data is indexed using the probabilistic data structure on a piece-by-piece basis and the pieces are combined so that the textual data can be searched. The search results are returned, indicating a likelihood that the data item is in the textual data.
Who is the assignee on this patent?: Amazon Tech Inc
What technology area does this patent fall under?: Primary CPC classification G06F16/322. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Oct 05 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

How to read this patent

Abstract

First claim

Assignees

Inventors

Classifications

Patent family

External sources

Related patents

Deletion of elements from a probabilistic data structure

Classifying urls

Online compression for limited sequence length radix tree

Frequently asked questions