Multi-strategy compression scheme
US-11861292-B2 · Jan 2, 2024 · US
US9760546B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9760546-B2 |
| Application number | US-201313901736-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 24, 2013 |
| Priority date | May 24, 2013 |
| Publication date | Sep 12, 2017 |
| Grant date | Sep 12, 2017 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A system and method of identifying repeat subsequences having at least a value of x for threshold of different left contexts and a value of y for a threshold of different right contexts for an input sequence are disclosed. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix are representative of occurrence of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.
Opening claim text (preview).
What is claimed is: 1. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet, the sequence of symbols being generated from a collection of at least two documents; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a length, in symbols, of a longest common prefix that occurs in an adjacent pair of suffixes in the lexicographically sorted array; sequentially comparing pairs of first and second longest common prefix values that are sequential in the longest common prefix array and, based on the comparison, identifying at least one of: an occurrence of a same repeat subsequence, when the compared first and second longest common prefix values are the same; a new occurrence of a longer repeat subsequence when the second of the compared longest common prefix values is higher; and a last occurrence of a given repeat subsequence when the second of the compared longest common prefix values is lower; based on the sequential comparisons, identifying context diverse repeat subsequences in the sequence, each of the identified context diverse repeat subsequences having at least the defined threshold number of different left and right contexts; and outputting information based on the identified context diverse repeat subsequences, the output information including at least one of: a set of repeat subsequences which includes the identified context diverse repeat subsequences; a vector spaced representation of a document in the collection, the vector spaced representation including indices, each index representing a number of times a respective one of a set of repeat subsequences appears in the document, the set of repeat subsequences including the identified context diverse repeat subsequences; a label for a document in the collection, the label being based on the vector spaced representation of the document; for each of a set of topics, a set of most probable repeat subsequences from a set of repeat subsequences which includes the identified context diverse repeat subsequences; and a cluster of similar documents in the collection, or a set of documents similar to a selected document, based on a measure of similarity between respective vector spaced representations; wherein at least one of the generating, computing, comparing, and identifying is performed by a computer processor. 2. The method of claim 1 , wherein the sequential comparison comprises: based on the values in the longest common prefix array, constructing a set of left contexts by inserting in the set a symbol appearing in the sequence, which precedes a suffix; based on a pair of sequential values in the longest common prefix array, incrementing a count of right contexts; and outputting context diverse repeat subsequences having a cardinality of the left context set of at least x and a count of at least y. 3. The method of claim 1 , wherein the sequential comparison comprises comparing a pair of sequential values in the longest common prefix array comprising a first and a second value, and if the second value is less than the first value, the left context set and the right context count of an occurrence of a repeat subsequence at a corresponding location in the suffix array are computed, and if the cardinality is greater than value x and the count greater than value y, outputting the occurrence. 4. The method of claim 3 , wherein if the second value is greater than the first value, the method comprises adding an occurrence of a new repeat to a data structure. 5. The method of claim 1 , wherein the symbols correspond to at least one of the group consisting of: single characters of an alphabet that includes letters; words in at least one document in a natural language; and part of speech tags assigned to words at least one document in a natural language. 6. The method of claim 1 , wherein x and y have different values. 7. The method of claim 1 , wherein at least one of x and y is at least 2 and wherein the other of x and y is greater than 2. 8. The method of claim 1 , further comprising computing a value for at least one of x and y based on a training set of sequences of symbols. 9. The method of claim 1 , further comprising providing for a user to select a value for at least one of x and y. 10. The method of claim 1 , further comprising generating a representation of at least one document in a collection of documents from which the sequence is extracted based on occurrences of the identified repeat subsequences. 11. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a length, in symbols, of a longest common prefix that occurs in an adjacent pair of suffixes in the lexicographically sorted array; sequentially comparing pairs of first and second sequential longest common prefix values and, based on the comparison, identifying at least one of: an occurrence of a same repeat subsequence, when the compared first and second longest common prefix values are the same; a new occurrence of a longer repeat subsequence when the second of the compared longest common prefix values is higher; and a last occurrence of a given repeat subsequence when the second of the compared longest common prefix values is lower, wherein the sequential comparison comprises: comparing a pair of sequential values in the longest common prefix array comprising a first and a second value, and if the second value is less than the first value, the left context set and the right context count of an occurrence of a repeat subsequence at a corresponding location in the suffix array are computed; computing the left context set and the right context count for all subsequences of the occurrence which have a length greater than the first longest common prefix value; and for each subsequence, outputting the subsequence if the corresponding cardinality of the left context set is greater than value x and right context count is greater than value y; wherein at least one of the generating, computing, comparing, and identifying is performed by a computer processor. 12. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a le
Related publications grouped by family.
Answers are generated from the same data shown on this page.