Identifying repeat subsequences by left and right contexts

US9760546B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9760546-B2
Application numberUS-201313901736-A
CountryUS
Kind codeB2
Filing dateMay 24, 2013
Priority dateMay 24, 2013
Publication dateSep 12, 2017
Grant dateSep 12, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method of identifying repeat subsequences having at least a value of x for threshold of different left contexts and a value of y for a threshold of different right contexts for an input sequence are disclosed. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicographic order comparing the longest common prefix values between consecutive suffixes. Suffixes with the same longest common prefix are representative of occurrence of the same repeat, a higher longest common prefix indicates a new occurrence of a longer repeat, and a lower longest common prefix indicates the last occurrence of a repeat.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet, the sequence of symbols being generated from a collection of at least two documents; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a length, in symbols, of a longest common prefix that occurs in an adjacent pair of suffixes in the lexicographically sorted array; sequentially comparing pairs of first and second longest common prefix values that are sequential in the longest common prefix array and, based on the comparison, identifying at least one of: an occurrence of a same repeat subsequence, when the compared first and second longest common prefix values are the same; a new occurrence of a longer repeat subsequence when the second of the compared longest common prefix values is higher; and a last occurrence of a given repeat subsequence when the second of the compared longest common prefix values is lower; based on the sequential comparisons, identifying context diverse repeat subsequences in the sequence, each of the identified context diverse repeat subsequences having at least the defined threshold number of different left and right contexts; and outputting information based on the identified context diverse repeat subsequences, the output information including at least one of: a set of repeat subsequences which includes the identified context diverse repeat subsequences; a vector spaced representation of a document in the collection, the vector spaced representation including indices, each index representing a number of times a respective one of a set of repeat subsequences appears in the document, the set of repeat subsequences including the identified context diverse repeat subsequences; a label for a document in the collection, the label being based on the vector spaced representation of the document; for each of a set of topics, a set of most probable repeat subsequences from a set of repeat subsequences which includes the identified context diverse repeat subsequences; and a cluster of similar documents in the collection, or a set of documents similar to a selected document, based on a measure of similarity between respective vector spaced representations; wherein at least one of the generating, computing, comparing, and identifying is performed by a computer processor. 2. The method of claim 1 , wherein the sequential comparison comprises: based on the values in the longest common prefix array, constructing a set of left contexts by inserting in the set a symbol appearing in the sequence, which precedes a suffix; based on a pair of sequential values in the longest common prefix array, incrementing a count of right contexts; and outputting context diverse repeat subsequences having a cardinality of the left context set of at least x and a count of at least y. 3. The method of claim 1 , wherein the sequential comparison comprises comparing a pair of sequential values in the longest common prefix array comprising a first and a second value, and if the second value is less than the first value, the left context set and the right context count of an occurrence of a repeat subsequence at a corresponding location in the suffix array are computed, and if the cardinality is greater than value x and the count greater than value y, outputting the occurrence. 4. The method of claim 3 , wherein if the second value is greater than the first value, the method comprises adding an occurrence of a new repeat to a data structure. 5. The method of claim 1 , wherein the symbols correspond to at least one of the group consisting of: single characters of an alphabet that includes letters; words in at least one document in a natural language; and part of speech tags assigned to words at least one document in a natural language. 6. The method of claim 1 , wherein x and y have different values. 7. The method of claim 1 , wherein at least one of x and y is at least 2 and wherein the other of x and y is greater than 2. 8. The method of claim 1 , further comprising computing a value for at least one of x and y based on a training set of sequences of symbols. 9. The method of claim 1 , further comprising providing for a user to select a value for at least one of x and y. 10. The method of claim 1 , further comprising generating a representation of at least one document in a collection of documents from which the sequence is extracted based on occurrences of the identified repeat subsequences. 11. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a length, in symbols, of a longest common prefix that occurs in an adjacent pair of suffixes in the lexicographically sorted array; sequentially comparing pairs of first and second sequential longest common prefix values and, based on the comparison, identifying at least one of: an occurrence of a same repeat subsequence, when the compared first and second longest common prefix values are the same; a new occurrence of a longer repeat subsequence when the second of the compared longest common prefix values is higher; and a last occurrence of a given repeat subsequence when the second of the compared longest common prefix values is lower, wherein the sequential comparison comprises: comparing a pair of sequential values in the longest common prefix array comprising a first and a second value, and if the second value is less than the first value, the left context set and the right context count of an occurrence of a repeat subsequence at a corresponding location in the suffix array are computed; computing the left context set and the right context count for all subsequences of the occurrence which have a length greater than the first longest common prefix value; and for each subsequence, outputting the subsequence if the corresponding cardinality of the left context set is greater than value x and right context count is greater than value y; wherein at least one of the generating, computing, comparing, and identifying is performed by a computer processor. 12. A method of identifying repeat subsequences of symbols in a sequence of symbols comprising: receiving a sequence of symbols drawn from an alphabet; defining values of x and y, where x corresponds to a threshold number of different left contexts for a given repeat subsequence in the sequence and y corresponds to a threshold number of different right contexts for a given repeat subsequence in the sequence to be identified as context diverse; generating a lexicographically sorted array of suffixes in the sequence; computing a longest common prefix array, each value in the longest common prefix array representing a le

Assignees

Inventors

Classifications

  • G06F40/12Primary

    Use of codes for handling textual entities · CPC title

  • Converting codes to words; Guess-ahead of partial word inputs · CPC title

  • G06F17/22Primary

    Physics · mapped topic

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9760546B2 cover?
A system and method of identifying repeat subsequences having at least a value of x for threshold of different left contexts and a value of y for a threshold of different right contexts for an input sequence are disclosed. The method may include generating a lexicographically sorted suffix array for the input sequence and a longest common prefix array. The suffix array is traversed in lexicogra…
Who is the assignee on this patent?
Xerox Corp
What technology area does this patent fall under?
Primary CPC classification G06F40/12. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Sep 12 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).