Generation of textual documents with reduced de Bruijn graphs

US9740679B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9740679-B2
Application numberUS-201514962506-A
CountryUS
Kind codeB2
Filing dateDec 8, 2015
Priority dateDec 8, 2015
Publication dateAug 22, 2017
Grant dateAug 22, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for generating an output sequence includes receiving an input sequence of symbols. An output sequence is generated from a reduced directed graph derived from n-gram statistics for a corpus sequence of symbols. The graph includes nodes connected by edges that are labeled with a sequence of symbols and associated with a multiplicity representing a number of occurrences of the sequence of symbols in the corpus sequence. Each path through the graph where each edge is traversed its multiplicity of times reconstructs the corpus sequence. The sequences of symbols in the reduced graph vary in number of symbols. The output sequence from the first iteration, and optionally also an output sequence from at least one subsequent iteration, is/are output. The output sequence may be proposed to an author to assist in generating a document.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for generating an output sequence comprising: providing a reduced directed graph derived from n-gram statistics for a corpus sequence of symbols, the reduced directed graph including a set of nodes connected by edges, each of the edges being labeled with a respective label consisting of a respective sequence of symbols, each of the edges being associated with a multiplicity representing a number of occurrences of the respective sequence of symbols in the corpus sequence, and wherein the sequences of symbols labeling the edges in the reduced directed graph vary in number of symbols, whereby at least one Eulerian cycle through the reduced directed graph traverses each edge a number of times corresponding to the multiplicity of the edge; receiving an input sequence of symbols; with a processor, for a first iteration, generating a first output sequence as a candidate for appending to the input sequence, comprising: identifying a set of the labels of the edges in the reduced directed graph, the identified set of labels each including a prefix and a suffix, the prefix including at least a last symbol of the input sequence, the suffix comprising at least one subsequent symbol to the prefix in the label; sampling a label from the identified set of labels with a probability which is a function of the multiplicity of the edges that are labeled with the labels in the set of labels, and generating the first output sequence from the suffix of the sampled label; outputting, to a graphical user interface, a candidate sequence for appending to the input sequence, comprising outputting at least one of: the first output sequence from the first iteration; and an extended output sequence, the extended output sequence including the first output sequence from the first iteration and a respective subsequent output sequence from each of at least one subsequent iteration appended thereto, wherein in each of the at least one subsequent iteration, the subsequent output sequence is generated from the reduced directed graph as a candidate for appending to an extended input sequence, the extended input sequence for the subsequent iteration being a document sequence generated by appending the first output sequence and each subsequent output sequence of each prior subsequent iteration to the received input sequence. 2. The method of claim 1 , wherein each symbol is a word and each n-gram is a sequence of n symbols, where n is at least two. 3. The method of claim 1 , wherein the method includes the first iteration and at least one subsequent iterations and wherein the extended output sequence includes the first output sequence from the first iteration and the subsequent output sequence from each of the at least one subsequent iteration. 4. The method of claim 1 , wherein the method further comprises generating the reduced directed graph from an input directed graph by application of at least one reduction rule to the input directed graph, the input directed graph being generated from the n-gram statistics. 5. The method of claim 4 , wherein the input directed graph includes a set of nodes connected by edges, each of the edges being labeled with a sequence of symbols and an associated multiplicity representing a number of occurrences of the sequence of symbols in the corpus sequence, each path through the input directed graph in which each edge is traversed the respective multiplicity of times reconstructing the corpus sequence from the labels in the order in which the edges are traversed. 6. The method of claim 5 , wherein in the input directed graph, each label consists of an n-gram, each of the n-grams of the labels having a same number of symbols. 7. The method of claim 4 , wherein the reduced directed graph is generated from the input directed graph by iteratively applying at least one reduction rule to generate a reduced directed graph, the at least one reduction rule comprising a reduction rule selected from: a first reduction rule configured for identifying a division point node of the input directed graph, or of a reduced directed graph generated therefrom, which divides the respective graph into at least two connected components and wherein there is a unique incoming edge to the division point node in the first connected component and a unique outgoing edge from the division point node in the second connected component, the first reduction rule configured for generating a new edge with a label which is derived from the labels of the unique incoming edge and the unique outgoing edge; and a second reduction rule configured for identifying a node of the input directed graph or of a reduced directed graph generated therefrom which includes a first incoming edge and a first outgoing edge for which the multiplicity of one of the first incoming edge and the first outgoing edge is greater than a degree of the node minus the multiplicity of the other of the first incoming edge and the first outgoing edge, where the degree of the node is a sum of incoming edges of the node multiplied by their multiplicities, the second reduction rule configured for merging the first incoming edge and the first outgoing edge to create a new edge with a label which is derived from the first incoming edge and the first outgoing edge. 8. The method of claim 7 , wherein the at least one reduction rule comprises the first reduction rule and the second reduction rule. 9. The method of claim 1 , wherein the generating of the extended output sequence comprises, for the at least one subsequent iteration: identifying a set of the labels for the edges in the reduced directed graph, which each include, as a prefix, at least the last symbol of the document sequence generated in the prior iteration, and as a suffix, at least one subsequent symbol; sampling a label from the set, and generating the subsequent output sequence from the suffix of the label. 10. The method of claim 1 , wherein n is at least 3. 11. The method of claim 1 , further comprising generating the input directed graph from the n-gram statistics. 12. The method of claim 11 , wherein the input directed graph comprises a de Bruijn graph. 13. The method of claim 12 , wherein the corpus sequence is derived from at least one document authored by the user. 14. The method of claim 1 , wherein the outputting includes proposing, on the graphical user interface, at least one of the first output sequence from the first iteration and the extended output sequence from at least one of the at least one subsequent iterations to a user for appending to the input sequence. 15. The method of claim 1 , wherein the outputting comprises outputting a plurality of candidate extended output sequences for appending to the input sequence, each of the candidate extended output sequences comprising a different sequence of symbols. 16. A computer program product comprising a non-transitory recording medium storing instructions, which when executed on a computer, causes the computer to perform the method of claim 1 . 17. A method for generating a text generation system comprising: receiving n-gram statistics for a set of n-grams, generated for a corpus of documents or receiving the set of n-grams and generating the n-gram statistics therefrom; generating an input directed graph based on the n-gram statistics, edges of the input directed graph being joined through nodes of the graph, each edge having an associated label corresponding to a sequence of symbols in the corpus and a multiplicity, representing a number of occurrences of the sequence of symbols in the corpus,

Assignees

Inventors

Classifications

  • G06F40/253Primary

    Grammatical analysis; Style critique · CPC title

  • Natural language generation · CPC title

  • Converting codes to words; Guess-ahead of partial word inputs · CPC title

  • Physics · mapped topic

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9740679B2 cover?
A method for generating an output sequence includes receiving an input sequence of symbols. An output sequence is generated from a reduced directed graph derived from n-gram statistics for a corpus sequence of symbols. The graph includes nodes connected by edges that are labeled with a sequence of symbols and associated with a multiplicity representing a number of occurrences of the sequence of…
Who is the assignee on this patent?
Xerox Corp
What technology area does this patent fall under?
Primary CPC classification G06F40/253. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 22 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).