Template bootstrapping for domain-adaptable natural language generation

US10095692B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10095692-B2
Application number: US-201514726119-A
Country: US
Kind code: B2
Filing date: May 29, 2015
Priority date: Nov 29, 2012
Publication date: Oct 9, 2018
Grant date: Oct 9, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present invention relates to a system and method for bootstrapping templates for use in natural language sentence generation. More specifically, the present invention relates to identifying a set of candidate sentences from a large corpus based on a set of original templates by using a similarity measure. The set of candidate sentences are then processed or cleaned to generate a set of templates for use in natural language sentence generation.

First claim

Opening claim text (preview).

What is claimed is:
1. A computer implemented method comprising:
  a) receiving by a computer comprising a processor and a memory a set of original templates and storing the set of original templates in the memory;
  b) accessing by a computer a set of databases comprising a large corpus of documents and searching by a search engine the set of databases based on the set of original templates;
  c) identifying by the search engine a set of candidate sentences from a set of documents in the corpus by using a similarity measure to determine a similarity score, wherein the similarity measure comprises extracting a first set of tokens from at least one template from the set of original templates and extracting a second set of tokens from at least one candidate sentence from the set of candidate sentences, the first set of tokens and the second set of tokens each comprising a set of token-level 1 to token-level n grams, and further comprises comparing the extracted first set of tokens with the extracted second set of tokens by determining a first value representing an intersection of the extracted first and second sets of tokens, and dividing that first value by a second value derived by applying a minimum function to the extracted first and second sets of tokens to determine the similarity score;
  d) automatically eliminating candidate sentences from the set of candidate sentences based upon a similarity score threshold to arrive at a reduced set of candidate sentences determined to be syntactically similar to the at least one template; and
  e) processing the reduced set of candidate sentences to generate a set of natural language generation templates that, when processed by a computer and combined with a set of determined words or phrases, generate natural language text.
2. The method of claim 1 further comprising sorting the set of candidate sentences based on the similarity score.
3. The method of claim 1 further comprising identifying all sentences in the corpus by splitting each sentence from each other sentence for every document in the corpus.
4. The method of claim 1 further comprising wherein the similarity measure comprises the formula:
  $$\frac{\left|\mathrm{gram\_set}(n, s1) \cap \mathrm{gram\_set}(n, s2)\right|}{\min\left(\left|\mathrm{gram\_set}(n, s1)\right|, \left|\mathrm{gram\_set}(n, s2)\right|\right)} > \theta$$
  wherein s1 represents a first sentence and s2 represents a second sentence and wherein gram_set(n, s1) and gram_set(n, s2) each extract the token level 1 to n-grams from a sentence.
5. The method of claim 1 wherein the identifying further comprises identifying a set of syntactically similar sentences that are not identical to any template in the set of original templates and that comprise a set of semantic characteristics similar to the set of original templates.
6. The method of claim 1 further comprising determining if the similarity score for a sentence and a template from the set of original templates is higher than a determined threshold and placing the sentence in the set of candidate sentences.
7. The method of claim 1 wherein the identifying further comprises identifying a set of candidate sentences that relate to a topic similar to a topic associated with the set of original templates.
8. The method of claim 1 further comprising wherein the set of original templates are manually generated for a domain.
9. The method of claim 1 further comprising wherein the large corpus of documents is a news corpus.
10. The method of claim 1 further comprising generating by a computer a set of natural language sentences based on the set of natural language templates.
11. A system for bootstrapping a set of templates for generating natural language sentences, the system comprising:
  a) at least one database comprising a corpus of documents;
  b) a computer comprising a processor and a memory, the memory containing a set of executable code executable by the processor;
  c) a search controller configured to receive a set of original templates and generate a query based on the set of original templates;
  d) a search engine adapted to receive the query from the search controller and search the corpus of documents using the query based on the set of original templates to identify a set of candidate sentences from the corpus of documents;
  e) a template analyzer adapted to: i) select a set of similar sentences from the identified set of candidate sentences by using a similarity measure to determine a similarity score for each selected sentence, wherein the similarity measure comprises extracting a first set of tokens from at least one templat
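
The similarity measure recited in claims 1 and 4, together with the threshold filtering and sorting of claims 1(d), 2 and 6, can be illustrated with a minimal Python sketch. This is not the patented implementation. Only the name gram_set comes from the claim text; the function names similarity and select_candidates, the lowercasing, the whitespace tokenization, and the default values n=2 and theta=0.5 are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def gram_set(n: int, sentence: str) -> Set[Tuple[str, ...]]:
    """Collect every token-level 1-gram through n-gram of a sentence."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokens = sentence.lower().split()
    grams: Set[Tuple[str, ...]] = set()
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.add(tuple(tokens[start:start + size]))
    return grams


def similarity(n: int, s1: str, s2: str) -> float:
    """Size of the n-gram intersection divided by the size of the smaller set."""
    g1, g2 = gram_set(n, s1), gram_set(n, s2)
    smaller = min(len(g1), len(g2))
    if smaller == 0:
        return 0.0  # assumption: an empty sentence matches nothing
    return len(g1 & g2) / smaller


def select_candidates(
    templates: List[str],
    sentences: List[str],
    n: int = 2,        # hypothetical default
    theta: float = 0.5,  # hypothetical threshold
) -> List[Tuple[float, str]]:
    """Keep sentences whose best template score exceeds theta, best first."""
    scored = []
    for sentence in sentences:
        best = max((similarity(n, t, sentence) for t in templates), default=0.0)
        if best > theta:
            scored.append((best, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    templates = ["the company reported a profit"]
    corpus = ["the company reported a loss", "rain is expected tomorrow"]
    for score, sentence in select_candidates(templates, corpus):
        print(f"{score:.3f}  {sentence}")
```

Because the denominator is the smaller of the two set sizes, a short sentence whose n-grams all occur in a longer one scores 1.0, so the measure rewards containment rather than penalizing length differences the way a union-based (Jaccard) denominator would.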

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10095692B2 cover?
The present invention relates to a system and method for bootstrapping templates for use in natural language sentence generation. More specifically, the present invention relates to identifying a set of candidate sentences from a large corpus based on a set of original templates by using a similarity measure. The set of candidate sentences are then processed or cleaned to generate a set of temp…
Who is the assignee on this patent?
Thomson Reuters Global Resources, Thomson Reuters Global Resources Unlimited Company
What technology area does this patent fall under?
Primary CPC classification G06F40/56. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 9, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).