Automatically generating test/training questions and answers through pattern based analysis and natural language processing techniques on the given corpus for quick domain adaptation

US10339453B2 · US · B2

Patent metadata
Publication number: US-10339453-B2
Application number: US-201314139589-A
Country: US
Kind code: B2
Filing date: Dec 23, 2013
Priority date: Dec 23, 2013
Publication date: Jul 2, 2019
Grant date: Jul 2, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mechanism is provided in a data processing system for automatically generating question and answer pairs for training a question answering system for a given domain. The mechanism identifies a set of patterns of components in passages within a corpus of documents for the given domain. The mechanism identifies a set of rules that correspond to the set of patterns for generating question and answer pairs from the passages within the corpus of documents. The mechanism applies the set of rules to the passages to generate the question and answer pairs.
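
As a rough illustration of the idea in the abstract, the following minimal sketch pairs each passage pattern with a rule that turns a matching passage into a question and answer pair. It is not the patented implementation: the regular-expression patterns, the `PATTERN_RULES` table, the `QAPair` type, and the `generate_qa_pairs` function are all hypothetical names chosen for this example, and plain word-level regexes stand in for the richer pattern components (part-of-speech tags, named entities, and so on) that the claims mention.

```python
import re
from typing import NamedTuple


class QAPair(NamedTuple):
    question: str
    answer: str


# Hypothetical pattern-rules mapping: each compiled pattern is stored with
# the rule (a function of the regex match) that builds a QA pair from it.
PATTERN_RULES = [
    (
        re.compile(r"^(?P<x>[A-Z][\w ]*?) is the capital of (?P<y>[A-Z][\w ]*?)\.$"),
        lambda m: QAPair(f"What is the capital of {m['y']}?", m["x"]),
    ),
    (
        re.compile(r"^(?P<x>[A-Z][\w ]*?) was founded in (?P<year>\d{4})\.$"),
        lambda m: QAPair(f"When was {m['x']} founded?", m["year"]),
    ),
]


def generate_qa_pairs(passages):
    """Apply every rule whose pattern matches a passage; skip the rest."""
    pairs = []
    for passage in passages:
        for pattern, rule in PATTERN_RULES:
            match = pattern.match(passage.strip())
            if match:
                pairs.append(rule(match))
    return pairs


corpus = [
    "Paris is the capital of France.",
    "The weather was mild.",  # matches no pattern, so it yields no pair
    "IBM was founded in 1911.",
]
for pair in generate_qa_pairs(corpus):
    print(pair.question, "->", pair.answer)
```

Keeping patterns and rules together in one table mirrors the "pattern-rules mapping storage" described in the first claim; in this sketch the table is hand-written, whereas the patent describes identifying frequent patterns automatically.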

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system configured with a computer readable program that causes the data processing system to implement a question and answer creation system executing on a processor of the data processing system for automatically generating question and answer pairs for training a question answering system for a given domain, the method comprising:

  automatically identifying, by the question and answer creation system executing on the processor of the data processing system, a set of most frequently occurring patterns of components in passages within a corpus of documents for the given domain using an unsupervised technique;

  automatically filtering the set of most frequently occurring patterns to remove frequently occurring patterns that are unlikely to result in meaningful questions based on a domain dictionary to form a filtered set of patterns;

  identifying, by the question and answer creation system, a set of rules that correspond to the filtered set of patterns for generating question and answer pairs from the passages within the corpus of documents;

  storing, by the question and answer creation system, the filtered set of patterns in association with the set of rules in a pattern-rules mapping storage;

  identifying, by the question and answer creation system, an identified set of passages in the corpus that match the filtered set of patterns in the pattern-rules mapping storage;

  performing, by the question and answer creation system, pre-processing on the set of passages to select a subset of the passages in the identified set of passages to be used for generating question and answer pairs to form a selected set of passages, wherein the pre-processing collects metadata attributes of the identified set of passages to select the selected set of passages;

  applying, by the question and answer creation system, the set of rules in the pattern-rules mapping storage to the selected set of passages to generate a set of question and answer pairs;

  performing, by the question and answer creation system, post-processing on the set of question and answer pairs using the metadata attributes to form a final set of question and answer pairs, wherein performing post-processing comprises ordering questions by similarity; merging similar questions with the same answer; scoring similar questions with different answers; and applying an analytic algorithm to the similar questions to resolve conflicts and generate new questions; and

  training a question answering system using the final set of question and answer pairs.

2. The method of claim 1, wherein performing pre-processing comprises collecting the metadata attributes based on syntactic and semantic clues from a document in which each given passage in the set of passages occurs.

3. The method of claim 1, wherein the components of the patterns are selected from a group consisting of: words, part-of-speech tags, named entities, or subject-predicate relations.

4. The method of claim 1, wherein identifying the set of rules utilizes techniques selected from a group consisting of: pronoun disambiguation, anaphora resolution, language linguistics, sentence relationships, frequency, or lexical databases.

5. The method of claim 1, further comprising ranking the generated question and answer pairs and using a high ranked subset of question and answer pairs to train the question answering system.

6. A computer program product comprising a non-transitory computer readable storage medium having a computer readable program stored therein, wherein a computing device configured with the computer readable program implements a question and answer creation system executing on a processor of the computing device for automatically generating question and answer pairs for training a question answering system for a given domain, wherein the computer readable program causes the computing device to:

  automatically identify, by the question and answer creation system executing on the computing device, a set of most frequently occurring patterns of components in passages within a corpus of documents for the given domain using an unsupervised technique;

  automatically filtering the set of most frequently occurring patterns to remove frequently occurring patterns that are unlikely to result in meaningful questions based on a domain dictionary to form a filtered set of patterns;

  identify, by the question and answer creation system, a set of rules that correspond to the filtered set of patterns for generating question and answer pairs from the passages within the corpus of documents;

  store, by the question and answer creation system, the filtered set of patterns in association with the set of rules in a pattern-rules mapping storage;

  identify, by the question and answer creation system, an identified set of passages in the corpus that match the filtered set of patterns in the pattern-rules mapping storage;

  perform, by the question and answer creation system, pre-processing on the set of passages to select a subset of the passages in the identified set of passages to be used for generating question and answer pairs to form a selected set of passages, wherein the pre-processing collects metadata attributes of the identified set of passages to select the selected set of passages;

  apply, by the question and answer creation system, the set of rules in the pattern-rules mapping storage to the selected set of passages to generate a set of question and answer pairs;

  perform, by the question and answer creation system, post-processing on the set of question and answer pairs using the metadata attributes to form a final set of question and answer pairs, wherein performing post-processing comprises ordering questions by similarity; merging similar questions with the same answer; scoring similar questions with different answers; and applying an analytic algorithm to the similar questions to resolve conflicts and generate new questions; and

  train a question answering system using the final set of question and answer pairs.

7. The computer program product of claim 6, wherein performing pre-processing comprises collecting the metadata attributes based on syntactic and semantic clues from a document in which each given passage in the set of passages occurs.

8. The computer program product of claim 6, wherein the components of the patterns are selected from a group consisting of: words, part-of-speech tags, named entities, or subject-predicate relations.

9. The computer program product of claim 6, wherein identifying the set of rules utilizes techniques selected from a group consisting of: pronoun disambiguation, anaphora resolution, language linguistics, sentence relationships, frequency, or lexical databases.

10. The computer program product of claim 6, wherein the computer readable program further causes the computing device to rank the generated question and answer pairs and using a high ranked subset of question and answer pairs to train the question answering system.

11. An apparatus comprising: a processor; and a memory coupled to the processor, wherein the memory comprises a computer readable program, wherein the apparatus configured with the computer readable program implements a question and answer creation system executing on the processor for automatically generating question and answer pairs for training a question answering system for a given domain, wherein the computer readable program causes the processor to:

  automatically identify, by the question and answer creation system, a set of most frequently occurring patterns of components in passages within a corpus of documents for the given domain using an unsupervised technique;

  automatically fi
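
The post-processing recited in claim 1 (ordering questions by similarity, merging similar questions with the same answer, and scoring similar questions with different answers) can be sketched roughly as follows. Everything here is an assumption made for illustration: the `similar` and `post_process` names, the 0.8 threshold, the use of `difflib.SequenceMatcher` as the similarity measure, lexicographic sorting as a crude way of placing near-identical questions next to each other, and a simple majority vote standing in for the unspecified "analytic algorithm". The sketch does not attempt the claim's generation of new questions.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Case-insensitive string similarity test (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def post_process(pairs, threshold=0.8):
    """Cluster similar questions, merge agreeing answers, score conflicts.

    Returns one (question, answer, score) tuple per cluster, where score is
    the fraction of the cluster's pairs that agree with the chosen answer.
    """
    clusters = []
    # Sorting places near-identical question strings close together.
    for question, answer in sorted(pairs):
        for cluster in clusters:
            # Compare against the first question seen in each cluster.
            if similar(question, cluster[0][0], threshold):
                cluster.append((question, answer))
                break
        else:
            clusters.append([(question, answer)])

    final = []
    for cluster in clusters:
        votes = Counter(answer for _, answer in cluster)
        best_answer, count = votes.most_common(1)[0]  # majority vote
        representative = next(q for q, a in cluster if a == best_answer)
        final.append((representative, best_answer, count / len(cluster)))
    return final


raw_pairs = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of france", "Paris"),
    ("What is the capital of France ?", "Lyon"),  # conflicting answer
    ("When was IBM founded?", "1911"),
]
for question, answer, score in post_process(raw_pairs):
    print(f"{question} -> {answer} (agreement {score:.2f})")
```

The agreement score gives a downstream ranking step, such as the one in claim 5, something to sort on; a real system would presumably also draw on the metadata attributes collected during pre-processing, which this sketch ignores.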

Assignees

Inventors

Classifications

  • G06N5/025 (Primary)

    Extracting rules from data · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10339453B2 cover?
A mechanism is provided in a data processing system for automatically generating question and answer pairs for training a question answering system for a given domain. The mechanism identifies a set of patterns of components in passages within a corpus of documents for the given domain. The mechanism identifies a set of rules that correspond to the set of patterns for generating question and an…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N5/025. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 2, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).