Retrieving text from a corpus of documents in an information handling system

US9727637B2 · US · B2

Patent metadata
Publication number: US-9727637-B2
Application number: US-201414462662-A
Country: US
Kind code: B2
Filing date: Aug 19, 2014
Priority date: Aug 19, 2014
Publication date: Aug 8, 2017
Grant date: Aug 8, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mechanism is provided for retrieving candidate answers from a corpus of documents. The mechanism receives an input question for which an answer is sought. The mechanism extracts features of the input question based on a natural language processing. The mechanism executes a first search of the corpus of documents based on a first subset of the extracted features of the input question and an initial evaluation of a utility of the first subset of extracted features to generate a subset of documents. The mechanism executes a second search of a set of passages extracted from the subset of documents based on a second subset of the extracted features of the input question and a reevaluation of the utility of the second subset of extracted features thereby forming a subset of passages. The mechanism generates query results from the subset of passages matching from which candidate answers are identified.
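The two searches described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes "utility" can be approximated by an IDF-style weight (a feature that appears in fewer units discriminates more), that passages are sentences split on periods, and that results are ranked by summed feature weights. The function names `utility`, `search`, and `two_stage_search`, and the parameters `doc_k` and `passage_k`, are hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def utility(features, units):
    """IDF-style stand-in for 'utility' (an assumption, not the patent's formula).

    A feature found in every unit, or in none, gets weight 0 because it
    cannot discriminate between units.
    """
    n = len(units)
    token_sets = [set(tokenize(u)) for u in units]
    weights = {}
    for f in features:
        df = sum(1 for s in token_sets if f in s)
        weights[f] = math.log((n + 1) / (df + 1)) if df else 0.0
    return weights


def search(features, units, top_k):
    """Rank units by the summed utility of the features they contain."""
    weights = utility(features, units)  # utility is evaluated against these units
    scored = []
    for index, unit in enumerate(units):
        tokens = set(tokenize(unit))
        score = sum(weights[f] for f in features if f in tokens)
        if score > 0:
            scored.append((score, index))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [units[index] for _, index in scored[:top_k]]


def two_stage_search(first_features, second_features, corpus, doc_k=2, passage_k=2):
    """First search over documents, second search over their passages."""
    documents = search(first_features, corpus, doc_k)
    # Assumption: a passage is a sentence; the patent does not fix this choice.
    passages = [p.strip() for d in documents for p in d.split(".") if p.strip()]
    # Utility is re-evaluated here because `search` recomputes weights on passages.
    return search(second_features, passages, passage_k)


corpus = [
    "Paris is the capital of France. The city lies on the Seine.",
    "Berlin is the capital of Germany. It has many museums.",
    "France is known for cheese. Wine is also produced in France.",
]
results = two_stage_search(["capital", "france"], ["capital", "france"], corpus)
print(results)
```

In this toy corpus both features have equal weight at the document level, but "france" becomes the stronger feature at the passage level because it appears in fewer passages than "capital". That shift is the effect the reevaluation step is meant to capture.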

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a question and answer (QA) system comprising a processor and a memory, for retrieving candidate answers from a corpus of documents, the method comprising: receiving, by the QA system, an input question for which an answer is sought; extracting, by the QA system, features of the input question based on a natural language processing of the input question; executing, by the QA system, a first search of the corpus of documents based on a first subset of the extracted features of the input question and an initial evaluation of a utility of the first subset of extracted features to generate a subset of documents matching the first subset of extracted features, wherein the utility of the first subset of extracted features identifies a degree to which each feature of the first subset of extracted features of the input question discriminates between documents in the corpus of documents that are sources of candidate answers to the input question; executing, by the QA system, a second search of a set of passages extracted from the subset of documents based on a second subset of the extracted features of the input question and a reevaluation of the utility of the second subset of extracted features thereby forming a subset of passages, wherein the utility of the second subset of extracted features identifies a degree to which each feature of the second subset of extracted features of the input question discriminates between passages in the set of passages that are sources of candidate answers to the input question; and generating, by the QA system, query results from the subset of passages from which a set of candidate answers for the input question are identified.

2. The method of claim 1, wherein the set of passages extracted from the subset of documents is less than all of the passages included in the subset of documents.

3. The method of claim 1, wherein executing the first search of the corpus of documents based on the first subset of the extracted features of the input question and the initial evaluation of the utility of the first subset of extracted features to generate the subset of documents matching the first subset of extracted features comprises: generating, by the QA system, a first statistical data structure for the corpus of documents; and identifying, by the QA system, the subset of documents from the corpus of documents comprised within the first statistical data structure relevant to the first subset of the extracted features utilizing the initial evaluation of the utility of the first subset of extracted features.

4. The method of claim 1, wherein executing the second search of the set of passages extracted from the subset of documents based on the second subset of the extracted features of the input question and the reevaluation of the utility of the second subset of extracted features comprises: generating, by the QA system, a second statistical data structure for the set of passages; and identifying, by the QA system, the query results from the subset of passages comprised within the second statistical data structure relevant to the second subset of the extracted features utilizing the reevaluation of the utility of the second subset of extracted features.

5. The method of claim 1, wherein the extracted features of the input question are identified by: identifying, by the QA system, a utility of each term in the input question; eliminating, by the QA system, zero or more terms within the input question that comprise a utility less than a predetermined value; and adding, by the QA system, the remaining terms in the input question to the extracted features.

6. The method of claim 5, wherein the extracted features of the input question are further identified by: identifying, by the QA system, one or more synonyms associated with the terms added to the extracted features; and adding, by the QA system, the one or more synonyms associated with the terms to the extracted features.

7. The method of claim 5, wherein the extracted features of the input question are further identified by: identifying, by the QA system, one or more tenses associated with the terms added to the extracted features; and adding, by the QA system, the one or more tenses associated with the terms to the extracted features.

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a computing device, causes the computing device to: receive an input question for which an answer is sought; extract features of the input question based on a natural language processing of the input question; execute a first search of a corpus of documents based on a first subset of the extracted features of the input question and an initial evaluation of a utility of the first subset of extracted features to generate a subset of documents matching the first subset of extracted features, wherein the utility of the first subset of extracted features identifies a degree to which each feature of the first subset of extracted features of the input question discriminates between documents in the corpus of documents that are sources of candidate answers to the input question; execute a second search of a set of passages extracted from the subset of documents based on a second subset of the extracted features of the input question and a reevaluation of the utility of the second subset of extracted features thereby forming a subset of passages, wherein the utility of the second subset of extracted features identifies a degree to which each feature of the second subset of extracted features of the input question discriminates between passages in the set of passages that are sources of candidate answers to the input question; and generate query results from the subset of passages from which a set of candidate answers for the input question are identified.

9. The computer program product of claim 8, wherein the set of passages extracted from the subset of documents is less than all of the passages included in the subset of documents.

10. The computer program product of claim 8, wherein the computer readable program to execute the first search of the corpus of documents based on the first subset of the extracted features of the input question and the initial evaluation of the utility of the first subset of extracted features to generate the subset of documents matching the first subset of extracted features further causes the computing device to: generate a first statistical data structure for the corpus of documents; and identify the subset of documents from the corpus of documents comprised within the first statistical data structure relevant to the first subset of the extracted features utilizing the initial evaluation of the utility of the first subset of extracted features.

11. The computer program product of claim 8, wherein the computer readable program to execute the second search of the set of passages extracted from the subset of documents based on the second subset of the extracted features of the input question and the reevaluation of the utility of the second subset of extracted features further causes the computing device to: generate a second statistical data structure for the set of passages; and identify the query results from the subset of passages comprised within the second statistical data structure relevant to the second subset of the extracted features utilizing the reevaluation of the utility of the second subset of extracted features.

12. The computer program product of claim 8, wherein the extracted features of the i
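The feature-extraction steps recited in claims 5 to 7 (score each term, drop terms below a predetermined utility, then add synonyms and tenses of the surviving terms) can be sketched as below. This is an illustrative sketch under stated assumptions: the utility values, the threshold, and the synonym and tense tables are hypothetical inputs supplied by the caller, since the claims do not say how they are computed or stored. The name `extract_features` is also hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(question, term_utility, threshold, synonyms=None, tenses=None):
    """Keep terms whose utility meets the threshold, then expand them.

    term_utility: hypothetical mapping of term -> utility score.
    threshold: the 'predetermined value'; terms scoring below it are eliminated.
    synonyms, tenses: hypothetical lookup tables of term -> list of variants.
    """
    synonyms = synonyms or {}
    tenses = tenses or {}

    # Claim 5: eliminate zero or more low-utility terms, keep the rest.
    features = []
    for term in tokenize(question):
        if term_utility.get(term, 0.0) < threshold:
            continue
        if term not in features:
            features.append(term)

    # Claims 6 and 7: add tenses and synonyms of the terms that were kept.
    for term in list(features):
        for variant in tenses.get(term, []) + synonyms.get(term, []):
            if variant not in features:
                features.append(variant)
    return features


features = extract_features(
    "Who invented the telephone?",
    term_utility={"who": 0.1, "invented": 0.8, "the": 0.0, "telephone": 0.9},
    threshold=0.5,
    synonyms={"telephone": ["phone"]},
    tenses={"invented": ["invent", "invents"]},
)
print(features)
```

Terms missing from the utility table default to 0.0 here and are therefore dropped for any positive threshold; a real system would need its own policy for unseen terms.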

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9727637B2 cover?
A mechanism is provided for retrieving candidate answers from a corpus of documents. The mechanism receives an input question for which an answer is sought. The mechanism extracts features of the input question based on a natural language processing. The mechanism executes a first search of the corpus of documents based on a first subset of the extracted features of the input question and an in…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F17/30654. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 8, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).