Forecastable supervised labels and corpus sets for training a natural-language processing system

US10796241B2 · US · B2

Patent metadata
Publication number: US-10796241-B2
Application number: US-201514927766-A
Country: US
Kind code: B2
Filing date: Oct 30, 2015
Priority date: Oct 30, 2015
Publication date: Oct 6, 2020
Grant date: Oct 6, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and associated systems for forecastable supervised labels and corpus sets for training a natural-language processing system. An NLP-training system asks an “oracle” expert to answer a predictive test question and, in response, receives from the oracle an answer, rationales for selecting that answer, and identifications of extrinsic natural-language sources of evidence that supports those rationales. The system retrieves updated versions of that evidence at a later time, and returns that updated evidence to the oracle. In response, the oracle returns an updated answer and rationales based on the updated evidence. The system then compares time-varying characteristics of the evidence in order to determine the relative contributions of each piece of evidence to the oracles' selections. Less relevant evidence is discarded and the remaining, optimized, evidence is forwarded to the NLP system to be used as training data.
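
The two-round exchange described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Rationale`, `OracleResponse`, `run_two_rounds`, `toy_oracle`, and `toy_fetch` are hypothetical, and "changed evidence" is reduced here to a plain text comparison between the initial and later versions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rationale:
    text: str
    source_ids: List[str]  # each rationale cites at least one evidence source


@dataclass
class OracleResponse:
    label: str  # the oracle's answer to the predictive question
    rationales: List[Rationale]


# Hypothetical interfaces (assumptions, not from the patent text):
# an oracle maps (question, evidence) to a response, and a fetcher returns
# the text of a source for a given round (0 = initial, 1 = later).
Oracle = Callable[[str, Dict[str, str]], OracleResponse]
Fetcher = Callable[[str, int], str]


def run_two_rounds(
    question: str, oracle: Oracle, fetch: Fetcher
) -> Tuple[OracleResponse, OracleResponse, List[str]]:
    # Round 1: the oracle answers and names the sources behind its rationales.
    first = oracle(question, {})
    cited = sorted({sid for r in first.rationales for sid in r.source_ids})
    initial = {sid: fetch(sid, 0) for sid in cited}

    # Round 2: retrieve later versions and return them to the oracle.
    later = {sid: fetch(sid, 1) for sid in cited}
    second = oracle(question, later)

    # Sources whose content differs between the two retrievals.
    changed = [sid for sid in cited if initial[sid] != later[sid]]
    return first, second, changed


# Toy stand-ins so the sketch runs end to end.
VERSIONS = {
    "rate-report": ["Rates expected to hold.", "Rates expected to rise."],
    "weather-blog": ["Sunny all week.", "Sunny all week."],
}


def toy_fetch(source_id: str, round_index: int) -> str:
    return VERSIONS[source_id][round_index]


def toy_oracle(question: str, evidence: Dict[str, str]) -> OracleResponse:
    label = "up" if "rise" in evidence.get("rate-report", "") else "flat"
    rationales = [
        Rationale("Rate outlook drives the answer.", ["rate-report"]),
        Rationale("Weather is a minor factor.", ["weather-blog"]),
    ]
    return OracleResponse(label, rationales)


first, second, changed = run_two_rounds(
    "Will mortgage rates rise next quarter?", toy_oracle, toy_fetch
)
print(first.label, second.label, changed)  # flat up ['rate-report']
```

In this toy run only the source whose text changed coincides with the change in the oracle's answer, which is the kind of signal the abstract describes using to weigh each piece of evidence.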

First claim

Opening claim text (preview).

What is claimed is:

1. A natural language processing training (NLP-training) system comprising a processor, a memory coupled to the processor, and a computer-readable hardware storage device coupled to the processor, the storage device containing program code configured to be run by the processor via the memory to implement a method for forecastable supervised labels and corpus sets for training a natural-language processing system, the method comprising: the training system selecting an oracle, wherein the oracle is a human expert or a computerized expert system that possesses expertise in a particular field of endeavor; the training system receiving from the oracle identifications of a first label, a set of extrinsic electronic sources, and a set of rationales, wherein the first label identifies an answer to a predictive question in the particular field of endeavor, given a set of conditions specified by the predictive question, wherein each rationale of the set of rationales comprises an identification of at least one source of the set of extrinsic sources, and wherein each source of the set of extrinsic sources is a source of initial natural-language content upon which the oracle based the selection of the first label; the training system adding training datasets to a first set of corpora, wherein each training dataset of the training datasets is associated with a subset of natural-language content located at one or more sources of the set of extrinsic sources; the training system retrieving from the one or more extrinsic sources, at a second time, later versions of the natural-language content; the training system creating a second set of corpora by inserting into the first set of corpora the later versions of the natural-language content; the training system communicating the second set of corpora to the oracle; the training system accepting from the oracle, in response to the communicating, a second label; the training system deleting at least a first training dataset from the second set of corpora when a degree of relevance of the first training dataset falls below a predetermined threshold, wherein the degree of relevance of the first training dataset is proportional to a degree to which i) a difference between the initial and the later versions of the subset of natural-language content associated with the first training dataset influences ii) a difference between the first label and the second label; and the training system training the natural-language processing system by submitting the second set of corpora to a training function of a machine-learning application.

2. The NLP-training system of claim 1, wherein the oracle selects the second label as being an answer to the predictive question as a function of the later versions of the natural-language content.

3. The NLP-training system of claim 1, wherein an initial version of a first document, of the first training dataset, is a subset of the initial natural-language content retrieved from a particular source of the one or more extrinsic sources, wherein the creating the second set of corpora further comprises replacing, in the first training dataset, the initial version of the first document with an updated version of the first document, and wherein the updated version of the first document is generated by retrieving, from the particular source, an updated version of the subset of natural-language content and then updating the first document with the updated version of the particular natural-language content.

4. The NLP-training system of claim 1, wherein the degree of relevance of the first training dataset is proportional to a value of a lag variable associated with the first training dataset, wherein the lag variable measures a pattern of change in a time-varying characteristic of the first training dataset over a period of time.

5. The NLP-training system of claim 4, wherein the training system deletes the first training dataset from the second set of corpora if a value of the lag variable indicates that the degree of relevance of the first training dataset has fallen below the predetermined threshold at the second time.

6. The NLP-training system of claim 4, wherein the degree of relevance of the first training dataset is further determined by: the system transmitting to the oracle time-stamped metadata identified by values of the lag variable, wherein the time-stamped metadata identify time-specific values of a characteristic of the first training dataset; and the system receiving from the oracle, in response to the transmitting, an identification of a degree of contribution of the first training dataset to the oracle's selection of the second label.

7. A method for forecastable supervised labels and corpus sets for training a natural-language processing system, the method comprising: a natural language processing training (NLP-training) system selecting an oracle, wherein the oracle is a human expert or a computerized expert system that possesses expertise in a particular field of endeavor; the training system receiving from the oracle identifications of a first label, a set of extrinsic electronic sources, and a set of rationales, wherein the first label identifies an answer to a predictive question in the particular field of endeavor, given a set of conditions specified by the predictive question, wherein each rationale of the set of rationales comprises an identification of at least one source of the set of extrinsic sources, and wherein each source of the set of extrinsic sources is a source of initial natural-language content upon which the oracle based the selection of the first label; the training system adding training datasets to a first set of corpora, wherein each dataset of the training datasets is associated with a subset of natural-language content located at one or more sources of the set of extrinsic sources; the training system retrieving from the one or more extrinsic sources, at a second time, later versions of the natural-language content; the training system creating a second set of corpora by inserting into the first set of corpora the later versions of the natural-language content; the training system communicating the second set of corpora to the oracle; the training system accepting from the oracle, in response to the communicating, a second label; the training system deleting at least a first training dataset from the second set of corpora when a degree of relevance of the first training dataset falls below a threshold, wherein the degree of relevance of the first training dataset is proportional to a degree to which i) a difference between the initial and the later versions of the subset of natural-language content associated with the first training dataset influences ii) a difference between the first label and the second label; and the training system training the natural-language processing system by submitting the second set of corpora to a training function of a machine-learning application.

8. The method of claim 7, wherein the oracle selects the second label as being an answer to the predictive question as a function of the later versions of the natural-language content.

9. The method of claim 7, wherein an initial version of a first document, of the first training dataset, is a subset of the initial natural-language content retrieved from a particular source of the one or more extrinsic sources, wherein the creating the second set of corpora further comprises replacing, in the first training dataset, the initial version of the first document with an updated version of the first document, and wherein the updated version of the first document is generated by retrieving, from t
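
The deleting step recited in claims 1 and 7 can be illustrated with a short sketch. The claims do not say how the degree of relevance is computed, so this example simply assumes a per-dataset score is already available, for instance the degree of contribution that the oracle reports in claim 6. The names `TrainingDataset` and `prune_corpora`, the score values, and the choice to treat a dataset with no score as having zero relevance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrainingDataset:
    dataset_id: str
    initial_text: str  # content retrieved at the first time
    later_text: str    # content retrieved at the second time


def prune_corpora(
    datasets: List[TrainingDataset],
    relevance: Dict[str, float],
    threshold: float,
) -> List[TrainingDataset]:
    """Drop datasets whose relevance falls below the threshold.

    Assumption: a dataset without a score is treated as having relevance 0.0.
    """
    return [d for d in datasets if relevance.get(d.dataset_id, 0.0) >= threshold]


# Hypothetical second set of corpora and oracle-reported contributions.
second_corpora = [
    TrainingDataset("rate-report", "Rates expected to hold.", "Rates expected to rise."),
    TrainingDataset("weather-blog", "Sunny all week.", "Sunny all week."),
    TrainingDataset("unscored-feed", "No update.", "No update."),
]
contribution = {"rate-report": 0.8, "weather-blog": 0.1}

kept = prune_corpora(second_corpora, contribution, threshold=0.5)
print([d.dataset_id for d in kept])  # ['rate-report']
```

The surviving datasets are what the claims describe submitting to the training function of a machine-learning application. A lag-variable variant (claims 4 and 5) would replace the static score with one derived from how a dataset characteristic changes over time.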

Assignees

IBM

Inventors

Classifications

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Semantic analysis · CPC title

  • Inference or reasoning models · CPC title

  • Data-driven translation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10796241B2 cover?
A method and associated systems for forecastable supervised labels and corpus sets for training a natural-language processing system. An NLP-training system asks an “oracle” expert to answer a predictive test question and, in response, receives from the oracle an answer, rationales for selecting that answer, and identifications of extrinsic natural-language sources of evidence that supports tho…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Published Oct 6, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).