Discrepancy Handler for Document Ingestion into a Corpus for a Cognitive Computing System
US-2017169017-A1 · Jun 15, 2017 · US
US10796241B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10796241-B2 |
| Application number | US-201514927766-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 30, 2015 |
| Priority date | Oct 30, 2015 |
| Publication date | Oct 6, 2020 |
| Grant date | Oct 6, 2020 |
Official abstract text for this publication.
A method and associated systems for forecastable supervised labels and corpus sets for training a natural-language processing system. An NLP-training system asks an “oracle” expert to answer a predictive test question and, in response, receives from the oracle an answer, rationales for selecting that answer, and identifications of extrinsic natural-language sources of evidence that supports those rationales. The system retrieves updated versions of that evidence at a later time, and returns that updated evidence to the oracle. In response, the oracle returns an updated answer and rationales based on the updated evidence. The system then compares time-varying characteristics of the evidence in order to determine the relative contributions of each piece of evidence to the oracles' selections. Less relevant evidence is discarded and the remaining, optimized, evidence is forwarded to the NLP system to be used as training data.
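The discard step at the end of the abstract can be illustrated with a minimal sketch. It assumes each piece of evidence already carries a numeric relevance score. The `TrainingDataset` class, the `prune_corpus` function, the 0-to-1 score scale, the sample names, and the threshold value are all hypothetical and do not come from the patent, which does not specify how scores are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingDataset:
    name: str
    relevance: float  # hypothetical score; the source does not define a scale


def prune_corpus(corpus, threshold):
    """Drop datasets whose relevance falls below the threshold."""
    return [ds for ds in corpus if ds.relevance >= threshold]


# Illustrative values only; not taken from the patent.
corpus = [
    TrainingDataset("analyst_report", 0.82),
    TrainingDataset("press_release", 0.10),
    TrainingDataset("market_news", 0.45),
]

kept = prune_corpus(corpus, threshold=0.3)
print([ds.name for ds in kept])  # ['analyst_report', 'market_news']
```

A dataset scoring exactly at the threshold is kept here, since the text only describes discarding evidence that falls below it.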
Opening claim text (preview).
What is claimed is:

1. A natural language processing training (NLP-training) system comprising a processor, a memory coupled to the processor, and a computer-readable hardware storage device coupled to the processor, the storage device containing program code configured to be run by the processor via the memory to implement a method for forecastable supervised labels and corpus sets for training a natural-language processing system, the method comprising:
the training system selecting an oracle, wherein the oracle is a human expert or a computerized expert system that possesses expertise in a particular field of endeavor;
the training system receiving from the oracle identifications of a first label, a set of extrinsic electronic sources, and a set of rationales, wherein the first label identifies an answer to a predictive question in the particular field of endeavor, given a set of conditions specified by the predictive question, wherein each rationale of the set of rationales comprises an identification of at least one source of the set of extrinsic sources, and wherein each source of the set of extrinsic sources is a source of initial natural-language content upon which the oracle based the selection of the first label;
the training system adding training datasets to a first set of corpora, wherein each training dataset of the training datasets is associated with a subset of natural-language content located at one or more sources of the set of extrinsic sources;
the training system retrieving from the one or more extrinsic sources, at a second time, later versions of the natural-language content;
the training system creating a second set of corpora by inserting into the first set of corpora the later versions of the natural-language content;
the training system communicating the second set of corpora to the oracle;
the training system accepting from the oracle, in response to the communicating, a second label;
the training system deleting at least a first training dataset from the second set of corpora when a degree of relevance of the first training dataset falls below a predetermined threshold, wherein the degree of relevance of the first training dataset is proportional to a degree to which i) a difference between the initial and the later versions of the subset of natural-language content associated with the first training dataset influences ii) a difference between the first label and the second label; and
the training system training the natural-language processing system by submitting the second set of corpora to a training function of a machine-learning application.

2. The NLP-training system of claim 1, wherein the oracle selects the second label as being an answer to the predictive question as a function of the later versions of the natural-language content.

3. The NLP-training system of claim 1, wherein an initial version of a first document, of the first training dataset, is a subset of the initial natural-language content retrieved from a particular source of the one or more extrinsic sources, wherein the creating the second set of corpora further comprises replacing, in the first training dataset, the initial version of the first document with an updated version of the first document, and wherein the updated version of the first document is generated by retrieving, from the particular source, an updated version of the subset of natural-language content and then updating the first document with the updated version of the particular natural-language content.

4. The NLP-training system of claim 1, wherein the degree of relevance of the first training dataset is proportional to a value of a lag variable associated with the first training dataset, wherein the lag variable measures a pattern of change in a time-varying characteristic of the first training dataset over a period of time.

5. The NLP-training system of claim 4, wherein the training system deletes the first training dataset from the second set of corpora if a value of the lag variable indicates that the degree of relevance of the first training dataset has fallen below the predetermined threshold at the second time.

6. The NLP-training system of claim 4, wherein the degree of relevance of the first training dataset is further determined by:
the system transmitting to the oracle time-stamped metadata identified by values of the lag variable, wherein the time-stamped metadata identify time-specific values of a characteristic of the first training dataset; and
the system receiving from the oracle, in response to the transmitting, an identification of a degree of contribution of the first training dataset to the oracle's selection of the second label.

7. A method for forecastable supervised labels and corpus sets for training a natural-language processing system, the method comprising:
a natural language processing training (NLP-training) system selecting an oracle, wherein the oracle is a human expert or a computerized expert system that possesses expertise in a particular field of endeavor;
the training system receiving from the oracle identifications of a first label, a set of extrinsic electronic sources, and a set of rationales, wherein the first label identifies an answer to a predictive question in the particular field of endeavor, given a set of conditions specified by the predictive question, wherein each rationale of the set of rationales comprises an identification of at least one source of the set of extrinsic sources, and wherein each source of the set of extrinsic sources is a source of initial natural-language content upon which the oracle based the selection of the first label;
the training system adding training datasets to a first set of corpora, wherein each dataset of the training datasets is associated with a subset of natural-language content located at one or more sources of the set of extrinsic sources;
the training system retrieving from the one or more extrinsic sources, at a second time, later versions of the natural-language content;
the training system creating a second set of corpora by inserting into the first set of corpora the later versions of the natural-language content;
the training system communicating the second set of corpora to the oracle;
the training system accepting from the oracle, in response to the communicating, a second label;
the training system deleting at least a first training dataset from the second set of corpora when a degree of relevance of the first training dataset falls below a threshold, wherein the degree of relevance of the first training dataset is proportional to a degree to which i) a difference between the initial and the later versions of the subset of natural-language content associated with the first training dataset influences ii) a difference between the first label and the second label; and
the training system training the natural-language processing system by submitting the second set of corpora to a training function of a machine-learning application.

8. The method of claim 7, wherein the oracle selects the second label as being an answer to the predictive question as a function of the later versions of the natural-language content.

9. The method of claim 7, wherein an initial version of a first document, of the first training dataset, is a subset of the initial natural-language content retrieved from a particular source of the one or more extrinsic sources, wherein the creating the second set of corpora further comprises replacing, in the first training dataset, the initial version of the first document with an updated version of the first document, and wherein the updated version of the first document is generated by retrieving, from t
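
Claims 4 and 5 tie the degree of relevance to a lag variable that measures a pattern of change in a time-varying characteristic of a training dataset. The claims do not say how that variable is computed, so the following is only one possible reading, sketched under stated assumptions: the characteristic is a numeric series sampled at regular intervals, and the pattern of change is summarized as the mean absolute difference between observations a fixed number of steps apart. The function names, the `lag` parameter, and the sample series are hypothetical.

```python
def lagged_changes(values, lag=1):
    """Differences between each observation and the one `lag` steps earlier."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


def mean_abs_change(values, lag=1):
    """Average size of the lagged changes; 0.0 when the series is too short."""
    changes = lagged_changes(values, lag)
    if not changes:
        return 0.0
    return sum(abs(c) for c in changes) / len(changes)


# Hypothetical time-stamped characteristic, e.g. a weekly mention count.
mentions = [10, 12, 12, 20]

print(lagged_changes(mentions))          # [2, 0, 8]
print(mean_abs_change(mentions, lag=2))  # 5.0
```

Under this reading, a summary value like this could be compared against the threshold recited in claim 5, but how the lag variable maps to relevance is left open by the claim text.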