Cognitive system with ingestion of natural language documents with embedded code
US-9606990-B2 · Mar 28, 2017 · US
US9842096B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9842096-B2 |
| Application number | US-201615152826-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 12, 2016 |
| Priority date | May 12, 2016 |
| Publication date | Dec 12, 2017 |
| Grant date | Dec 12, 2017 |
Official abstract text for this publication.
A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
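The flow in the abstract (divide, annotate, count, compute a metric, compare, filter, add) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy lexicon `TOY_LEXICON` stands in for a real part-of-speech annotator, blank-line splitting is an assumed passage convention, and the metric (the share of tokens the annotator cannot tag) and the `0.5` threshold are hypothetical choices made only for this example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real part-of-speech annotator.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "pipeline": "NOUN", "document": "NOUN", "corpus": "NOUN", "passage": "NOUN",
    "receives": "VERB", "divides": "VERB", "filters": "VERB", "adds": "VERB",
    "into": "ADP", "to": "ADP",
}


def split_passages(document):
    """Divide a document into passages on blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def annotate(passage):
    """Tag each token with a part of speech, or OTHER if it is unknown."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", passage)
    return [(tok, TOY_LEXICON.get(tok.lower(), "OTHER")) for tok in tokens]


def feature_counts(annotated):
    """Count the instances of each feature type in an annotated passage."""
    return Counter(tag for _, tag in annotated)


def metric_value(counts):
    """Assumed metric: fraction of tokens that received no part-of-speech tag."""
    total = sum(counts.values())
    return counts["OTHER"] / total if total else 1.0


def is_nonsense(passage, threshold=0.5):
    """Compare the metric value to a threshold (direction is an assumption)."""
    return metric_value(feature_counts(annotate(passage))) > threshold


def ingest(document, corpus, threshold=0.5):
    """Filter out nonsense passages and add the remaining ones to the corpus."""
    kept = [p for p in split_passages(document) if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept
```

For instance, a document with one prose paragraph and one paragraph of embedded code would, under these assumptions, contribute only the prose paragraph to `corpus`, because every token in the code paragraph falls outside the toy lexicon.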
Opening claim text (preview).
What is claimed is:

1. A method, in a data processing system, for identifying nonsense passages in documents being ingested into a corpus, the method comprising: receiving, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus; dividing, by the natural language processing pipeline, the input document into a plurality of input passages; identifying, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filtering, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and adding, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.

2. The method of claim 1, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.

3. The method of claim 1, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.

4. The method of claim 1, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.

5. The method of claim 1, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.

6. The method of claim 1, wherein the metric and the predetermined model threshold are defined in a profile data structure.

7. The method of claim 1, further comprising: responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determining whether the candidate evidence passage is a nonsense passage; and filtering the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program comprises a natural language processing pipeline configured to execute on a data processing system to: receive an input document to be ingested into a corpus; divide the input document into a plurality of input passages; identify whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filter each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and add the filtered plurality of input passages into the corpus.

9. The computer program product of claim 8, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.

10. The computer program product of claim 8, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.

11. The computer program product of claim 8, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.

12. The computer program product of claim 8, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.

13. The computer program product of claim 8, wherein the metric and the predetermined model threshold are defined in a profile data structure.

14. The computer program product of claim 8, wherein the natural language processing pipeline further causes the data processing system to: responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determine whether the candidate evidence passage is a nonsense passage; and filter the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.

15. An apparatus comprising: a processor, and a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to: receive, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus; divide, by the natural language processing pipeline, the input document into a plurality of input passages; identify, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filter, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plu
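
The dependent claims add three details: removing versus marking a nonsense passage, a part-of-speech ratio as the metric, and a profile data structure that holds the metric and threshold. The sketch below shows one way these might fit together. Everything named here is hypothetical: the `Profile` fields, the `mode` values, the `SYM` and `NOUN` tag names, and the choice that a ratio above the threshold means nonsense are all assumptions for illustration, and the feature counts are taken as already computed by an upstream annotator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile: which ratio to compute and how to act on it."""
    numerator: str         # tag of the first part of speech
    denominator: str       # tag of the second part of speech
    threshold: float       # predetermined model threshold
    mode: str = "remove"   # "remove" drops nonsense passages, "mark" keeps them flagged


def ratio(counts, profile):
    """Ratio of first to second part-of-speech counts, guarding against zero."""
    num = counts.get(profile.numerator, 0)
    den = counts.get(profile.denominator, 0)
    if den == 0:
        return float("inf") if num else 0.0
    return num / den


def apply_filter(passages_with_counts, profile):
    """Remove or mark each passage according to the profile."""
    filtered = []
    for text, counts in passages_with_counts:
        # Assumption: a ratio above the threshold indicates a nonsense passage.
        nonsense = ratio(counts, profile) > profile.threshold
        if nonsense and profile.mode == "remove":
            continue
        filtered.append({"text": text, "nonsense": nonsense})
    return filtered
```

Keeping the metric and threshold in a profile object means a different corpus could swap in a different ratio without touching the filter code. The same check could plausibly be reused on candidate evidence passages in a question answering system, as the claims describe, though that wiring is not shown here.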
Natural language query formulation · CPC title
Annotation, e.g. comment data or footnotes · CPC title
Selection or weighting of terms for indexing · CPC title
Ontology · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title