Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system

US9842096B2 · US · B2

Patent metadata
Publication number: US-9842096-B2
Application number: US-201615152826-A
Country: US
Kind code: B2
Filing date: May 12, 2016
Priority date: May 12, 2016
Publication date: Dec 12, 2017
Grant date: Dec 12, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
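The flow described in the abstract (divide, identify, filter, add) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the blank-line passage split, the two token types being counted, the ratio used as the metric, the `THRESHOLD` value, and the rule that a low ratio means nonsense.

```python
import re
from collections import Counter

# Hypothetical threshold; the abstract does not give a value.
THRESHOLD = 0.5

def divide(document: str) -> list[str]:
    """Split a document into passages on blank lines (an assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def feature_counts(passage: str) -> Counter:
    """Count two illustrative token types: alphabetic words and everything else."""
    counts = Counter()
    for token in passage.split():
        core = token.strip(".,;:!?\"'()")
        counts["word" if core.isalpha() else "other"] += 1
    return counts

def metric(counts: Counter) -> float:
    """Fraction of tokens that are alphabetic words (an assumed metric)."""
    total = counts["word"] + counts["other"]
    return counts["word"] / total if total else 0.0

def is_nonsense(passage: str) -> bool:
    """Assumed rule: a passage with too few word-like tokens is nonsense."""
    return metric(feature_counts(passage)) < THRESHOLD

def ingest(document: str, corpus: list[str]) -> None:
    """Filter the passages and add the survivors to the corpus."""
    corpus.extend(p for p in divide(document) if not is_nonsense(p))

corpus: list[str] = []
ingest("The cat sat on the mat.\n\n0x3f 9981 ## 77-12 @@ zz9\n\nDogs bark loudly.", corpus)
```

In this sketch the middle passage contains no alphabetic tokens, so only the first and third passages reach `corpus`.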

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system, for identifying nonsense passages in documents being ingested into a corpus, the method comprising: receiving, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus; dividing, by the natural language processing pipeline, the input document into a plurality of input passages; identifying, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filtering, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and adding, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.

2. The method of claim 1, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.

3. The method of claim 1, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.

4. The method of claim 1, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.

5. The method of claim 1, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.

6. The method of claim 1, wherein the metric and the predetermined model threshold are defined in a profile data structure.

7. The method of claim 1, further comprising: responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determining whether the candidate evidence passage is a nonsense passage; and filtering the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program comprises a natural language processing pipeline configured to execute on a data processing system to: receive an input document to be ingested into a corpus; divide the input document into a plurality of input passages; identify whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filter each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and add the filtered plurality of input passages into the corpus.

9. The computer program product of claim 8, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.

10. The computer program product of claim 8, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.

11. The computer program product of claim 8, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.

12. The computer program product of claim 8, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.

13. The computer program product of claim 8, wherein the metric and the predetermined model threshold are defined in a profile data structure.

14. The computer program product of claim 8, wherein the natural language processing pipeline further causes the data processing system to: responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determine whether the candidate evidence passage is a nonsense passage; and filter the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.

15. An apparatus comprising: a processor, and a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to: receive, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus; divide, by the natural language processing pipeline, the input document into a plurality of input passages; identify, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises: annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage; counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts; determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold; filter, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plu
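The identification sub-steps recited in claim 1 (annotate, count, determine a metric, compare to a threshold), together with the part-of-speech ratio, the profile data structure, the remove and mark variants, and the evidence-passage check of the dependent claims, can be sketched as follows. This is an illustrative sketch only. The toy `LEXICON` stands in for a real part-of-speech tagger, and the `Profile` fields, the noun-to-verb ratio, the threshold of 3.0, the "above threshold means nonsense" rule, and the handling of a zero denominator are all assumptions that do not come from the claims.

```python
from collections import Counter
from dataclasses import dataclass

# Toy lexicon standing in for a real part-of-speech tagger (assumption).
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "ball": "NOUN", "index": "NOUN", "table": "NOUN",
    "figure": "NOUN", "appendix": "NOUN", "glossary": "NOUN",
    "chased": "VERB", "caught": "VERB",
}

@dataclass(frozen=True)
class Profile:
    """Hypothetical profile holding the metric definition and its threshold."""
    first_pos: str      # numerator part-of-speech
    second_pos: str     # denominator part-of-speech
    threshold: float    # predetermined model threshold

def annotate(passage):
    """Annotator: tag each token; unknown words get the placeholder tag 'X'."""
    return [(tok, LEXICON.get(tok.lower(), "X")) for tok in passage.split()]

def count_features(annotated):
    """Metric counters: number of instances of each tag."""
    return Counter(tag for _, tag in annotated)

def metric_value(counts, profile):
    """Ratio of first to second part-of-speech; infinite if the second is absent."""
    num, den = counts[profile.first_pos], counts[profile.second_pos]
    if den == 0:
        return float("inf") if num else 0.0
    return num / den

def is_nonsense(passage, profile):
    """Comparator: assumed rule that a ratio above the threshold means nonsense."""
    return metric_value(count_features(annotate(passage)), profile) > profile.threshold

def filter_passages(passages, profile, mode="remove"):
    """Either drop flagged passages ('remove') or keep them with a flag ('mark')."""
    result = []
    for text in passages:
        flagged = is_nonsense(text, profile)
        if mode == "mark" or not flagged:
            result.append({"text": text, "nonsense": flagged})
    return result

def usable_evidence(candidates, profile):
    """Question-answering variant: discard nonsense candidate evidence passages."""
    return [c for c in candidates if not is_nonsense(c, profile)]

profile = Profile(first_pos="NOUN", second_pos="VERB", threshold=3.0)
passages = ["The dog chased the ball", "index table figure appendix glossary"]
removed = filter_passages(passages, profile, mode="remove")
marked = filter_passages(passages, profile, mode="mark")
```

With these assumed values, the first passage has a noun-to-verb ratio of 2.0 and is kept, while the second is a bare list of nouns with no verb and is either dropped or marked, depending on `mode`.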

Assignees

IBM

Inventors

Classifications

  • Natural language query formulation · CPC title

  • G06F40/169 · Primary

    Annotation, e.g. comment data or footnotes · CPC title

  • Selection or weighting of terms for indexing · CPC title

  • Ontology · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9842096B2 cover?
A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter compone…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/169. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 12, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).