Multi-feature balancing for natural language processors
US-2024419910-A1 · Dec 19, 2024 · US
US9971763B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9971763-B2 |
| Application number | US-201414248113-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 8, 2014 |
| Priority date | Apr 8, 2014 |
| Publication date | May 15, 2018 |
| Grant date | May 15, 2018 |
Official abstract text for this publication.
Named entity recognition is described, for example, to detect an instance of a named entity in a web page and classify the named entity as being an organization or other predefined class. In various examples, named entity recognition results are used to augment text from which the named entity was recognized; the augmentation may comprise information retrieval results about the named entity mention. In various embodiments, labeled training sentences in many different languages and for many different classes, are obtained to train machine learning components of a multi-lingual, multi-class, named entity recognition system. In examples, labeled training sentences are obtained from at least two sources, a first source using a multi-lingual or monolingual corpus of inter-linked documents and a second source using machine translation training data. In examples, labeled training sentences from the two sources are selectively sampled for training the named entity recognition system.
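The abstract says that labeled sentences from the two sources are "selectively sampled" but does not state the sampling criterion. The following is a minimal sketch of one possible reading, assuming a greedy per-(language, class) quota. The function name `sample_balanced`, the dictionary keys, the `quota` parameter, and the class tags `ORG` and `LOC` are all hypothetical and do not come from the patent.

```python
from collections import Counter


def sample_balanced(sentences, quota):
    """Greedily keep sentences that still add an under-quota (language, class) pair.

    `sentences` is an iterable of dicts with hypothetical keys:
      "lang":   language code of the sentence
      "source": "links" (inter-linked corpus) or "mt" (machine translation data)
      "labels": list of (token, entity_class) pairs
    """
    counts = Counter()
    selected = []
    for sent in sentences:
        pairs = {(sent["lang"], cls) for _, cls in sent["labels"]}
        # Keep the sentence only if at least one of its pairs is below quota.
        if any(counts[p] < quota for p in pairs):
            selected.append(sent)
            for p in pairs:
                counts[p] += 1
    return selected


# Toy pool mixing both sources; the sentences are invented for illustration.
pool = [
    {"lang": "en", "source": "links", "labels": [("Microsoft", "ORG")]},
    {"lang": "en", "source": "links", "labels": [("Contoso", "ORG")]},
    {"lang": "en", "source": "mt", "labels": [("Seattle", "LOC")]},
    {"lang": "de", "source": "mt", "labels": [("Berlin", "LOC")]},
]
subset = sample_balanced(pool, quota=1)
```

With a quota of 1, the second English `ORG` sentence is skipped because that (language, class) pair is already covered, so the subset is smaller than the pool while still spanning both languages and both classes.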
Opening claim text (preview).
The invention claimed is:
1. A computer-implemented method comprising: labeling text items in a first set of text, in a first language, from a multi-lingual digital document corpus, with first labels indicating first named entity classes by inheriting respective class labels of corresponding documents in the corpus into anchors in the first set of text which hyperlink the corresponding documents; generating respective confidence scores for the text items in the first set of text labeled with the named entity classes, the confidence scores including weights based on features of the labeled text items and transition probabilities among respective ones of the labeled text items; applying heuristic rules to remove labels having confidence scores lower than the confidence scores of other labeled text items from the first set of text; automatically labeling text items in a second set of text of a second language from the multi-lingual digital document corpus, from parallel sentences in the first set of text, the text items in the second set of text being labeled with second labels indicating second named entity classes, the parallel sentences being pairs of sentences with a same semantic meaning in the first and second languages; selecting a subset of the parallel sentences including labeled text from the first and second sets of text, wherein, for the selected parallel sentences, the text items having disagreeing labels with lower confidence scores than the text items having agreeing labels and wherein the subset comprises fewer than a total number of the parallel sentences of the first set of labeled text and the second set of labeled text; using the selected subset of the parallel sentences to train a machine learning component; and using the trained machine learning component to label a third set of text, in a language of the multi-lingual document corpus, with one or more of the first labels or the second labels.
2. The method of claim 1, comprising selecting the subset based at least in part on coverage criteria related to diversity of the first set of labeled text and the second set of labeled text.
3. The method of claim 1, wherein selecting the subset comprises selecting the subset to increase diversity and frequencies of single-token labels within the subset, and wherein the single-token labels comprise individual words with corresponding named entity class labels.
4. The method of claim 1, further comprising: using the trained machine learning component to label the third set of text and calculate the confidence scores associated with the one or more of the first labels or the second labels; and selecting the subset of text items having higher confidence scores than other text items.
5. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels indicating the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences; and selecting one or more target language sentences where corresponding projected labels are contiguous in a target language.
6. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels indicating the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences; and selecting one or more target language sentences where all labels from a source language were successfully projected.
7. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels, with confidence levels above a threshold, indicating a minimum confidence in projecting the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences.
8. The method of claim 1, wherein labeling the first set of text in the first language from the multi-lingual document corpus comprises classifying a plurality of documents of the document corpus into entity classes using at least one multi-class classifier having been bootstrapped using seed documents and iteratively trained, the seed documents being documents labeled with pre-specified entity classes.
9. The method of claim 8, further comprising using a plurality of multi-class classifiers to classify different language documents of the multi-lingual document corpus.
10. The method of claim 8, further comprising labeling respective text of the plurality of documents by collecting one or more sentences with hyperlinks to documents classified into one or more entity classes.
11. The method of claim 8, further comprising labeling respective text of the plurality of documents by identifying named entity mentions which are not hyperlinked and classifying the identified mentions using any one or more of: title lists, aliases, redirects, or acronyms.
12. The method of claim 8, further comprising labeling respective text of the plurality of documents by disambiguating ambiguous named entity examples that refer to multiple named entity classes.
13. The method of claim 8, further comprising labeling respective text of the plurality of documents by labeling person names using a vocabulary generated from titles of categorized documents.
14. The method of claim 8, wherein applying the heuristic rules includes labeling respective text of the plurality of documents by removing uncertain labels that are identified based at least in part on a common words lexicon.
15. A computer-implemented method comprising: labeling text items in a first set of text, in a source language from a multi-lingual document corpus, with first labels indicating first named entity classes by inheriting respective class labels of corresponding documents in the corpus into anchors in the first set of text which hyperlink the corresponding documents; generating respective confidence scores for the text items in the first set of text labeled with the named entity classes, the confidence scores including weights based on features of the labeled text items and transition probabilities among respective ones of the labeled text items; applying heuristic rules to remove labels having confidence scores lower than confidence scores of other labeled text items from the first set of text; automatically labeling text items a second set of text of a parallel target language from the multi-lingual digital document corpus, from parallel sentences in the first set of text, the text items in the second set of text being labeled with second labels indicating second named entity classes by projecting at least one of the first labels indicating a named entity class of the first named entity classes in a source language sentence to a parallel target language sentence, the parallel sentences being pairs of sentences with a same semantic meaning in the source and target languages; selecting a subset of the second set of labeled text including at least one target language sentence where all labels from a source language were successfully projected wherein, for the selected subset of the second set of labeled text, the text items having disagreeing labels, relative to a the first set of labeled text in the source language, have lower confidence scores than text items having agreeing labels and wherein the subset comprises fewer than a total number o
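Claims 5 and 6 describe projecting labels from a source-language sentence to its parallel target-language sentence, then keeping only sentences where every label projected and the projected labels are contiguous. The sketch below illustrates that filter under stated assumptions: the claims do not say how projection is carried out, so a word alignment given as a source-index to target-index dictionary is assumed, and the function name `project_labels`, the span format, and the class tags `PER` and `ORG` are hypothetical.

```python
def project_labels(src_spans, alignment):
    """Project labeled source spans onto target token indices.

    src_spans: list of non-empty (start, end, cls) spans over source tokens,
               with `end` exclusive.
    alignment: dict mapping a source token index to a target token index
               (an assumed word-alignment format, not taken from the patent).

    Returns a list of (start, end, cls) target spans, or None when the
    sentence pair should be rejected.
    """
    projected = []
    for start, end, cls in src_spans:
        # Reject if any token of the span has no aligned target token
        # (one reading of "all labels ... successfully projected").
        if any(i not in alignment for i in range(start, end)):
            return None
        targets = sorted({alignment[i] for i in range(start, end)})
        # Reject if the aligned target tokens do not form one contiguous run.
        if targets != list(range(targets[0], targets[-1] + 1)):
            return None
        projected.append((targets[0], targets[-1] + 1, cls))
    return projected


# Source: "Bill Gates founded Microsoft"
# Target: "Microsoft wurde von Bill Gates gegründet"
spans = [(0, 2, "PER"), (3, 4, "ORG")]
good = project_labels(spans, {0: 3, 1: 4, 2: 5, 3: 0})
split = project_labels(spans, {0: 0, 1: 4, 3: 2})  # "Bill Gates" lands on tokens 0 and 4
missing = project_labels(spans, {0: 3, 1: 4})      # "Microsoft" has no alignment
```

Returning `None` for the whole sentence pair, rather than dropping only the failing span, mirrors the sentence-level selection in the claims: a partially labeled sentence would teach the trained component that the unprojected entity is not an entity.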
Named entity recognition · CPC title
Example-based machine translation; Alignment · CPC title
Physics · mapped topic