Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G06F40/295. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 15, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Named entity recognition

US9971763B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9971763-B2
Application number	US-201414248113-A
Country	US
Kind code	B2
Filing date	Apr 8, 2014
Priority date	Apr 8, 2014
Publication date	May 15, 2018
Grant date	May 15, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Named entity recognition is described, for example, to detect an instance of a named entity in a web page and classify the named entity as being an organization or other predefined class. In various examples, named entity recognition results are used to augment text from which the named entity was recognized; the augmentation may comprise information retrieval results about the named entity mention. In various embodiments, labeled training sentences in many different languages and for many different classes, are obtained to train machine learning components of a multi-lingual, multi-class, named entity recognition system. In examples, labeled training sentences are obtained from at least two sources, a first source using a multi-lingual or monolingual corpus of inter-linked documents and a second source using machine translation training data. In examples, labeled training sentences from the two sources are selectively sampled for training the named entity recognition system.
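The first source of labeled sentences mentioned in the abstract, a corpus of inter-linked documents, can be pictured with a minimal sketch. Per claim 1, anchor text inherits the class label of the document it hyperlinks. The wiki-style `[[Target|surface]]` anchor syntax, the `DOC_CLASS` table, and the function name below are assumptions made for illustration; they are not taken from the patent.

```python
import re

# Hypothetical document-level classes. In the patent these come from
# classifying the documents of the corpus; here they are hard-coded.
DOC_CLASS = {"Microsoft": "ORGANIZATION", "Seattle": "LOCATION"}

# Assumed anchor syntax: [[Target]] or [[Target|surface text]].
ANCHOR = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def label_sentence(sentence):
    """Return (token, label) pairs for one sentence.

    Tokens inside an anchor inherit the class of the linked document;
    every other token gets the outside label "O".
    """
    labeled = []
    pos = 0
    for match in ANCHOR.finditer(sentence):
        # Plain text before the anchor carries no entity label.
        for token in sentence[pos:match.start()].split():
            labeled.append((token, "O"))
        target = match.group(1)
        surface = match.group(2) or target
        label = DOC_CLASS.get(target, "O")
        for token in surface.split():
            labeled.append((token, label))
        pos = match.end()
    # Remaining plain text after the last anchor.
    for token in sentence[pos:].split():
        labeled.append((token, "O"))
    return labeled


if __name__ == "__main__":
    print(label_sentence("[[Microsoft]] is based near [[Seattle|the Seattle area]] ."))
```

Whitespace tokenization is used only to keep the sketch short; the claims additionally describe confidence scoring and heuristic filtering of the inherited labels, which this example omits.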

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising:
labeling text items in a first set of text, in a first language, from a multi-lingual digital document corpus, with first labels indicating first named entity classes by inheriting respective class labels of corresponding documents in the corpus into anchors in the first set of text which hyperlink the corresponding documents;
generating respective confidence scores for the text items in the first set of text labeled with the named entity classes, the confidence scores including weights based on features of the labeled text items and transition probabilities among respective ones of the labeled text items;
applying heuristic rules to remove labels having confidence scores lower than the confidence scores of other labeled text items from the first set of text;
automatically labeling text items in a second set of text of a second language from the multi-lingual digital document corpus, from parallel sentences in the first set of text, the text items in the second set of text being labeled with second labels indicating second named entity classes, the parallel sentences being pairs of sentences with a same semantic meaning in the first and second languages;
selecting a subset of the parallel sentences including labeled text from the first and second sets of text, wherein, for the selected parallel sentences, the text items having disagreeing labels with lower confidence scores than the text items having agreeing labels and wherein the subset comprises fewer than a total number of the parallel sentences of the first set of labeled text and the second set of labeled text;
using the selected subset of the parallel sentences to train a machine learning component; and
using the trained machine learning component to label a third set of text, in a language of the multi-lingual document corpus, with one or more of the first labels or the second labels.

2. The method of claim 1, comprising selecting the subset based at least in part on coverage criteria related to diversity of the first set of labeled text and the second set of labeled text.

3. The method of claim 1, wherein selecting the subset comprises selecting the subset to increase diversity and frequencies of single-token labels within the subset, and wherein the single-token labels comprise individual words with corresponding named entity class labels.

4. The method of claim 1, further comprising: using the trained machine learning component to label the third set of text and calculate the confidence scores associated with the one or more of the first labels or the second labels; and selecting the subset of text items having higher confidence scores than other text items.

5. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels indicating the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences; and selecting one or more target language sentences where corresponding projected labels are contiguous in a target language.

6. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels indicating the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences; and selecting one or more target language sentences where all labels from a source language were successfully projected.

7. The method of claim 1, further comprising: automatically labeling the second set of text from the parallel sentences for each pair of sentences by projecting the second labels, with confidence levels above a threshold, indicating a minimum confidence in projecting the second named entity classes in a source language sentence of the pair of sentences to a parallel target language sentence of the pair of sentences.

8. The method of claim 1, wherein labeling the first set of text in the first language from the multi-lingual document corpus comprises classifying a plurality of documents of the document corpus into entity classes using at least one multi-class classifier having been bootstrapped using seed documents and iteratively trained, the seed documents being documents labeled with pre-specified entity classes.

9. The method of claim 8, further comprising using a plurality of multi-class classifiers to classify different language documents of the multi-lingual document corpus.

10. The method of claim 8, further comprising labeling respective text of the plurality of documents by collecting one or more sentences with hyperlinks to documents classified into one or more entity classes.

11. The method of claim 8, further comprising labeling respective text of the plurality of documents by identifying named entity mentions which are not hyperlinked and classifying the identified mentions using any one or more of: title lists, aliases, redirects, or acronyms.

12. The method of claim 8, further comprising labeling respective text of the plurality of documents by disambiguating ambiguous named entity examples that refer to multiple named entity classes.

13. The method of claim 8, further comprising labeling respective text of the plurality of documents by labeling person names using a vocabulary generated from titles of categorized documents.

14. The method of claim 8, wherein applying the heuristic rules includes labeling respective text of the plurality of documents by removing uncertain labels that are identified based at least in part on a common words lexicon.

15. A computer-implemented method comprising:
labeling text items in a first set of text, in a source language from a multi-lingual document corpus, with first labels indicating first named entity classes by inheriting respective class labels of corresponding documents in the corpus into anchors in the first set of text which hyperlink the corresponding documents;
generating respective confidence scores for the text items in the first set of text labeled with the named entity classes, the confidence scores including weights based on features of the labeled text items and transition probabilities among respective ones of the labeled text items;
applying heuristic rules to remove labels having confidence scores lower than confidence scores of other labeled text items from the first set of text;
automatically labeling text items a second set of text of a parallel target language from the multi-lingual digital document corpus, from parallel sentences in the first set of text, the text items in the second set of text being labeled with second labels indicating second named entity classes by projecting at least one of the first labels indicating a named entity class of the first named entity classes in a source language sentence to a parallel target language sentence, the parallel sentences being pairs of sentences with a same semantic meaning in the source and target languages;
selecting a subset of the second set of labeled text including at least one target language sentence where all labels from a source language were successfully projected wherein, for the selected subset of the second set of labeled text, the text items having disagreeing labels, relative to a the first set of labeled text in the source language, have lower confidence scores than text items having agreeing labels and wherein the subset comprises fewer than a total number o…
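Claims 5 and 6 describe projecting labels from a source-language sentence onto its parallel target-language sentence, then keeping only sentences where the projected labels are contiguous and where every source label was projected. The sketch below is one way to picture that filter. It assumes token-level word alignments are already available as a dictionary from source index to target indices; that format, the function names, and the treatment of adjacent same-label tokens as one entity are all assumptions for illustration, not the patent's stated method.

```python
from itertools import groupby


def source_spans(src_labels):
    """Yield (label, source_indices) for each run of identical non-"O" labels."""
    start = 0
    for label, run in groupby(src_labels):
        length = len(list(run))
        if label != "O":
            yield label, list(range(start, start + length))
        start += length


def project_labels(src_labels, alignment, tgt_len):
    """Project entity labels onto the target sentence.

    alignment: dict mapping a source token index to a list of target token
    indices (hypothetical format). Returns (tgt_labels, keep), where keep is
    True only if every labeled source token was aligned and each projected
    entity covers a contiguous range of target tokens.
    """
    tgt_labels = ["O"] * tgt_len
    keep = True
    for label, src_idx in source_spans(src_labels):
        # A labeled source token with no alignment means the label
        # could not be fully projected.
        if any(not alignment.get(i) for i in src_idx):
            keep = False
        tgt_idx = sorted({j for i in src_idx for j in alignment.get(i, [])})
        if not tgt_idx:
            continue
        # Contiguity check: the covered target indices form one unbroken range.
        if tgt_idx[-1] - tgt_idx[0] + 1 != len(tgt_idx):
            keep = False
        for j in tgt_idx:
            tgt_labels[j] = label
    return tgt_labels, keep


if __name__ == "__main__":
    src = ["ORG", "ORG", "O", "LOC"]
    print(project_labels(src, {0: [2], 1: [3], 2: [0], 3: [1]}, 4))
```

A sentence pair would be kept for training only when the second return value is true. The confidence threshold on projections described in claim 7 is not modeled here.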

Assignees

Microsoft Technology Licensing, LLC

Classifications

G06F40/295 · Primary
Named entity recognition · CPC title
G06F40/45
Example-based machine translation; Alignment · CPC title
G06F17/278 · Primary
Physics · mapped topic
G06F17/2827
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 54209891
