Who is the assignee on this patent?

Microsoft Corp, Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G06F40/242. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 28, 2017 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Active labeling for computer-human interactive learning

US9582490B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9582490-B2
Application number	US-201314075690-A
Country	US
Kind code	B2
Filing date	Nov 8, 2013
Priority date	Jul 12, 2013
Publication date	Feb 28, 2017
Grant date	Feb 28, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A collection of data that is extremely large can be difficult to search and/or analyze. Relevance may be dramatically improved by automatically classifying queries and web pages in useful categories, and using these classification scores as relevance features. A thorough approach may require building a large number of classifiers, corresponding to the various types of information, activities, and products. Creation of classifiers and schematizers is provided on large data sets. Exercising the classifiers and schematizers on hundreds of millions of items may expose value that is inherent to the data by adding usable meta-data. Some aspects include active labeling exploration, automatic regularization and cold start, scaling with the number of items and the number of classifiers, active featuring, and segmentation and schematization.

First claim

Opening claim text (preview).

The invention claimed is:

1. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: generating a first set of data items; alternating between repetitions of labeling sampled data items and repetitions of labeling searched data items; wherein labeling sampled data items includes A) providing a first set of data items, wherein each data item is scored with a probability of being a positive example of a particular class of data items, B) presenting a second set of one or more data items on a user interface, wherein the second set of data items is generated by sampling, from the first set of data items, one or more data items having scores lying within a range of scores selected to a) optimize precision of the classifier, wherein the scores selected to optimize precision of the classifier are greater than 0.5 on a scale of zero to one, or b) optimize recall of the classifier, C) receiving, via the user interface, one or more user-provided labels that identify one or more data items in the second set as positive or negative examples of the particular class of data items, and D) training the classifier based on the user-provided labels in the second set of data items; and wherein labeling searched data items includes A) presenting, on the user interface, a third set of one or more data items selected from the first set of data items based on a user-provided search query, B) receiving, via the user interface, one or more user-provided labels that identify one or more data items in the third set as positive or negative examples of the particular class of data items; and C) training the classifier based on the user-provided labels in the third set of data items.

2. The one or more hardware memory of claim 1, wherein training the classifier occurs automatically and is transparent to a user.

3. The one or more hardware memory of claim 1, wherein subsequent to training the classifier each data item in the first set of data items is rescored with a probability of being a positive example of the particular class of data items.

4. The one or more hardware memory of claim 1, wherein the scores selected to optimize precision of the classifier lie within a distribution around a probability of 0.75 on a scale of zero to one.

5. The one or more hardware memory of claim 1, wherein the scores selected to optimize recall of the classifier are less than 0.5 on a scale of zero to one.

6. The one or more hardware memory of claim 5, wherein the scores selected to optimize recall of the classifier lie within a distribution around a probability of 0.25 on a scale of zero to one.

7. The one or more hardware memory of claim 1, wherein the first set of data items includes data items randomly sampled from a larger set of data items.

8. The one or more hardware memory of claim 7, wherein prior to scoring the first set of data items with the classifier, the classifier is pre-trained based on data items that are results of a user query.

9. The one or more hardware memory of claim 7, wherein the data items randomly sampled from the larger set of data items are initially unlabeled.

10. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: providing a working set of data items; training a classifier using a predefined set of positive or negative examples of a particular class of data items, wherein the classifier is trained to score a data item with a probability of being an example of the particular class of data item; scoring the working set of data items via the classifier; generating a sampled set of data items by sampling the working set of data items within a selected range of data item scores, wherein the selected range of data item scores is selected to optimize either precision of the classifier or recall of the classifier; presenting the sampled set of data items within a user interface; receiving, via the user interface, one or more user-provided labels for one or more of the presented data items, wherein the one or more user-provided labels identify the one or more of the presented data items as positive or negative examples of the class of data items; retraining the classifier based on the one or more user-provided labels; and repeating steps of presenting data items, receiving user-provided labels, and retraining the classifier, wherein the presenting data items alternates between presenting sampled sets of data items and presenting data items based on results of a user-provided search query, and wherein the selected range of data item scores is alternated between the range of data item scores selected to optimize the precision of the classifier and the range of data item scores selected to optimize the recall of the classifier.

11. The one or more hardware memory of claim 10, wherein subsequent to retraining the classifier, the working set of data items is scored with the retrained classifier.

12. The one or more hardware memory of claim 10, wherein the predefined set of positive or negative examples is based on search results of a search engine.

13. The one or more hardware memory of claim 10, wherein the range of data item scores selected to optimize the precision of the classifier is within a range of scores having a probability of greater than 0.5 of being positive examples on a scale of zero to one.

14. The one or more hardware memory of claim 13, wherein the range of data item scores selected to optimize the recall of the classifier is within a range of scores having a probability of less than 0.5 of being positive examples on a scale of zero to one.

15. One or more hardware memory having embodied thereon computer-usable instructions that, when executed, facilitate a method of interactively labeling training data for machine learning, the method comprising: generating a first set of data items; identifying data items, from within the first set of data items, as belonging to a particular class of data items, based on a user search query; training a classifier based at least on the data items that were identified as belonging to the particular class of data items; scoring the first set of data items with the classifier, wherein the classifier scores each data item with a probability of being a positive example of the particular class of data items; selecting, from the first set of data items, a second set of one or more data items based on the scoring, wherein the one or more data items in the second set are selected A) to improve precision of the classifier, wherein the selected one or more data items have scores that lie within a distribution around a probability of 0.75 on a scale of zero to one of being a positive example of the particular class of data items, or B) to improve recall of the classifier, wherein the selected one or more data items have scores that lie within a distribution around a probability of 0.25 on a scale of zero to one of being a positive example of the particular class of data items; presenting the second set of one or more data items on a user interface; receiving, via the user interface, one or more user-provided labels that identify one or more of the data items in the second set as positive or negative examples of the particular class of data items, wherein the one or more user-provided labels are received by means of one or more of a text entry field, a checkbo
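Illustrative sketch (not part of the patent text). The claims describe choosing which items to show a labeler by sampling from a score range: scores around 0.75 when the aim is precision, scores around 0.25 when the aim is recall, with the two ranges alternated between rounds. A minimal Python sketch of that selection step follows. The function names, the uniform band of plus or minus 0.1 around each center, and the even/odd alternation rule are assumptions made for illustration; the claims speak only of "a distribution around" each probability and do not prescribe an implementation.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Centers taken from the claim text; the band width below is an assumption.
PRECISION_CENTER = 0.75
RECALL_CENTER = 0.25


def sample_by_score_band(
    scored_items: Sequence[Tuple[str, float]],
    center: float,
    width: float = 0.1,  # hypothetical half-width of the sampling band
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Return up to k items whose score lies within center +/- width."""
    rng = rng or random.Random()
    band = [(item, score) for item, score in scored_items
            if abs(score - center) <= width]
    return rng.sample(band, min(k, len(band)))


def next_batch(
    scored_items: Sequence[Tuple[str, float]],
    round_index: int,
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Alternate between the precision band and the recall band.

    Even rounds sample near 0.75, odd rounds near 0.25 (an assumed schedule).
    """
    center = PRECISION_CENTER if round_index % 2 == 0 else RECALL_CENTER
    return sample_by_score_band(scored_items, center, k=k, rng=rng)
```

In a full labeling loop, each batch returned by `next_batch` would be shown to the user, the resulting labels would be used to retrain the classifier, and the working set would be rescored before the next round. Rounds of labeling items returned by a user search query, which the claims alternate with sampled rounds, are left out of this sketch.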

Classifications

Code	CPC title
G06N7/01	Probabilistic graphical models, e.g. probabilistic networks
H04L1/0079	Formats for control data (H04L1/16 takes precedence; training sequences H04L25/00 and H04L27/00)
G06F40/242 (Primary)	Dictionaries
G06F16/951	Indexing; Web crawling techniques
G06F40/30	Semantic analysis

Patent family

Related publications grouped by family.

View patent family 52277796
