Systems and methods for labeling source data using confidence labels

US9704106B2 · US · B2

Patent metadata
Publication number: US-9704106-B2
Application number: US-201615166609-A
Country: US
Kind code: B2
Filing date: May 27, 2016
Priority date: Jun 22, 2012
Publication date: Jul 11, 2017
Grant date: Jul 11, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for the annotation of source data using confidence labels in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a method for determining confidence labels for crowdsourced annotations includes obtaining a set of source data, obtaining a set of training data representative of the set of source data, determining the ground truth for each piece of training data, obtaining a set of training data annotations including a confidence label, measuring annotator accuracy data for at least one piece of training data, and automatically generating a set of confidence labels for the set of unlabeled data based on the measured annotator accuracy data and the set of annotator labels used.
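The flow described in the abstract can be illustrated with a minimal Python sketch. This is not the patented method: the function names, the tuple layout `(annotator, item, label, confidence)`, the Laplace smoothing, and the log-odds weighted vote are all assumptions chosen for illustration. The sketch estimates, from training data with known ground truth, how often each annotator is correct at each confidence label, then uses those estimates to weight votes on unlabeled source data.

```python
import math
from collections import defaultdict


def accuracy_by_confidence(training_annotations, ground_truth, smoothing=1.0):
    """Estimate P(correct) for each (annotator, confidence label) pair.

    training_annotations: iterable of (annotator, item, label, confidence)
    ground_truth: dict mapping training item -> true label
    All names here are hypothetical, not taken from the patent.
    """
    correct = defaultdict(float)
    total = defaultdict(float)
    for annotator, item, label, confidence in training_annotations:
        key = (annotator, confidence)
        total[key] += 1
        if label == ground_truth[item]:
            correct[key] += 1
    # Laplace smoothing keeps every estimate strictly between 0 and 1,
    # so the log-odds weight below is always finite.
    return {
        key: (correct[key] + smoothing) / (total[key] + 2 * smoothing)
        for key in total
    }


def aggregate_labels(source_annotations, accuracy, default_accuracy=0.5):
    """Pick one label per source item by a log-odds weighted vote.

    A vote from an annotator who is usually right at the stated confidence
    counts for more; a vote at chance level (0.5) counts for nothing; a vote
    from an annotator who is usually wrong counts against the label.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for annotator, item, label, confidence in source_annotations:
        p = accuracy.get((annotator, confidence), default_accuracy)
        scores[item][label] += math.log(p / (1 - p))
    return {item: max(votes, key=votes.get) for item, votes in scores.items()}
```

With this weighting, a single high-confidence vote from an annotator who was reliable on the training data can outweigh several votes from annotators who were at or below chance, which a plain majority vote cannot do. The log-odds weight is one common heuristic for binary labels and is only an approximation for tasks with more than two labels.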

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining labels for crowdsourced annotations, comprising: obtaining a set of source data using a distributed data annotation server system comprising a processor and a memory readable by the processor, where the source data comprises a set of unlabeled data; obtaining a set of training data using the distributed data annotation server system, where the set of training data comprises a subset of the source data representative of the set of source data; determining ground truth data describing the ground truth for each piece of training data in the set of training data using the distributed data annotation server system, where the ground truth data for a piece of training data describes the content of the piece of data; providing the sets of annotator data to a plurality of data annotation devices, wherein a data annotation device: obtains the set of annotator data; generates: a set of annotation data based on the obtained set of annotator data; and wherein the set of annotation data comprises at least one source data annotation applied to a piece of source data and one training data annotation applied to a piece of training data, where a training data annotation comprises data describing the piece of training data and a confidence label selected from a set of confidence labels describing a measure of confidence in the accuracy of the data describing the piece of training data; and transmits the set of annotation data; obtaining the sets of annotation data from the plurality of data annotation devices using the distributed data annotation server system; calculating annotator accuracy data for each data annotation device for at least one piece of training data in the set of training data based on the ground truth for each piece of training source data and the set of training data annotations using the distributed data annotation server system, wherein the annotator accuracy data describes the accuracy of annotation data provided by a particular data annotation device based on the accuracy and confidence indicated for one or more pieces of training data provided to the particular data annotation device; and automatically generating a set of labels for each piece of unlabeled data in the set of source data based on the calculated annotator accuracy data and the set of annotator labels received from the plurality of data annotation devices using the distributed data annotation server system.

2. The method of claim 1, further comprising, wherein the annotating one or more pieces of source data based on the set of labels using the distributed data annotation server system.

3. The method of claim 1, wherein at least one of the plurality of data annotation devices is implemented using the distributed data annotation server system.

4. The method of claim 1, wherein at least one of the plurality of data annotation devices is implemented using human intelligence tasks.

5. The method of claim 1, further comprising determining the number of confidence labels to provide in the sets of annotator data based on the set of training data using the distributed data annotation server system.

6. The method of claim 5, wherein the number of confidence labels is determined by calculating the number of confidence labels that maximizes the amount of information obtained from the confidence labels using the distributed data annotation server system.

7. The method of claim 1, further comprising determining a set of labeling tasks using the distributed data annotation server system, where a labeling tasks instructs an annotator to provide a label describing at least one feature of a piece of source data.

8. The method of claim 1, further comprising determining rewards based on the annotator accuracy data using the distributed data annotation server system.

9. The method of claim 8, wherein the rewards are determined by calculating a reward matrix using the distributed data annotation server system, where the reward matrix specifies a reward to be awarded to a particular confidence label based on the ground truth of the piece of source data that is targeted by the confidence label.

10. The method of claim 8, wherein the reward for annotating a piece of source data is based on the difficulty of the piece of source data, where the difficulty of a piece of source data is determined based on a set of annotations provided for the source data and a ground truth value associated with the piece of source data.

11. The method of claim 8, further comprising: generating labeling threshold data based on the training data annotations and the calculated annotator accuracy using the distributed data annotation server system, where the labeling threshold data provides guidance to a data annotation device regarding the meaning of one or more confidence labels in the set of confidence labels; providing the labeling threshold data along with the set of training data to a data annotation device using the distributed data annotation server system; and generating feedback based on annotations provided by the data annotation device based on the labeling threshold data and the set of training data using the distributed data annotation server system, where the feedback directs the data annotation device to utilize the labeling threshold data in the annotation of source data.

12. The method of claim 1, wherein each confidence label in the set of confidence labels comprises a confidence interval identified based on the calculated annotator accuracy data and the distribution of the set of annotator labels within the pieces of training data in the set of training data.

13. A distributed data annotation server system, comprising: a processor; and a memory connected to the processor and storing a data annotation application; wherein the data annotation application directs the processor to: obtain a set of source data, where the source data comprises a set of unlabeled data; obtain a set of training data, where the set of training data comprises a subset of the source data representative of the set of source data; determine ground truth data describing the ground truth for each piece of training data in the set of training data, where the ground truth data for a piece of training data describes the content of the piece of training data; generate sets of annotator data based on the set of source data and the set of training data, where a set of annotator data comprises at least one piece of source data selected from the set of source data and at least one piece of training data selected from the set of training data; provide the sets of annotator data to a plurality of data annotation devices; obtain sets of annotation data from the plurality of data annotation devices, where a set of annotation data comprises at least one source data annotation applied to a piece of source data and one training data annotation applied to a piece of training data, where a training data annotation comprises data describing the piece of training data and a confidence label selected from a set of confidence labels describing a measure of confidence in the accuracy of the data describing the piece of training data; calculate annotator accuracy data for each data annotation device for at least one piece of training data in the set of training data based on the ground truth for each piece of training source data and the set of training data annotations; and automatically generate a set of labels for each piece of unlabeled data in the set of source data based on the calculated annotator accuracy data and the set of annotator labels received from the plurality of data annot
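The reward matrix of claim 9 can be pictured with a small Python sketch. Everything concrete here is an assumption made for illustration: the four confidence label names, the binary yes/no ground truth, the reward values, and the helper name `total_reward` do not come from the claim text, which only says that the matrix assigns a reward to a confidence label given the ground truth of the targeted piece of data.

```python
# Hypothetical reward matrix for a binary task.
# Outer key: confidence label chosen by the annotator.
# Inner key: ground truth of the annotated piece of data.
# Values are illustrative; confident mistakes are penalized more heavily
# than hesitant ones so that overstating confidence does not pay off.
REWARD_MATRIX = {
    "sure_yes":  {"yes": 2.0,  "no": -4.0},
    "maybe_yes": {"yes": 1.0,  "no": -1.0},
    "maybe_no":  {"yes": -1.0, "no": 1.0},
    "sure_no":   {"yes": -4.0, "no": 2.0},
}


def total_reward(annotations, ground_truth, reward_matrix=REWARD_MATRIX):
    """Sum the rewards for (item, confidence_label) annotations.

    Only items with known ground truth can be scored, so every item in
    `annotations` must appear in `ground_truth`.
    """
    return sum(
        reward_matrix[confidence][ground_truth[item]]
        for item, confidence in annotations
    )
```

For example, an annotator who answers `sure_yes` on a "yes" item, `maybe_no` on a "yes" item, and `sure_no` on a "no" item would receive 2.0 - 1.0 + 2.0 = 3.0 under these illustrative values.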

Assignees

Inventors

Classifications

  • Machine learning · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • the supervisor being an automated module, e.g. intelligent oracle · CPC title

  • Interactive pattern learning with a human teacher · CPC title

  • using data annotations, e.g. user-defined metadata · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9704106B2 cover?
Systems and methods for the annotation of source data using confidence labels in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a method for determining confidence labels for crowdsourced annotations includes obtaining a set of source data, obtaining a set of training data representative of the set of source data, determining the ground truth for each…
Who is the assignee on this patent?
California Inst Of Techn
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 11, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).