Annotation data determination method and apparatus, and readable medium and electronic device

US12405987B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12405987-B2
Application number  US-202218552781-A
Country             US
Kind code           B2
Filing date         Mar 17, 2022
Priority date       Mar 31, 2021
Publication date    Sep 2, 2025
Grant date          Sep 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to an annotation data determination method and apparatus, and a readable medium and an electronic device. By means of the present disclosure, high-quality data to be annotated is obtained for model performance evaluation. The method includes: acquiring candidate data from a candidate data set; respectively inputting the candidate data into a first text recognition model and a second text recognition model, so as to obtain a first recognition result output by the first text recognition model and a second recognition result output by the second text recognition model, wherein both the first text recognition model and the second text recognition model can recognize whether text data is of a target category; according to the first recognition result and the second recognition result, determining whether the candidate data meets an annotation condition, wherein the annotation condition is the category of the candidate data being recognized by the first text recognition model or the second text recognition model as at least one target category among target categories; and if it is determined that the candidate data meets the annotation condition, determining the candidate data as text data to be annotated.
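The selection rule in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Recognizer` type, the function `select_for_annotation`, and the two keyword-based stand-in "models" are hypothetical and do not come from the patent, which does not specify how the text recognition models are built.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a text recognition model: returns True when the
# text is recognized as belonging to the target category.
Recognizer = Callable[[str], bool]


def select_for_annotation(
    candidates: Iterable[str],
    first_model: Recognizer,
    second_model: Recognizer,
) -> List[str]:
    """Keep every candidate that at least one model assigns to the target category."""
    to_annotate = []
    for text in candidates:
        first_result = first_model(text)
        second_result = second_model(text)
        # Annotation condition: either recognition result is positive.
        if first_result or second_result:
            to_annotate.append(text)
    return to_annotate


# Toy keyword matchers used only to make the sketch runnable (assumptions).
def keyword_model_a(text: str) -> bool:
    return "refund" in text.lower()


def keyword_model_b(text: str) -> bool:
    return "money back" in text.lower()


candidates = ["I want a refund", "Give me my money back", "Nice weather today"]
selected = select_for_annotation(candidates, keyword_model_a, keyword_model_b)
print(selected)
```

Taking the union of the two models' positives means a text that only one model flags is still sent for annotation, which is what makes the resulting set useful for comparing the two models.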

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining labeled data, comprising steps of: obtaining candidate data from a candidate data set, wherein the candidate data set is a set constituted by a plurality of unlabeled text data; inputting the candidate data into a first text recognition model and a second text recognition model respectively to obtain a first recognition result output from the first text recognition model and a second recognition result output from the second text recognition model, wherein the first text recognition model and the second text recognition model are both capable of recognizing whether text data belongs to a target category; determining whether the candidate data meets a labeling condition according to the first recognition result and the second recognition result, wherein the labeling condition is that the candidate data is recognized by at least one of the first text recognition model or the second text recognition model as belonging to the target category; determining the candidate data as text data needing to be labeled if it is determined that the candidate data meets the labeling condition; and determining the candidate data as text data not needing to be labeled if it is determined that the candidate data does not meet the labeling condition, wherein the first recognition result is a first score output from the first text recognition model for the candidate data, the second recognition result is a second score output from the second text recognition model for the candidate data, the determining whether the candidate data meets the labeling condition according to the first recognition result and the second recognition result comprises: determining that the candidate data meets the labeling condition if the first score is greater than or equal to a score threshold, or if the second score is greater than or equal to the score threshold, and wherein the score threshold is determined by the following steps: determining whether the text data meets the labeling condition for each text data in the candidate data set according to the first text recognition model, the second text recognition model and a target score used this time; increasing the target score if a number of text data in the candidate data set that meets the labeling condition is greater than a maximum sampling number; performing the step of determining whether the text data meets the labeling condition for each text data in the candidate data set again based on the increased target score and determining whether a number of text data in the candidate data set that meets the labeling condition is greater than the maximum sampling number; and determining the increased target score as the score threshold if the number of text data in the candidate data set that meets the labeling condition is less than or equal to the maximum sampling number.

2. The method for determining labeled data according to claim 1, wherein the first recognition result and the second recognition result are both configured to indicate whether the candidate data belongs to the target category, and the determining whether the candidate data meets the labeling condition according to the first recognition result and the second recognition result comprises: determining that the candidate data meets the labeling condition if the first identification result indicates that the candidate data belongs to the target category, or if the second identification result indicates that the candidate data belongs to the target category.

3. The method for determining labeled data according to claim 1, wherein the determining whether the text data meets the labeling condition for each text data in the candidate data set according to the first text recognition model, the second text recognition model and the target score used this time comprises: inputting the text data into the first text recognition model and the second text recognition model to obtain a third score output from the first text recognition model and a fourth score output from the second text model; and determining that the text data meets the labeling condition if the third score is greater than or equal to the target score, or if the fourth score is greater than or equal to the target score.

4. The method for determining labeled data according to claim 1, further comprising, after the determining the candidate data as the text data to be labeled: repeating the steps until any of following two conditions is satisfied: all text data in the candidate data set are traversed; or a number of text data to be labeled reaches a preset sampling number.

5. The method for determining labeled data according to claim 1, further comprising: obtaining labeling information for the text data to be labeled; labeling the text data to be labeled by using the labeling information to obtain labeled data; and adding the labeled data to an evaluation data set for performing model evaluation on the first text recognition model and the second text recognition model.

6. A non-transitory computer-readable medium having a computer program stored thereon that, when executed by a processing device, implements a method for determining labeled data, comprising: obtaining candidate data from a candidate data set, wherein the candidate data set is a set constituted by a plurality of unlabeled text data; inputting the candidate data into a first text recognition model and a second text recognition model respectively to obtain a first recognition result output from the first text recognition model and a second recognition result output from the second text recognition model, wherein the first text recognition model and the second text recognition model are both capable of recognizing whether text data belongs to a target category; determining whether the candidate data meets a labeling condition according to the first recognition result and the second recognition result, wherein the labeling condition is that the candidate data is recognized by at least one of the first text recognition model or the second text recognition model as belonging to the target category; determining the candidate data as text data needing to be labeled if it is determined that the candidate data meets the labeling condition; and determining the candidate data as text data not needing to be labeled if it is determined that the candidate data does not meet the labeling condition, wherein the first recognition result is a first score output from the first text recognition model for the candidate data, the second recognition result is a second score output from the second text recognition model for the candidate data, and the computer program implements following steps: determining that the candidate data meets the labeling condition if the first score is greater than or equal to a score threshold, or if the second score is greater than or equal to the score threshold, and wherein the score threshold is determined by the following steps: determining whether the text data meets the labeling condition for each text data in the candidate data set according to the first text recognition model, the second text recognition model and a target score used this time; increasing the target score if a number of text data in the candidate data set that meets the labeling condition is greater than a maximum sampling number, performing the determining whether the text data meets the labeling condition for each text data in the candidate data set, the second text recognition model and a target score used this time again based on the increased target score, and determining whether a number of text data in the candidate data set that meets the labeling condition is greater than the maximum sampling number; and determining the increase
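The threshold search in claim 1 and the stopping rule in claim 4 can be illustrated with a short Python sketch. It is a reading of the claim language under stated assumptions, not the patented implementation: the names `Scorer`, `determine_score_threshold`, `select_texts_to_label`, the fixed increment `step`, and the toy score tables are all hypothetical. The claims do not say by how much the target score is increased, nor what happens when the initial target score already satisfies the limit; here the initial score is simply returned in that case. Scores are computed once and reused across iterations, which is an implementation choice rather than something the claims require.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: maps a text to a score for the target category.
Scorer = Callable[[str], float]


def meets_labeling_condition(first_score: float, second_score: float,
                             target_score: float) -> bool:
    """Condition from claim 1: either score reaches the target score."""
    return first_score >= target_score or second_score >= target_score


def determine_score_threshold(texts: Sequence[str], first_model: Scorer,
                              second_model: Scorer, initial_score: float,
                              step: float, max_sampling_number: int) -> float:
    """Raise the target score until at most max_sampling_number texts qualify."""
    if step <= 0 or max_sampling_number < 0:
        raise ValueError("step must be positive and max_sampling_number non-negative")
    # Assumption: score each text once and reuse the scores in every pass.
    scores = [(first_model(t), second_model(t)) for t in texts]
    target = initial_score
    while True:
        count = sum(meets_labeling_condition(a, b, target) for a, b in scores)
        if count <= max_sampling_number:
            return target
        target += step  # assumed fixed increment; the claims leave this open


def select_texts_to_label(texts: Sequence[str], first_model: Scorer,
                          second_model: Scorer, score_threshold: float,
                          preset_sampling_number: int) -> List[str]:
    """Stop when all texts are traversed or the preset number is reached (claim 4)."""
    selected: List[str] = []
    for text in texts:
        if len(selected) >= preset_sampling_number:
            break
        if meets_labeling_condition(first_model(text), second_model(text),
                                    score_threshold):
            selected.append(text)
    return selected


# Toy score tables standing in for two trained models (assumptions).
first_scores = {"t1": 0.9, "t2": 0.6, "t3": 0.2, "t4": 0.55}
second_scores = {"t1": 0.3, "t2": 0.8, "t3": 0.1, "t4": 0.4}
texts = list(first_scores)

threshold = determine_score_threshold(
    texts, first_scores.__getitem__, second_scores.__getitem__,
    initial_score=0.5, step=0.25, max_sampling_number=2)
selected = select_texts_to_label(
    texts, first_scores.__getitem__, second_scores.__getitem__,
    threshold, preset_sampling_number=2)
print(threshold, selected)
```

With these toy scores, three texts qualify at a target score of 0.5, which exceeds the maximum of two, so the target is raised once to 0.75, where only `t1` and `t2` qualify.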

Assignees

Beijing Bytedance Network Tech Co Ltd

Inventors

Classifications

  • Filtering based on additional data, e.g. user or group profiles (filtering in web context G06F16/9535, G06F16/9536) · CPC title

  • G06F16/353 (Primary)

    into predefined classes · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12405987B2 cover?
The present disclosure relates to an annotation data determination method and apparatus, and a readable medium and an electronic device. By means of the present disclosure, high-quality data to be annotated is obtained for model performance evaluation. The method includes: acquiring candidate data from a candidate data set; respectively inputting the candidate data into a first text recognition…
Who is the assignee on this patent?
Beijing Bytedance Network Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/353. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 2, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).