Data labeling method based on artificial intelligence, apparatus and storage medium

US12283085B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12283085-B2
Application number  US-202217902323-A
Country             US
Kind code           B2
Filing date         Sep 2, 2022
Priority date       Mar 31, 2022
Publication date    Apr 22, 2025
Grant date          Apr 22, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a data labeling method based on artificial intelligence, an apparatus, and a storage medium relating to the field of artificial intelligence, particularly data labeling, image recognition, and natural language processing. The method includes: determining a plurality of samples involved in clustering; performing a plurality of following operations circularly to realize iterative processing, until a convergence condition is satisfied or a quantity of iterations reaches a number threshold, comprising: pre-clustering the plurality of samples according to a vector representation of the respective samples to obtain a plurality of class clusters, each class cluster containing at least one sample; receiving labeling information for the respective class clusters and re-determining the plurality of samples according to the labeling information; and determining a clustering result according to the labeling information for the respective class clusters.
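The loop in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `pre_cluster` is a placeholder that simply chunks samples by their first vector coordinate, `make_simulated_labeler` stands in for a human annotator, and the stopping rule (all remaining samples fit in one class cluster) is an assumed convergence condition. Following claim 1, only the representative of each sub-cluster is carried into the next iteration, and non-representatives inherit the sub-cluster of their representative. For brevity the sketch also accepts sub-clusters that contain only a representative. All function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
# A labeler receives one class cluster (a list of sample ids) and returns its
# sub-clusters. By convention in this sketch, the first id of each sub-cluster
# is the representative sample chosen by the annotator.
Labeler = Callable[[List[int]], List[List[int]]]


def pre_cluster(ids: List[int], vectors: Dict[int, Vector], max_size: int) -> List[List[int]]:
    """Placeholder pre-clustering (assumption): order by first coordinate, cut into chunks."""
    ordered = sorted(ids, key=lambda i: (vectors[i][0], i))
    return [ordered[k:k + max_size] for k in range(0, len(ordered), max_size)]


def make_simulated_labeler(truth: Dict[int, str]) -> Labeler:
    """Stand-in for a human annotator: groups a class cluster by a known label."""
    def labeler(cluster: List[int]) -> List[List[int]]:
        groups: Dict[str, List[int]] = {}
        for i in cluster:
            groups.setdefault(truth[i], []).append(i)
        return list(groups.values())
    return labeler


def iterative_labeling(vectors: Dict[int, Vector], labeler: Labeler,
                       max_size: int, max_iterations: int) -> List[List[int]]:
    parent = {i: i for i in vectors}   # each sample points at its representative
    active = sorted(vectors)           # samples involved in clustering this round
    for _ in range(max_iterations):    # the "number threshold" on iterations
        clusters = pre_cluster(active, vectors, max_size)
        representatives: List[int] = []
        for cluster in clusters:
            for sub_cluster in labeler(cluster):
                rep, *others = sub_cluster
                representatives.append(rep)
                for member in others:
                    parent[member] = rep   # non-representative follows its representative
        active = representatives           # re-determine the samples involved in clustering
        if len(clusters) <= 1:             # assumed convergence condition
            break

    def root(i: int) -> int:
        # Follow representative links back to the final representative.
        while parent[i] != i:
            i = parent[i]
        return i

    result: Dict[int, List[int]] = {}
    for i in vectors:
        result.setdefault(root(i), []).append(i)
    return sorted(sorted(members) for members in result.values())
```

With six one-dimensional samples whose simulated labels are `a, a, b, a, b, b` and `max_size=3`, the annotator only ever sees three samples at a time, yet the final result groups the samples into `[[0, 1, 3], [2, 4, 5]]`.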

First claim

Opening claim text (preview).

What is claimed is:

1. A data labeling method based on artificial intelligence, comprising: determining a plurality of samples involved in clustering; performing a plurality of following operations circularly to realize iterative processing, until a convergence condition is satisfied, or a quantity of iterations reaches a number threshold, comprising: pre-clustering the plurality of samples involved in clustering, according to a vector representation of the respective samples involved in clustering, to obtain a plurality of class clusters, wherein each class cluster contains at least one sample involved in clustering; receiving labeling information for the respective class clusters, wherein the labeling information for the respective class clusters comprises: at least one sub-cluster contained in the respective class clusters, and a representative sample in each sub-cluster, wherein the sub-cluster comprises one representative sample and at least one non-representative sample; re-determining the plurality of samples involved in clustering, according to the labeling information by: taking the representative sample in the sub-cluster in the labeling information for the respective class clusters, as the re-determined plurality of samples involved in clustering; for the representative sample, determining a non-representative sample that belongs to, in a previous iteration process, a same sub-cluster as the representative sample; and determining a sub-cluster to which the non-representative sample belongs in a current iteration process, to be the same as a sub-cluster to which the representative sample belongs in the current iteration process; and determining a clustering result according to the labeling information for the respective class clusters.

2. The method of claim 1, wherein pre-clustering the plurality of samples involved in clustering, according to the vector representation of the respective samples involved in clustering, comprises: pre-clustering the plurality of samples involved in clustering, by using a cluster algorithm in combination with a restriction condition, to enable the respective class clusters obtained by the pre-clustering to satisfy the restriction condition.

3. The method of claim 2, wherein the restriction condition comprises at least one of: that a quantity of samples involved in clustering contained in each class cluster is not greater than a sample number threshold; or that respective samples involved in clustering contained in each class cluster belong to, in a pre-clustering process of a last iterative processing, different class clusters.

4. The method of claim 3, wherein pre-clustering the plurality of samples involved in clustering, by using the cluster algorithm in combination with the restriction condition, comprises: determining a density of the respective samples involved in clustering; and performing following operations for the respective samples involved in clustering respectively, according to a descending order of densities: determining a plurality of neighboring samples of a sample involved in clustering; and traversing the respective neighboring samples in sequence according to a descending order of similarities between the respective neighboring samples and the sample involved in clustering, wherein the sample involved in clustering is added to a class cluster to which a neighboring sample belongs, in a case of all of first judgment conditions are satisfied, wherein the first judgment conditions comprise: that a density of the neighboring sample is greater than a density of the sample involved in clustering; that the class cluster to which the neighboring sample belongs exists; that a similarity between the neighboring sample and the sample involved in clustering is greater than or equal to a similarity threshold; that a quantity of samples contained in the class cluster to which the neighboring sample belongs is less than the sample number threshold; and that the neighboring sample and the sample involved in clustering belong to, in the pre-clustering process of the last iterative processing, different class clusters.

5. The method of claim 4, further comprising: establishing a new class cluster, in a case of at least one of the first judgment conditions is not satisfied, the new class cluster including the sample involved in clustering.

6. The method of claim 3, wherein pre-clustering the plurality of samples involved in clustering, by using the cluster algorithm in combination with the restriction condition, comprises: selecting a part from the plurality of samples involved in clustering; taking each selected sample involved in clustering as a cluster center; and for each sample involved in clustering other than the cluster center, adding the sample involved in clustering to a class cluster to which a nearest cluster center belongs, in a case of all of second judgment conditions are satisfied, wherein the second judgment conditions comprise: that a quantity of samples contained in the class cluster to which the nearest cluster center belongs is less than the sample number threshold; and that the class cluster to which the nearest cluster center belongs does not include a sample, wherein the sample belongs to, in the pre-clustering process of the last iterative processing, a same class cluster as the sample involved in clustering.

7. The method of claim 6, further comprising: adding the sample involved in clustering to a class cluster to which another cluster center belongs, in a case of at least one of the second judgment conditions is not satisfied.

8. The method of claim 3, wherein the convergence condition comprises that a quantity of samples contained in the respective class clusters is less than the sample number threshold.

9. The method of claim 3, wherein the number threshold is determined by the sample number threshold and a quantity of samples involved in clustering in a first iteration process.

10. The method of claim 1, wherein the sample involved in clustering comprises an image sample or a text sample.

11. An electronic apparatus, comprising: at least one processor; and a memory connected in communication with the at least one processor, wherein the memory stores an instruction executable by the at least one processor, and the instruction, when executed by the at least one processor, enables the at least one processor to execute: determining a plurality of samples involved in clustering; performing following operations circularly to realize iterative processing, until a convergence condition is satisfied, or a quantity of iterations reaches a number threshold: pre-clustering the plurality of samples involved in clustering, according to a vector representation of the respective samples involved in clustering, to obtain a plurality of class clusters, wherein each class cluster contains at least one sample involved in clustering; receiving labeling information for the respective class clusters, wherein the labeling information for the respective class clusters comprises: at least one sub-cluster contained in the respective class clusters, and a representative sample in each sub-cluster, wherein the sub-cluster comprises one representative sample and at least one non-representative sample; re-determining the plurality of samples involved in clustering, according to the labeling information by: taking the representative sample in the sub-cluster in the labeling information for the respective class clusters, as the re-determined plurality of samples involved in clustering; for the representative sample, determining a non-representative sample that belongs to, in a previous iteration process, a …
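The density-ordered pre-clustering of claims 4 and 5 can be sketched as below. This is one possible reading, offered as an illustration only. The claims do not say how density or similarity is computed, so the sketch assumes cosine similarity and defines density as the mean similarity to the `k` most similar samples. It also reads claim 5 as "open a new class cluster when no neighbor passes all five first judgment conditions". The names `constrained_density_clustering`, `k`, `sim_threshold`, `max_size`, and `prev_cluster` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def constrained_density_clustering(
    vectors: Dict[int, Sequence[float]],
    k: int,
    sim_threshold: float,
    max_size: int,
    prev_cluster: Optional[Dict[int, int]] = None,
) -> List[List[int]]:
    ids = sorted(vectors)
    # Neighbors of each sample, already in descending order of similarity.
    neighbors: Dict[int, List[Tuple[float, int]]] = {}
    for i in ids:
        sims = [(cosine(vectors[i], vectors[j]), j) for j in ids if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors[i] = sims[:k]
    # Assumed density: mean similarity to the k most similar samples.
    density = {
        i: sum(s for s, _ in neighbors[i]) / len(neighbors[i]) if neighbors[i] else 0.0
        for i in ids
    }

    cluster_of: Dict[int, int] = {}
    clusters: List[List[int]] = []
    for i in sorted(ids, key=lambda s: (-density[s], s)):   # descending density
        for sim, j in neighbors[i]:                          # descending similarity
            together_before = (
                prev_cluster is not None
                and i in prev_cluster and j in prev_cluster
                and prev_cluster[i] == prev_cluster[j]
            )
            if (
                density[j] > density[i]                         # neighbor is denser
                and j in cluster_of                             # neighbor's cluster exists
                and sim >= sim_threshold                        # similar enough
                and len(clusters[cluster_of[j]]) < max_size     # cluster has room
                and not together_before                         # not co-clustered last round
            ):
                cluster_of[i] = cluster_of[j]
                clusters[cluster_of[j]].append(i)
                break
        else:
            # No neighbor satisfied every condition: start a new class cluster.
            cluster_of[i] = len(clusters)
            clusters.append([i])
    return clusters
```

Passing `prev_cluster` (a map from sample id to the class cluster it was placed in during the previous pre-clustering) keeps samples that an annotator has already compared from being shown together again, while `max_size` caps how many samples an annotator has to review at once.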

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • using classification, e.g. of video objects · CPC title

  • G06V10/761 · Primary

    Proximity, similarity or dissimilarity measures · CPC title

  • Clustering or classification · CPC title

  • Clustering; Classification · CPC title

  • Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12283085B2 cover?
Provided is a data labeling method based on artificial intelligence, an apparatus, and a storage medium relating to the field of artificial intelligence, particularly data labeling, image recognition, and natural language processing. The method includes: determining a plurality of samples involved in clustering; performing a plurality of following operations circularly to realize iterative proc…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V10/761. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 22, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).