Vision processing and model training method, device, storage medium and program product

US12374140B2 · US · B2

Patent metadata
Publication number: US-12374140-B2
Application number: US-202318170902-A
Country: US
Kind code: B2
Filing date: Feb 17, 2023
Priority date: Feb 25, 2022
Publication date: Jul 29, 2025
Grant date: Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides a vision processing and model training method, device, storage medium and program product. A specific implementation solution is as follows: establishing an image classification network with the same backbone network as the vision model, performing a self-monitoring training on the image classification network by using an unlabeled first data set; initializing a weight of a backbone network of the vision model according to a weight of a backbone network of the trained image classification network to obtain a pre-training model, the structure of the pre-training model being consistent with that of the vision model, and optimize the weight of the backbone network by using real data set in a current computer vision task scenario, so as to be more suitable for the current computer vision task; then, training the pre-training model by using a labeled second data set to obtain a trained vision model.
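The weight-initialization step in the abstract can be illustrated with a minimal sketch. It assumes each network's weights are held in a plain dictionary keyed by layer name and that backbone layers share a `backbone.` prefix; the function name, the layer names, and the prefix are hypothetical and do not come from the patent.

```python
def init_from_pretrained(vision_weights, classifier_weights, backbone_prefix="backbone."):
    """Build a pre-training model: vision-model weights whose backbone entries
    are taken from the trained image classification network.

    Assumption: both networks use the same layer names for the shared backbone.
    """
    merged = {name: list(value) for name, value in vision_weights.items()}
    for name, value in classifier_weights.items():
        # Only backbone layers are transferred; the classifier head is ignored.
        if name.startswith(backbone_prefix) and name in merged:
            merged[name] = list(value)
    return merged


# Hypothetical weights after self-supervised training of the classifier.
classifier = {"backbone.conv1": [0.1, 0.2], "fc.direction": [0.9]}
# Hypothetical, freshly created vision model with the same backbone.
vision = {"backbone.conv1": [0.0, 0.0], "head.detect": [0.5]}

pretrain_model = init_from_pretrained(vision, classifier)
```

In this sketch the task-specific head of the vision model keeps its own initial values, so the resulting structure matches the vision model, as the abstract requires, before training on the labeled second data set.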

First claim

Opening claim text (preview).

What is claimed is:

1. A vision model training method, executed by a processor, comprising: establishing an image classification network, wherein the image classification network has the same backbone network as a vision model; performing a self-monitoring training on the image classification network by using an unlabeled first data set to obtain a trained image classification network; initializing a weight of the backbone network of the vision model according to a weight of the backbone network of the trained image classification network to obtain a pre-training model; training the pre-training model by using a labeled second data set to obtain a trained vision model; and applying the trained vision model to a computer vision task to perform a corresponding computer vision processing to obtain a processing result, wherein the computer vision task comprises target detection, image segmentation, and text recognition, and wherein performing the self-monitoring training on the image classification network by using the unlabeled first data set to obtain the trained image classification network comprises: obtaining the unlabeled first data set, the first data set comprising a plurality of groups of sample images and direction information of each sample image, wherein each group of sample images comprises a first sample image and a second sample image obtained by rotating the first sample image by a preset angle; extracting an image feature of each sample image in the first data set through the image classification network, and determining a direction prediction result of each sample image according to the image feature; calculating a first loss according to the image features of two sample images whose direction information differs by 180 degrees in the same group of sample images; and calculating a second loss according to real direction information and the direction prediction result of each sample image; and adjusting the weight of the backbone network of the image classification network according to the first loss and the second loss.

2. The method according to claim 1, wherein the obtaining the unlabeled first data set comprises: obtaining an unlabeled first sample image and determining direction information of the first sample image as 0 degrees; rotating the first sample image by the preset angle to obtain the second sample image, and determining direction information of the second sample image as the preset angle.

3. The method according to claim 2, wherein the preset angle at least comprises 180 degrees, calculating the first loss according to the image features of two sample images whose direction information differs by 180 degrees in the same group of sample images comprises: calculating the first loss according to a difference between an image feature obtained by rotating an image feature of the first sample image by 180 degrees and an image feature of the second sample image obtained by rotating the first sample image by 180 degrees in each group of sample images.

4. The method according to claim 2, wherein the preset angle at least comprises a first angle and a second angle, the second angle is equal to the first angle plus 180 degrees, and the first angle is not 0 degrees; calculating the first loss according to the image features of two sample images whose direction information differs by 180 degrees in the same group of sample images comprises: calculating the first loss according to a difference between an image feature obtained by rotating an image feature of a sample image whose direction information is the first angle by 180 degrees and an image feature of a sample image whose direction information is the second angle in the same group of sample images.

5. The method according to claim 2, wherein obtaining the unlabeled first sample image comprises: obtaining an original image, wherein the original image comprises at least one of a synthetic image and a real image; performing a preprocessing on the original image to obtain a sample image meeting a model training requirement; performing a random data augmentation on the sample image to obtain the first sample image.

6. The method according to claim 3, wherein obtaining the unlabeled first sample image comprises: obtaining an original image, wherein the original image comprises at least one of a synthetic image and a real image; performing a preprocessing on the original image to obtain a sample image meeting a model training requirement; performing a random data augmentation on the sample image to obtain the first sample image.

7. The method according to claim 4, wherein obtaining the unlabeled first sample image comprises: obtaining an original image, wherein the original image comprises at least one of a synthetic image and a real image; performing a preprocessing on the original image to obtain a sample image meeting a model training requirement; performing a random data augmentation on the sample image to obtain the first sample image.

8. The method according to claim 5, wherein if the vision model is applied to a text recognition scenario, performing the preprocessing on the original image to obtain the sample image meeting the model training requirement comprises: performing a text detection on the original image, and extracting an image of a region where text information is located; and performing an image correction on the image of the region where the text information is located to obtain the sample image meeting the model training requirement.

9. The method according to claim 6, wherein if the vision model is applied to a text recognition scenario, performing the preprocessing on the original image to obtain the sample image meeting the model training requirement comprises: performing a text detection on the original image, and extracting an image of a region where text information is located; and performing an image correction on the image of the region where the text information is located to obtain the sample image meeting the model training requirement.

10. The method according to claim 7, wherein if the vision model is applied to a text recognition scenario, performing the preprocessing on the original image to obtain the sample image meeting the model training requirement comprises: performing a text detection on the original image, and extracting an image of a region where text information is located; and performing an image correction on the image of the region where the text information is located to obtain the sample image meeting the model training requirement.

11. The method according to claim 1, wherein adjusting the weight of the backbone network of the image classification network according to the first loss and the second loss comprises: calculating a sum of the first loss and the second loss as a final loss; and adjusting the weight of the backbone network of the image classification network according to the final loss.

12. The method according to claim 2, wherein adjusting the weight of the backbone network of the image classification network according to the first loss and the second loss comprises: calculating a sum of the first loss and the second loss as a final loss; and adjusting the weight of the backbone network of the image classification network according to the final loss.

13. The method according to claim 3, wherein adjusting the weight of the backbone network of the image classification network according to the first loss and the second loss comprises: calculating a sum of the first loss and the second loss as a final loss; and adjusting the weight of the backbone network of the image
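The training signal described in claims 1 to 3 and 11 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: images and feature maps are small 2-D lists, the "difference" in the first loss is taken to be a mean squared error, the second loss is taken to be a cross-entropy over direction classes, and `toy_features` is a hypothetical stand-in for the backbone. The claims name none of these specific choices.

```python
import math


def rot180(grid):
    """Rotate a 2-D grid (list of rows) by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]


def make_group(first_image):
    """Claim 2: the first sample image gets direction 0; its rotated copy
    gets the preset angle (here 180 degrees) as direction information."""
    return [(first_image, 0), (rot180(first_image), 180)]


def toy_features(image):
    """Hypothetical backbone stand-in: scales every pixel by 0.5."""
    return [[0.5 * p for p in row] for row in image]


def first_loss(feat_first, feat_second):
    """Claim 3: compare the 180-degree-rotated feature of the first image
    with the feature of the second image (mean squared error is assumed)."""
    rotated = rot180(feat_first)
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(rotated, feat_second)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def second_loss(logits, true_class):
    """Loss between the direction prediction and the real direction
    (softmax cross-entropy is assumed)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[true_class]


def final_loss(loss_1, loss_2):
    """Claim 11: the final loss is the sum of the first and second loss."""
    return loss_1 + loss_2


image = [[1, 2], [3, 4]]
(first_img, _), (second_img, _) = make_group(image)
l1 = first_loss(toy_features(first_img), toy_features(second_img))
l2 = second_loss([0.0, 0.0], true_class=0)  # two direction classes: 0 and 180
total = final_loss(l1, l2)
```

Because the toy feature extractor commutes with rotation, the first loss is zero here; a real backbone would not start out that way, and minimizing the first loss would push its features toward that behavior while the second loss trains the direction classifier.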

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • Classification techniques

  • Extraction of features or characteristics of the image

  • Image preprocessing

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • using neural networks

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12374140B2 cover?
The present disclosure provides a vision processing and model training method, device, storage medium and program product. A specific implementation solution is as follows: establishing an image classification network with the same backbone network as the vision model, performing a self-monitoring training on the image classification network by using an unlabeled first data set; initializing a …
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V30/19147. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).