Unified Vision and Dialogue Transformer with BERT
US-2021232773-A1 · Jul 29, 2021 · US
US12424010B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12424010-B2 |
| Application number | US-202318168759-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 14, 2023 |
| Priority date | Aug 16, 2022 |
| Publication date | Sep 23, 2025 |
| Grant date | Sep 23, 2025 |
Official abstract text for this publication.
The present disclosure provides a character recognition model training method and apparatus, a character recognition method and apparatus, a device and a medium, relating to the technical field of artificial intelligence, and specifically to the technical fields of deep learning, image processing and computer vision, which can be applied to scenarios such as character detection and recognition technology. The specific implementing solution is: partitioning an untagged training sample into at least two sub-sample images; dividing the at least two sub-sample images into a first training set and a second training set; where the first training set includes a first sub-sample image with a visible attribute, and the second training set includes a second sub-sample image with an invisible attribute; performing self-supervised training on a to-be-trained encoder by taking the second training set as a tag of the first training set, to obtain a target encoder.
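As a rough illustration of the partitioning and dividing steps described in the abstract, the Python sketch below cuts a small image into equal patches and assigns them to a visible set and an invisible set. The function names `partition` and `split_by_mask`, the `mask_ratio` and `seed` parameters, the regular patch grid, and the random choice of hidden patches are all assumptions made for illustration; the abstract does not say how sub-sample images are cut or selected.

```python
import random


def partition(image, patch_h, patch_w):
    """Split a 2D image (list of rows) into non-overlapping patches, row-major.

    Assumption: a regular grid; the source only says "at least two sub-sample images".
    """
    rows, cols = len(image), len(image[0])
    if rows % patch_h or cols % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, rows, patch_h):
        for left in range(0, cols, patch_w):
            patches.append([row[left:left + patch_w]
                            for row in image[top:top + patch_h]])
    return patches


def split_by_mask(patches, mask_ratio, seed=0):
    """Divide patches into (visible, invisible) lists of (index, patch) pairs.

    Assumption: hidden patches are picked uniformly at random; mask_ratio is
    the hypothetical fraction of patches given the "invisible" attribute.
    """
    if len(patches) < 2:
        raise ValueError("need at least two patches")
    n_hidden = round(len(patches) * mask_ratio)
    # Keep both sets non-empty so each contains at least one patch.
    n_hidden = min(max(n_hidden, 1), len(patches) - 1)
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(patches)), n_hidden))
    visible = [(i, p) for i, p in enumerate(patches) if i not in hidden]
    invisible = [(i, p) for i, p in enumerate(patches) if i in hidden]
    return visible, invisible


if __name__ == "__main__":
    # Toy 4x8 "image" whose pixel value encodes its position.
    image = [[r * 8 + c for c in range(8)] for r in range(4)]
    patches = partition(image, patch_h=4, patch_w=2)
    visible, invisible = split_by_mask(patches, mask_ratio=0.5)
    print("visible indices:", [i for i, _ in visible])
    print("invisible indices:", [i for i, _ in invisible])
```

Keeping the original patch index alongside each patch is a design choice of this sketch: it lets a later stage know where in the image each hidden patch came from.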
Opening claim text (preview).
What is claimed is:

1. A character recognition method being applied to a server and comprising:
partitioning an untagged training sample into at least two sub-sample images;
dividing the at least two sub-sample images into a first training set and a second training set; wherein the first training set comprises a first sub-sample image with a visible attribute, and the second training set comprises a second sub-sample image with an invisible attribute;
performing self-supervised training on a to-be-trained encoder by taking the second training set as a tag of the first training set, to obtain a target encoder;
wherein the performing the self-supervised training on the to-be-trained encoder by taking the second training set as the tag of the first training set, to obtain the target encoder comprises:
initializing the to-be-trained encoder to obtain a first encoder;
extracting, based on the first encoder, a first visual feature of the first sub-sample image in the first training set and a second visual feature of the second sub-sample image in the second training set;
performing mask query calculation on the first visual feature, to obtain a third visual feature; and
updating the first encoder according to a feature error between the third visual feature and the second visual feature until the feature error satisfies a first error condition, and determining a latest updated first encoder as the target encoder;
wherein the updating the first encoder according to the feature error between the third visual feature and the second visual feature until the feature error satisfies the first error condition, and the determining the latest updated first encoder as the target encoder comprise:
initializing a to-be-trained decoder to obtain a first decoder;
determining, based on the first decoder, an image error generated when image reconstruction is performed on the third visual feature;
determining the feature error between the third visual feature and the second visual feature; and
updating the first encoder based on the feature error and the image error and updating the first decoder based on the image error until the feature error satisfies the first error condition and the image error satisfies a second error condition, and determining a latest obtained first encoder as the target encoder;
receiving a to-be-recognized image sent by a terminal device, and performing, based on the target encoder and the updated first decoder, image features extraction on the to-be-recognized image to obtain a target text; and
sending the target text to the terminal device.

2. The method according to claim 1, wherein the determining, based on the first decoder, the image error generated when the image reconstruction is performed on the third visual feature comprises:
performing decoding calculation processing on the third visual feature by using the first decoder, to obtain a first decoded feature; and
obtaining the image error according to an image reconstruction result of the first decoded feature.

3. The method according to claim 2, wherein the obtaining the image error according to the image reconstruction result of the first decoded feature comprises:
performing image reconstruction processing on the first decoded feature, to obtain a first prediction result; and
performing image error calculation by using the second sub-sample image and the first prediction result, to obtain the image error.

4. The method according to claim 1, further comprising:
dividing, based on a mask setting strategy, at least two query vectors into a first query vector and a second query vector; wherein the mask setting strategy comprises mask data generated based on a preset first mask ratio; the at least two query vectors are spatial transformation vectors corresponding to a basis character string;
the performing the mask query calculation on the first visual feature, to obtain the third visual feature comprises:
obtaining, based on feature prediction calculation of the second query vector and the first visual feature, a feature vector corresponding to an occurrence probability of the first visual feature in the second query vector; and
performing vector combination on the feature vector corresponding to the first visual feature, to obtain the third visual feature.

5. The method according to claim 1, wherein the dividing the at least two sub-sample images into the first training set and the second training set comprises:
dividing the at least two sub-sample images into the first training set and the second training set by using a mask setting strategy.

6. A character recognition apparatus comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein, the memory stores an instruction executable by the at least one processor, and the instruction is executed by the at least one processor to cause the at least one processor to perform the method according to claim 1.

7. A non-transitory computer-readable storage medium storing a computer instruction, wherein the computer instruction is used to cause a computer to perform the method according to claim 1.

8. A character recognition model training method comprising:
partitioning a synthetic sample into at least two sub-synthetic images, wherein the synthetic sample comprises a synthetic text tag;
dividing the at least two sub-synthetic images as a first synthetic set and a second synthetic set; wherein the first synthetic set comprises a first sub-synthetic image with a visible attribute, and the second synthetic set comprises a second sub-synthetic image with an invisible attribute; and
performing, based on the first synthetic set and the second synthetic set, supervised training on a to-be-trained decoder to obtain a target decoder corresponding to the to-be-trained decoder;
wherein the performing, based on the first synthetic set and the second synthetic set, the supervised training on the to-be-trained decoder to obtain the target decoder corresponding to the to-be-trained decoder comprises:
extracting, based on a target encoder, a first feature sequence of the first sub-synthetic image in the first synthetic set; wherein the target encoder is obtained by performing following steps:
partitioning an untagged training sample into at least two sub-sample images;
dividing the at least two sub-sample images into a first training set and a second training set; wherein the first training set comprises a first sub-sample image with a visible attribute, and the second training set comprises a second sub-sample image with an invisible attribute; and
performing self-supervised training on a to-be-trained encoder by taking the second training set as a tag of the first training set, to obtain the target encoder;
performing feature completion on the first feature sequence according to an image position, in the synthetic sample, of the second sub-synthetic image in the second synthetic set, to obtain a second feature sequence; and
training, by taking that a predictive text of the second feature sequence predicted by the to-be-trained decoder is the same as a synthetic text of the second sub-synthetic image in the synthetic text tag as a training objective, to obtain the target decoder corresponding to the to-be-trained decoder.

9. The method according to claim 8, wherein the training, by taking that the predictive text of the second feature sequence predicted by the to-be-trained decoder is same as the synthetic text of the second sub-synthetic image in the synthetic text tag as the training objective, to obtain the target decoder corresponding to the to-be-trained decoder comprises:
initializing the to-be-trained decoder to obtain a second decoder
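
The feature completion step recited in claim 8 can be pictured with a short Python sketch. It assumes one plausible reading: the first feature sequence holds one feature vector per visible patch, and completion inserts a placeholder vector at each position where an invisible patch sat in the synthetic sample. The function name `complete_features`, the `placeholder` argument, and the use of a fixed placeholder vector are hypothetical; the claim text does not state how the missing positions are filled.

```python
def complete_features(first_sequence, invisible_positions, placeholder):
    """Build a full-length feature sequence from visible-patch features.

    first_sequence      -- feature vectors of the visible patches, in image order
    invisible_positions -- patch indices (in the full image) that were hidden
    placeholder         -- vector inserted at each hidden position (assumption)
    """
    hidden = set(invisible_positions)
    if len(hidden) != len(invisible_positions):
        raise ValueError("invisible positions must be unique")
    total = len(first_sequence) + len(hidden)
    if any(not 0 <= pos < total for pos in hidden):
        raise ValueError("invisible position out of range")

    visible_iter = iter(first_sequence)
    second_sequence = []
    for pos in range(total):
        if pos in hidden:
            # Hidden patch: no encoder feature exists, so insert the placeholder.
            second_sequence.append(list(placeholder))
        else:
            # Visible patch: reuse its feature at its original position.
            second_sequence.append(list(next(visible_iter)))
    return second_sequence


if __name__ == "__main__":
    visible_features = [[1.0, 2.0], [3.0, 4.0]]
    completed = complete_features(visible_features, [1, 3], [0.0, 0.0])
    print(completed)
```

Under this reading, a decoder would receive the completed sequence and be trained to predict the text that falls at the placeholder positions.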
Querying · CPC title
Dividing image into blocks, subimages or windows · CPC title
Image segmentation details · CPC title
Training; Learning · CPC title