Pre-training method, image and text retrieval method for a vision and scene text aggregation model, electronic device, and storage medium

US12347158B2 · US · B2

Patent metadata
Publication number: US-12347158-B2
Application number: US-202318192393-A
Country: US
Kind code: B2
Filing date: Mar 29, 2023
Priority date: May 26, 2022
Publication date: Jul 1, 2025
Grant date: Jul 1, 2025

Abstract

Official abstract text for this publication.

A pre-training method for a Vision and Scene Text Aggregation model includes: acquiring a sample image-text pair; extracting a sample scene text from a sample image; inputting a sample text into a text encoding network to obtain a sample text feature; inputting the sample image and an initial sample aggregation feature into a visual encoding subnetwork and inputting the initial sample aggregation feature and the sample scene text into a scene encoding subnetwork to obtain a global image feature of the sample image and a learned sample aggregation feature; and pre-training the Vision and Scene Text Aggregation model according to the sample text feature, the global image feature of the sample image, and the learned sample aggregation feature.
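
Claim 4 of this patent describes how the sample image may be prepared before it enters the visual encoding subnetwork: the image is divided into blocks, each block is linearly projected, modal and position information are applied, and an initial global image feature is placed in the encoding sequence. The sketch below illustrates that preparation in plain Python. It is not the patented implementation. The function names, the grayscale list-of-lists image, the row-major block order, placing the initial global feature first, and combining modal and position information by addition are all assumptions made for illustration.

```python
from typing import List

Image = List[List[float]]  # hypothetical grayscale image: rows of pixel values


def split_into_blocks(image: Image, block: int) -> List[List[float]]:
    """Divide an H x W image into non-overlapping block x block patches.

    Patches are returned in row-major order, each flattened to a vector.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")
    patches = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            patches.append([image[r][c]
                            for r in range(top, top + block)
                            for c in range(left, left + block)])
    return patches


def linear_projection(vec: List[float], weights: List[List[float]]) -> List[float]:
    """Project a flattened block with a weight matrix (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def build_image_sequence(image, block, weights, modal, positions, initial_global):
    """Return [initial global image feature] followed by processed block encodings."""
    sequence = [list(initial_global)]
    for i, patch in enumerate(split_into_blocks(image, block)):
        encoded = linear_projection(patch, weights)
        # Assumption: modal and position information are combined by addition.
        sequence.append([e + m + p
                         for e, m, p in zip(encoded, modal, positions[i])])
    return sequence
```

In a trained model the projection weights, modal vector, position vectors, and initial global image feature would be learned parameters; here they are passed in as plain lists so the data flow stays visible.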

First claim

Opening claim text (preview).

What is claimed is:

1. A pre-training method for a Vision and Scene Text Aggregation model, wherein the Vision and Scene Text Aggregation model comprises a text encoding network and a visual scene encoding network, the visual scene encoding network comprises a visual encoding subnetwork and a scene encoding subnetwork, and the method comprises: acquiring a sample image-text pair, wherein the sample image-text pair comprises a sample image and a sample text; extracting a sample scene text from the sample image; inputting the sample text into the text encoding network to obtain a sample text feature; inputting the sample image and an initial sample aggregation feature into the visual encoding subnetwork and inputting the initial sample aggregation feature and the sample scene text into the scene encoding subnetwork to obtain a global image feature of the sample image and a learned sample aggregation feature; and pre-training the Vision and Scene Text Aggregation model according to the sample text feature, the global image feature of the sample image, and the learned sample aggregation feature.

2. The method of claim 1, wherein inputting the sample image and the initial sample aggregation feature into the visual encoding subnetwork and inputting the initial sample aggregation feature and the sample scene text into the scene encoding subnetwork to obtain the global image feature of the sample image and the learned sample aggregation feature comprise: inputting the sample image into an input layer in the visual encoding subnetwork and inputting the initial sample aggregation feature into an aggregation layer in the visual encoding subnetwork to obtain the global image feature of the sample image outputted from the visual encoding subnetwork and a visual aggregation feature outputted from the visual encoding subnetwork; inputting the sample scene text into an input layer in the scene encoding subnetwork and inputting the initial sample aggregation feature into an aggregation layer in the scene encoding subnetwork to obtain a scene aggregation feature outputted from the scene encoding subnetwork; and aggregating the visual aggregation feature outputted from the visual encoding subnetwork with the scene aggregation feature outputted from the scene encoding subnetwork to obtain the learned sample aggregation feature.

3. The method of claim 1, wherein inputting the sample text into the text encoding network to obtain the sample text feature comprises: performing word embedding on the sample text to obtain a sample text word vector; determining a word encoding result of the sample text according to modal information of the sample text, position encoding information of the sample text, and the sample text word vector; constructing an encoding sequence of the sample text according to an initial sample text feature and the word encoding result of the sample text; and inputting the encoding sequence of the sample text into the text encoding network to obtain a learned sample text feature.

4. The method of claim 1, wherein inputting the sample image into the visual encoding subnetwork comprises: dividing the sample image into blocks to obtain a sample image block sequence; performing linear projection on a sample image block in the sample image block sequence to obtain an encoding result of the sample image block; processing the encoding result of the sample image block according to modal information of the sample image block and position encoding information of the sample image block to obtain the processed encoding result of the sample image block; constructing an encoding sequence of the sample image according to an initial global image feature and the processed encoding result of the sample image block; and inputting the encoding sequence of the sample image into an input layer in the visual encoding subnetwork.

5. The method of claim 1, wherein inputting the sample scene text into the scene encoding subnetwork comprises: performing word embedding on the sample scene text to obtain a sample scene text vector; determining an encoding result of the sample scene text according to image position encoding information of the sample scene text, modal information of the sample scene text, character position encoding information of the sample scene text, and the sample scene text vector; constructing an encoding sequence of the sample scene text according to an initial sample scene text feature and the encoding result of the sample scene text; and inputting the encoding sequence of the sample scene text into an input layer in the scene encoding subnetwork.

6. The method of claim 1, wherein pre-training the Vision and Scene Text Aggregation model according to the sample text feature, the global image feature of the sample image, and the learned sample aggregation feature comprises: determining an aggregation text contrast loss according to the sample text feature and the learned sample aggregation feature; determining an image text contrast loss according to the global image feature of the sample image and the sample text feature; determining a training loss according to the aggregation text contrast loss and the image text contrast loss; and pre-training the Vision and Scene Text Aggregation model by using the training loss.

7. The method of claim 6, wherein determining the training loss according to the aggregation text contrast loss and the image text contrast loss comprises: determining whether the sample scene text is an empty text or a non-empty text; and in a case where the sample scene text is the empty text, using the image text contrast loss as the training loss; and in a case where the sample scene text is the non-empty text, using a sum of the aggregation text contrast loss and the image text contrast loss as the training loss.

8. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores an instruction executable by the at least one processor to enable the at least one processor to perform the method of claim 1.

9. A non-transitory computer-readable storage medium, which is configured to store a computer instruction for causing a computer to perform the method of claim 1.

10. A training method for a Vision and Scene Text Aggregation model, comprising: acquiring a service image-text pair provided by a service party, wherein the service image-text pair comprises a service image and a service text; and finely adjusting the Vision and Scene Text Aggregation model by using the service image and the service text as training data, wherein the Vision and Scene Text Aggregation model is obtained based on a pre-training method for a Vision and Scene Text Aggregation model, wherein the Vision and Scene Text Aggregation model comprises a text encoding network and a visual scene encoding network, the visual scene encoding network comprises a visual encoding subnetwork and a scene encoding subnetwork, and the pre-training method comprises: acquiring a sample image-text pair, wherein the sample image-text pair comprises a sample image and a sample text; extracting a sample scene text from the sample image; inputting the sample text into the text encoding network to obtain a sample text feature; inputting the sample image and an initial sample aggregation feature into the visual encoding subnetwork and inputting the initial sample aggregation feature and the sample scene text into the scene encoding subnetwork to obtain a global image feature of the sample image and a learned sample aggregation feature; and pre-training the Vision and Scene Text Aggregation model according to the sample text
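
Claims 6 and 7 in the preview above state how the training loss is assembled: an image text contrast loss is always used, and an aggregation text contrast loss is added only when the sample scene text is non-empty. The Python sketch below illustrates that rule. The claims do not specify the form of the contrast loss, so the symmetric InfoNCE-style loss over cosine similarities, the `temperature` parameter, and all function names here are assumptions, as is treating whitespace-only scene text as empty.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def _cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[i]
    return total / len(logits)


def contrast_loss(queries, keys, temperature: float = 0.07) -> float:
    """Assumed symmetric InfoNCE loss; queries[i] is paired with keys[i]."""
    logits = [[_cosine(q, k) / temperature for k in keys] for q in queries]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy(logits) + _cross_entropy(transposed))


def select_training_loss(image_text_loss: float,
                         aggregation_text_loss: float,
                         scene_text: str) -> float:
    """Apply the empty / non-empty scene text rule described in claim 7."""
    # Assumption: whitespace-only scene text counts as empty.
    if not scene_text.strip():
        return image_text_loss
    return image_text_loss + aggregation_text_loss
```

With this rule, an image that contains no recognizable scene text still contributes a training signal through the image text term alone, while an image with scene text contributes through both terms.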

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12347158B2 cover?
A pre-training method for a Vision and Scene Text Aggregation model includes: acquiring a sample image-text pair; extracting a sample scene text from a sample image; inputting a sample text into a text encoding network to obtain a sample text feature; inputting the sample image and an initial sample aggregation feature into a visual encoding subnetwork and inputting the initial sample aggregati…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd, Beijing Baidu Netcom Science Tech Co Ltd China
What technology area does this patent fall under?
Primary CPC classification G06F16/332. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).