Training method of image-text matching model, bi-directional search method, and relevant apparatus

US11699298B2 · US · B2

Patent metadata
Publication number: US-11699298-B2
Application number: US-202117349904-A
Country: US
Kind code: B2
Filing date: Jun 16, 2021
Priority date: Sep 12, 2017
Publication date: Jul 11, 2023
Grant date: Jul 11, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This application relates to the field of artificial intelligence technologies, and in particular, to a training method of an image-text matching model, a bi-directional search method, and a relevant apparatus. The training method includes extracting a global feature and a local feature of an image sample; extracting a global feature and a local feature of a text sample; training a matching model according to the extracted global feature and local feature of the image sample and the extracted global feature and local feature of the text sample, to determine model parameters of the matching model; and determining, by the matching model, according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the image and the text.
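The final step of the abstract, turning global and local features into a single matching degree, can be illustrated with a minimal sketch. The abstract does not say how the two similarities are computed or combined, so the cosine similarity, the weighted sum, and the names `cosine_similarity`, `matching_degree`, and `global_weight` are all assumptions made for illustration. The sketch also assumes the feature vectors have already been mapped into a shared semantic space.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_degree(
    image_global: Sequence[float],
    image_local: Sequence[float],
    text_global: Sequence[float],
    text_local: Sequence[float],
    global_weight: float = 0.5,  # hypothetical weighting, not taken from the patent
) -> float:
    """Combine a global and a local similarity into one score.

    Assumption: all four vectors already live in the same semantic space,
    and the combination is a simple weighted sum.
    """
    global_sim = cosine_similarity(image_global, text_global)
    local_sim = cosine_similarity(image_local, text_local)
    return global_weight * global_sim + (1.0 - global_weight) * local_sim


if __name__ == "__main__":
    # Toy vectors: the global features agree, the local features do not.
    score = matching_degree([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    print(f"matching degree: {score:.2f}")
```

Because the same score is defined for any image-text pair, ranking candidate texts for a query image or candidate images for a query text uses the same function, which is one way to read the "bi-directional search" in the title.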

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: extracting a global feature of an image sample, and extracting a local feature of the image sample; extracting a global feature of a text sample, and extracting a local feature of the text sample; and training an image-text matching model according to the global feature and the local feature of the image sample and according to the global feature and the local feature of the text sample, wherein the global feature of the image sample includes a feature of a fully connected layer of a convolutional neural network of the image sample, and wherein each node of the fully connected layer is connected to all nodes of a previous layer of the convolutional neural network.

2. The method of claim 1, further comprising: determining, via the image-text matching model, and according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the inputted image and the inputted text.

3. The method of claim 2, further comprising: extracting the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text; and sending the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text to a server, for the server to determine the matching degree between the inputted image and the inputted text; or sending the inputted image or the inputted text to the server, for the server to extract the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text and then for the server to determine the matching degree between the inputted image and the inputted text.

4. The method of claim 2, further comprising: mapping the global feature of the inputted image and the global feature of the inputted text via the image-text matching model into a specified semantic space, to calculate a similarity between the global feature of the inputted image and the global feature of the inputted text; mapping the local feature of the inputted image and the local feature of the inputted text into the specified semantic space, to calculate a similarity between the local feature of the inputted image and the local feature of the inputted text; and determining the matching degree between the inputted image and the inputted text according to the similarity between the global feature of the inputted image and the global feature of the inputted text and according to the similarity between the local feature of the inputted image and the local feature of the inputted text.

5. The method of claim 1, wherein the local feature of the image sample is extracted by: dividing the image sample into a quantity of image blocks; calculating probabilities, respectively corresponding to the quantity of image blocks, that each of the quantity of image blocks includes a specified category of image information; and selecting a maximum probability from the probabilities, features of an image block of the quantity of image blocks corresponding to the maximum probability are regarded as the local feature of the image sample.

6. The method of claim 1, wherein the global feature of the text sample is extracted by: performing word segmentation on the text sample to obtain word segments; determining vectors respectively corresponding to the word segments, the vectors sharing a same vector length; and inputting the vectors into a convolutional neural network to extract the global feature of the text sample, the convolutional neural network including a previous convolutional layer and a current convolutional layer, and a field of view of the previous convolutional layer being used as an input to the current convolutional layer.

7. The method of claim 6, wherein the global feature of the text sample is further extracted by: removing useless feature information from the text sample via a pooling layer.

8. The method of claim 1, wherein training of the image-text matching model further includes: respectively mapping the global feature of the image sample and the global feature of the text sample through at least two fully connected layers into a specified semantic space; and respectively mapping the local feature of the image sample and the local feature of the text sample through the at least two fully connected layers into the specified semantic space.

9. The method of claim 1, wherein the method and training of the image-text matching model are both performed by a same computing device or performed by different computing devices.

10. The method of claim 1, wherein the global feature of the image sample is extracted using a global image convolutional neural network, wherein the local feature of the image sample is extracted using a local image convolutional neural network, wherein the global feature of the text sample is extracted using a global text encoder, or wherein the local feature of the text sample is extracted using a local text encoder.

11. An apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform a method including: extracting a global feature of an image sample, and extracting a local feature of the image sample; extracting a global feature of a text sample, and extracting a local feature of the text sample; and training an image-text matching model according to the global feature and the local feature of the image sample and according to the global feature and the local feature of the text sample, wherein the global feature of the image sample includes a feature of a fully connected layer of a convolutional neural network of the image sample, and wherein each node of the fully connected layer is connected to all nodes of a previous layer of the convolutional neural network.

12. The apparatus of claim 11, wherein the processor is configured to execute the computer program instructions and further perform: determining, via the image-text matching model, and according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the inputted image and the inputted text.

13. The apparatus of claim 12, wherein the processor is configured to execute the computer program instructions and further perform: extracting the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text; and sending the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text to a server, for the server to determine the matching degree between the inputted image and the inputted text; or sending the inputted image or the inputted text to the server, for the server to extract the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text and then for the server to determine the matching degree between the inputted image and the inputted text.

14. The apparatus of claim 12, wherein the processor is configured to execute the computer program instructions and further perform: mapping the global feature of the inputted image and the global feature of the inputted text via the image-text matching model into a specified semantic space, to calculate a similarity between the global feature of the inputted image and the global feature of the inputted text; mapping the local feature of
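The block-selection procedure recited in claim 5 (divide the image into blocks, score each block for a specified category, keep the features of the highest-scoring block) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claim does not specify the classifier or the feature extractor, so `mean_intensity` stands in for a real probability model and `flatten` stands in for a real feature extractor. The names `split_into_blocks` and `select_local_feature` are likewise hypothetical.

```python
from typing import Callable, List, Tuple

Block = List[List[float]]  # a small 2D grid of pixel values in [0, 1]


def split_into_blocks(image: Block, block_h: int, block_w: int) -> List[Block]:
    """Divide a 2D image into blocks in row-major order."""
    blocks = []
    for top in range(0, len(image), block_h):
        for left in range(0, len(image[0]), block_w):
            blocks.append([row[left:left + block_w] for row in image[top:top + block_h]])
    return blocks


def mean_intensity(block: Block) -> float:
    """Toy stand-in for 'probability that the block contains the category'."""
    pixels = [p for row in block for p in row]
    return sum(pixels) / len(pixels)


def flatten(block: Block) -> List[float]:
    """Toy stand-in for a feature extractor: the raw pixels as a vector."""
    return [p for row in block for p in row]


def select_local_feature(
    blocks: List[Block],
    prob_fn: Callable[[Block], float],
    feature_fn: Callable[[Block], List[float]],
) -> Tuple[int, float, List[float]]:
    """Return (index, probability, features) of the block with the maximum probability."""
    probs = [prob_fn(b) for b in blocks]
    best = max(range(len(blocks)), key=lambda i: probs[i])
    return best, probs[best], feature_fn(blocks[best])


if __name__ == "__main__":
    # A 4x4 image whose bottom-right 2x2 block is the only bright region.
    image = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    blocks = split_into_blocks(image, 2, 2)
    print(select_local_feature(blocks, mean_intensity, flatten))
```

Passing the scoring and feature functions as parameters keeps the selection step independent of whichever classifier is actually used, which the claim preview leaves open.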

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • based on eigen-space representations, e.g. from pose or different illumination conditions; Shape manifolds · CPC title

  • Proximity, similarity or dissimilarity measures · CPC title

  • G06F18/214 · Primary

    Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] · CPC title

  • G06V30/413 · Primary

    Classification of content, e.g. text, photographs or tables · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11699298B2 cover?
This application relates to the field of artificial intelligence technologies, and in particular, to a training method of an image-text matching model, a bi-directional search method, and a relevant apparatus. The training method includes extracting a global feature and a local feature of an image sample; extracting a global feature and a local feature of a text sample; training a matching mode…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F18/214. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 11, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).