System and a method for semantic level image retrieval
US-2020334486-A1 · Oct 22, 2020 · US
US11699298B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11699298-B2 |
| Application number | US-202117349904-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 16, 2021 |
| Priority date | Sep 12, 2017 |
| Publication date | Jul 11, 2023 |
| Grant date | Jul 11, 2023 |
Official abstract text for this publication.
This application relates to the field of artificial intelligence technologies, and in particular, to a training method of an image-text matching model, a bi-directional search method, and a relevant apparatus. The training method includes extracting a global feature and a local feature of an image sample; extracting a global feature and a local feature of a text sample; training a matching model according to the extracted global feature and local feature of the image sample and the extracted global feature and local feature of the text sample, to determine model parameters of the matching model; and determining, by the matching model, according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the image and the text.
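The matching step described in the abstract can be illustrated with a minimal sketch. It assumes details the source does not fix: cosine similarity as the similarity measure, a ReLU between the two fully connected layers, and an equal-weight combination of the global and local similarities. All function and parameter names (`embed`, `matching_degree`, `local_weight`, the `model` dictionary keys) are hypothetical, and the randomly initialised layers stand in for parameters that training would determine.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer: each output node is connected to every input node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    # Assumption: the source does not name the activation function.
    return [max(0.0, v) for v in x]

def make_layer(n_in, n_out, rng):
    """Hypothetical random initialisation; a trained model would learn these values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def embed(feature, layers):
    """Map a feature through two fully connected layers into the shared semantic space."""
    (w1, b1), (w2, b2) = layers
    return linear(relu(linear(feature, w1, b1)), w2, b2)

def cosine(a, b):
    """Cosine similarity (an assumed choice); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matching_degree(img_global, img_local, txt_global, txt_local, model, local_weight=0.5):
    """Combine global and local similarities; the weighted sum is an assumption."""
    global_sim = cosine(embed(img_global, model["img_global"]),
                        embed(txt_global, model["txt_global"]))
    local_sim = cosine(embed(img_local, model["img_local"]),
                       embed(txt_local, model["txt_local"]))
    return (1.0 - local_weight) * global_sim + local_weight * local_sim

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy dimensions: 4-d input features, 8-d hidden layer, 3-d semantic space.
    model = {name: (make_layer(4, 8, rng), make_layer(8, 3, rng))
             for name in ("img_global", "img_local", "txt_global", "txt_local")}
    image_feats = ([0.2, 0.5, 0.1, 0.9], [0.7, 0.1, 0.3, 0.4])
    text_feats = ([0.3, 0.4, 0.2, 0.8], [0.6, 0.2, 0.2, 0.5])
    print(matching_degree(image_feats[0], image_feats[1],
                          text_feats[0], text_feats[1], model))
```

For a bi-directional search, the same score could be computed between one query (an image or a text) and every candidate of the other modality, with candidates then ranked by score.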
Opening claim text (preview).
What is claimed is:
1. A method, comprising: extracting a global feature of an image sample, and extracting a local feature of the image sample; extracting a global feature of a text sample, and extracting a local feature of the text sample; and training an image-text matching model according to the global feature and the local feature of the image sample and according to the global feature and the local feature of the text sample, wherein the global feature of the image sample includes a feature of a fully connected layer of a convolutional neural network of the image sample, and wherein each node of the fully connected layer is connected to all nodes of a previous layer of the convolutional neural network.
2. The method of claim 1, further comprising: determining, via the image-text matching model, and according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the inputted image and the inputted text.
3. The method of claim 2, further comprising: extracting the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text; and sending the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text to a server, for the server to determine the matching degree between the inputted image and the inputted text; or sending the inputted image or the inputted text to the server, for the server to extract the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text and then for the server to determine the matching degree between the inputted image and the inputted text.
4. The method of claim 2, further comprising: mapping the global feature of the inputted image and the global feature of the inputted text via the image-text matching model into a specified semantic space, to calculate a similarity between the global feature of the inputted image and the global feature of the inputted text; mapping the local feature of the inputted image and the local feature of the inputted text into the specified semantic space, to calculate a similarity between the local feature of the inputted image and the local feature of the inputted text; and determining the matching degree between the inputted image and the inputted text according to the similarity between the global feature of the inputted image and the global feature of the inputted text and according to the similarity between the local feature of the inputted image and the local feature of the inputted text.
5. The method of claim 1, wherein the local feature of the image sample is extracted by: dividing the image sample into a quantity of image blocks; calculating probabilities, respectively corresponding to the quantity of image blocks, that each of the quantity of image blocks includes a specified category of image information; and selecting a maximum probability from the probabilities, features of an image block of the quantity of image blocks corresponding to the maximum probability are regarded as the local feature of the image sample.
6. The method of claim 1, wherein the global feature of the text sample is extracted by: performing word segmentation on the text sample to obtain word segments; determining vectors respectively corresponding to the word segments, the vectors sharing a same vector length; and inputting the vectors into a convolutional neural network to extract the global feature of the text sample, the convolutional neural network including a previous convolutional layer and a current convolutional layer, and a field of view of the previous convolutional layer being used as an input to the current convolutional layer.
7. The method of claim 6, wherein the global feature of the text sample is further extracted by: removing useless feature information from the text sample via a pooling layer.
8. The method of claim 1, wherein training of the image-text matching model further includes: respectively mapping the global feature of the image sample and the global feature of the text sample through at least two fully connected layers into a specified semantic space; and respectively mapping the local feature of the image sample and the local feature of the text sample through the at least two fully connected layers into the specified semantic space.
9. The method of claim 1, wherein the method and training of the image-text matching model are both performed by a same computing device or performed by different computing devices.
10. The method of claim 1, wherein the global feature of the image sample is extracted using a global image convolutional neural network, wherein the local feature of the image sample is extracted using a local image convolutional neural network, wherein the global feature of the text sample is extracted using a global text encoder, or wherein the local feature of the text sample is extracted using a local text encoder.
11. An apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform a method including: extracting a global feature of an image sample, and extracting a local feature of the image sample; extracting a global feature of a text sample, and extracting a local feature of the text sample; and training an image-text matching model according to the global feature and the local feature of the image sample and according to the global feature and the local feature of the text sample, wherein the global feature of the image sample includes a feature of a fully connected layer of a convolutional neural network of the image sample, and wherein each node of the fully connected layer is connected to all nodes of a previous layer of the convolutional neural network.
12. The apparatus of claim 11, wherein the processor is configured to execute the computer program instructions and further perform: determining, via the image-text matching model, and according to a global feature and a local feature of an inputted image and a global feature and a local feature of an inputted text, a matching degree between the inputted image and the inputted text.
13. The apparatus of claim 12, wherein the processor is configured to execute the computer program instructions and further perform: extracting the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text; and sending the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text to a server, for the server to determine the matching degree between the inputted image and the inputted text; or sending the inputted image or the inputted text to the server, for the server to extract the global feature and the local feature of the inputted image or the global feature and the local feature of the inputted text and then for the server to determine the matching degree between the inputted image and the inputted text.
14. The apparatus of claim 12, wherein the processor is configured to execute the computer program instructions and further perform: mapping the global feature of the inputted image and the global feature of the inputted text via the image-text matching model into a specified semantic space, to calculate a similarity between the global feature of the inputted image and the global feature of the inputted text; mapping the local feature of
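The local-feature extraction recited in claim 5 (divide the image into blocks, score each block for a specified category, keep the features of the highest-scoring block) can be sketched as follows. This is only an illustration under stated assumptions: the image is a small 2-D list of intensities, and `toy_category_probability` and `toy_block_features` are hypothetical stand-ins for the trained classifier and feature extractor that the claims leave unspecified.

```python
def split_into_blocks(image, block_h, block_w):
    """Divide a 2-D image (list of rows) into non-overlapping blocks."""
    blocks = []
    for top in range(0, len(image) - block_h + 1, block_h):
        for left in range(0, len(image[0]) - block_w + 1, block_w):
            blocks.append([row[left:left + block_w]
                           for row in image[top:top + block_h]])
    return blocks

def toy_category_probability(block):
    """Hypothetical stand-in for a classifier: mean intensity, assumed to lie in [0, 1]."""
    pixels = [p for row in block for p in row]
    return sum(pixels) / len(pixels)

def toy_block_features(block):
    """Hypothetical stand-in for a feature extractor: the flattened pixel values."""
    return [p for row in block for p in row]

def local_feature(image, block_h, block_w,
                  prob_fn=toy_category_probability,
                  feat_fn=toy_block_features):
    """Return (features, probability) of the block most likely to contain the category."""
    blocks = split_into_blocks(image, block_h, block_w)
    probabilities = [prob_fn(block) for block in blocks]
    best = max(range(len(blocks)), key=probabilities.__getitem__)
    return feat_fn(blocks[best]), probabilities[best]

if __name__ == "__main__":
    image = [
        [0.1, 0.1, 0.9, 0.8],
        [0.1, 0.1, 0.7, 0.9],
        [0.2, 0.2, 0.3, 0.3],
        [0.2, 0.2, 0.3, 0.3],
    ]
    features, probability = local_feature(image, 2, 2)
    print(features, probability)
```

A real system would replace both stand-ins with learned models; the selection logic (an argmax over per-block probabilities) is the part taken from the claim.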
based on eigen-space representations, e.g. from pose or different illumination conditions; Shape manifolds · CPC title
Proximity, similarity or dissimilarity measures · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] · CPC title
Classification of content, e.g. text, photographs or tables · CPC title