Method of bidirectional image-text retrieval based on multi-view joint embedding space

US11106951B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11106951-B2
Application number | US-201816622570-A
Country | US
Kind code | B2
Filing date | Jan 29, 2018
Priority date | Jul 6, 2017
Publication date | Aug 31, 2021
Grant date | Aug 31, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A bidirectional image-text retrieval method based on a multi-view joint embedding space includes: performing retrieval with reference to a semantic association relationship at a global level and a local level, obtaining the semantic association relationship at the global level and the local level in a frame-sentence view and a region-phrase view, and obtaining semantic association information in a global level subspace of frame and sentence in the frame-sentence view, obtaining semantic association information in a local level subspace of region and phrase in the region-phrase view, processing data by a dual-branch neural network in the two views to obtain an isomorphic feature and embedding the same in a common space, and using a constraint condition to reserve an original semantic relationship of the data during training, and merging the two semantic association relationships using multi-view merging and sorting to obtain a more accurate semantic similarity between data.
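
As a rough illustration of the final "merging and sorting" step, the sketch below ranks retrieval candidates by a combined distance. The abstract does not say how the two views are merged, so the weighted sum, the `weight` parameter, and the function names `multi_view_distance` and `rank_candidates` are all assumptions made for this example rather than the patented formula.

```python
from typing import Dict, List


def multi_view_distance(d_global: float, d_local: float, weight: float = 0.5) -> float:
    """Combine a frame-sentence (global) and a region-phrase (local) distance.

    Assumption: a simple weighted sum. The source only states that the two
    semantic association relationships are merged.
    """
    return d_global + weight * d_local


def rank_candidates(
    global_d: Dict[str, float],
    local_d: Dict[str, float],
    weight: float = 0.5,
) -> List[str]:
    """Return candidate ids sorted from smallest to largest merged distance."""
    merged = {
        cid: multi_view_distance(global_d[cid], local_d[cid], weight)
        for cid in global_d
    }
    return sorted(merged, key=lambda cid: merged[cid])


# Hypothetical distances from one query to three candidates in each view.
global_d = {"a": 0.2, "b": 0.5, "c": 0.3}
local_d = {"a": 0.9, "b": 0.1, "c": 0.2}
ranking = rank_candidates(global_d, local_d, weight=0.5)
```

With these made-up numbers, candidate "a" is closest in the global view alone, but the local view pushes it to last place once both views are combined, which is the kind of correction the two-view design is meant to allow.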

First claim

Opening claim text (preview).

What is claimed is:

1. A bidirectional image-text retrieval method based on a multi-view joint embedding space, comprising: performing bidirectional image-text retrieval with reference to semantic association relationship at a global level and a local level; for a data set $D=\{D_1, D_2, \ldots, D_{|D|}\}$, wherein each document $D_i$ in the data set includes an image $I_i$ and a related piece of text $T_i$, expressed as $D_i=(I_i, T_i)$, each piece of text including multiple sentences, wherein each sentence independently describes the matching image; in a frame-sentence view based on joint embedding of features extracted from a frame and one or more sentences, setting $f_i$ to represent an image of the training image $I_i$, wherein $\{s_{i1}, s_{i2}, \ldots, s_{ik}\}$ represents the sentence set in $T_i$, and $k$ is a number of sentences in text $T_i$; in a region-phrase view based on joint embedding of features extracted from a region and one or more phrases, setting $r_{im}$ to represent the $m$-th region extracted from frame $f_i$, and $p_{in}$ to represent the $n$-th phrase extracted from the sentence in text $T_i$; and in the said bidirectional retrieval method, obtaining firstly the semantic association relationship at the global level and the local level in the frame-sentence view and the region-phrase view, respectively, and then obtaining semantic understanding by merging the semantic association relationships; the method further comprising:

  1) extracting frames of images and sentences in texts separately, sending the images and the sentences into a model to extract the features of the data, and extracting CNN features of the frame and FV features of the sentence;

  2) sending the CNN features of the frame and the FV features of the sentence obtained in Step 1) respectively into two branches of a dual-branch neural network, and obtaining isomorphic features of the frame and sentence data through training, mapping the frames and the sentences to the global level subspace, and obtaining semantic association information of image and text data in the frame-sentence view;

  3) extracting region RCNN features of all frames by using an RCNN model, and extracting a dependency triplet of phrases of all sentences by using a parser, while retaining the region and phrase features with key information;

  4) sending the features of the region and the phrase obtained in Step 3) respectively into two branches of another dual-branch neural network, and obtaining isomorphic features of the region and phrase data through training, mapping regions and phrases to the local level subspace, and obtaining semantic association information of image and text data in the region-phrase view;

  5) merging semantic association information of the image and the text data in different views obtained in Step 2) and Step 4) by means of a merging and sorting method, to calculate a multi-view distance between the image and the text data in a multi-view joint space, which is used to measure semantic similarity as a sorting criterion in the retrieval process; and

  6) calculating distances between retrieval request data and modal data in the data set in the multi-view joint space for a retrieval request, and sorting retrieval results according to the distances.

2. A bidirectional image-text retrieval method according to claim 1, wherein in Step 1), for the image, a 4,096-dimensional CNN feature vector extracted by a 19-layer VGG model is used as an original feature of the frame; and for the text, a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) is used to extract an FV feature vector as an original feature of the sentence, wherein Principal Components Analysis is used to reduce the 18,000-dimensional feature vector to 4,999 dimensions.

3. A bidirectional image-text retrieval method according to claim 1, wherein in Step 2), the features are sent into two branches of the dual-branch neural network separately for training, and the isomorphic features of the image and the sentence data are obtained, wherein constraints are set in the process of training to preserve the inter-modal consistency and the intra-modal consistency, and an interval-based random loss function is adopted, the method further comprising:

  A. training frame $f_i$: dividing all sentences into the matching set and the non-matching set, wherein the matching set contains all sentences matching the training frames, wherein the non-matching set contains all the sentences that do not match the training frames, wherein the consistency constraint requirements include: in the frame-sentence view, the distance between the frame $f_i$ and the sentence in the matching set must be smaller than the distance between the frame $f_i$ and the sentence in the non-matching set, and the distance difference shall be larger than an interval $m$, wherein a mathematical representation is as shown in Formula (1):

    $d(f_i, s_{ix}) + m < d(f_i, s_{jy}) \quad \text{if } i \neq j \qquad (1)$

    where $d(f_i, s_{ix})$ represents the distance between the frame $f_i$ and the sentence $s_{ix}$ in the matching set; and $d(f_i, s_{jy})$ represents the distance between the frame $f_i$ and the sentence $s_{jy}$ in the non-matching set;

  B. applying the constraints in Formula (2) to the training sentence $s_{ix}$:

    $d(f_i, s_{ix}) + m < d(f_j, s_{ix}) \quad \text{if } i \neq j \qquad (2)$

    where $d(f_i, s_{ix})$ represents the distance between the sentence $s_{ix}$ and the frame $f_i$ in the matching set; and $d(f_j, s_{ix})$ represents the distance between the sentence $s_{ix}$ and the frame $f_j$ in the non-matching set;

  C. setting constraints for multiple sentences for the same frame in the data set, expressed as Formula (3):

    $d(s_{ix}, s_{iy}) + m < d(s_{ix}, s_{jz}) \quad \text{if } i \neq j \qquad (3)$

    where $d(s_{ix}, s_{iy})$ represents the distance between the sentences $s_{ix}$ and $s_{iy}$ of the same frame $f_i$; and $d(s_{ix}, s_{jz})$ represents the distance between the sentence $s_{ix}$ of frame $f_i$ and the sentence $s_{jz}$ of frame $f_j$;

  D. defining the loss function established in the frame-sentence view by Formula (4):

    $\psi_{\text{frame-sentence}} = \sum_{i,j,x,y} \max[\,0,\; m + d(f_i, s_{ix}\,\ldots$
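
The three margin constraints in Formulas (1) to (3) can each be read as a hinge penalty that is zero when the constraint holds with room to spare. The sketch below shows that reading on toy vectors. It is not the patent's Formula (4), which is cut off in this preview: the Euclidean metric, the unweighted sum of the three terms, and the names `distance`, `hinge`, and `constraint_penalties` are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings (the metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hinge(d_match: float, d_non_match: float, margin: float) -> float:
    """Zero once d_match + margin <= d_non_match, positive otherwise."""
    return max(0.0, margin + d_match - d_non_match)


def constraint_penalties(
    f_i: Vector, s_ix: Vector, s_iy: Vector,
    f_j: Vector, s_jz: Vector, margin: float,
) -> Dict[str, float]:
    """Hinge penalties for one (i, j) pair, one per constraint in the claim."""
    return {
        # Formula (1): frame f_i is closer to its own sentence than to a foreign one.
        "frame_anchor": hinge(distance(f_i, s_ix), distance(f_i, s_jz), margin),
        # Formula (2): sentence s_ix is closer to its own frame than to a foreign one.
        "sentence_anchor": hinge(distance(f_i, s_ix), distance(f_j, s_ix), margin),
        # Formula (3): sentences of the same frame stay closer than sentences of different frames.
        "sentence_pair": hinge(distance(s_ix, s_iy), distance(s_ix, s_jz), margin),
    }


# Toy 1-D embeddings: document i sits near 0, document j sits near 10.
ok = constraint_penalties((0.0,), (1.0,), (2.0,), (10.0,), (11.0,), margin=0.5)
# Moving the foreign sentence s_jz next to s_ix violates only Formula (3).
bad = constraint_penalties((0.0,), (1.0,), (2.0,), (10.0,), (1.5,), margin=0.5)
```

In the second call only the sentence-pair term becomes positive, which shows how each constraint contributes independently to a summed loss under these assumptions.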

Assignees

Inventors

Classifications

  • G06F40/205 · Primary

    Parsing · CPC title

  • in augmented reality scenes · CPC title

  • Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level (multimodal speaker identification or verification G10L17/10) · CPC title

  • using neural networks · CPC title

  • of extracted features · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11106951B2 cover?
A bidirectional image-text retrieval method based on a multi-view joint embedding space includes: performing retrieval with reference to a semantic association relationship at a global level and a local level, obtaining the semantic association relationship at the global level and the local level in a frame-sentence view and a region-phrase view, and obtaining semantic association information i…
Who is the assignee on this patent?
Univ Peking Shenzhen Graduate School, Peking Univ Shenzhen Graduate School
What technology area does this patent fall under?
Primary CPC classification G06F40/205. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 31, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).