Embedding space for images with multiple text labels
US-10026020-B2 · Jul 17, 2018 · US
US11238362B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11238362-B2 |
| Application number | US-201614996959-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 15, 2016 |
| Priority date | Jan 15, 2016 |
| Publication date | Feb 1, 2022 |
| Grant date | Feb 1, 2022 |
Official abstract text for this publication.
Modeling semantic concepts in an embedding space as distributions is described. In the embedding space, both images and text labels are represented. The text labels describe semantic concepts that are exhibited in image content. In the embedding space, the semantic concepts described by the text labels are modeled as distributions. By using distributions, each semantic concept is modeled as a continuous cluster which can overlap other clusters that model other semantic concepts. For example, a distribution for the semantic concept “apple” can overlap distributions for the semantic concepts “fruit” and “tree” since “apple” can refer to both a fruit and a tree. In contrast to using distributions, conventionally configured visual-semantic embedding spaces represent a semantic concept as a single point. Thus, unlike these conventionally configured embedding spaces, the embedding spaces described herein are generated to model semantic concepts as distributions, such as Gaussian distributions, Gaussian mixtures, and so on.
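The difference between a point and a distribution can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a toy 2-D embedding space, diagonal-covariance Gaussians, hand-picked means and variances, and a hypothetical density threshold. The names `Concept`, `log_density`, and `plausible_labels` are invented for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Concept:
    """A semantic concept modeled as a Gaussian cluster (illustrative only)."""
    label: str
    mean: Sequence[float]  # center of the cluster in the embedding space
    var: Sequence[float]   # per-dimension variances (diagonal covariance assumed)

    def log_density(self, x: Sequence[float]) -> float:
        """Log of the diagonal-Gaussian density at embedding x."""
        total = 0.0
        for xi, mi, vi in zip(x, self.mean, self.var):
            total += math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi
        return -0.5 * total


# Toy values chosen only so that the three clusters overlap.
CONCEPTS = [
    Concept("apple", (0.0, 0.0), (1.0, 1.0)),
    Concept("fruit", (1.0, 0.0), (1.5, 1.5)),
    Concept("tree", (-1.0, 1.0), (1.5, 1.5)),
]


def plausible_labels(x: Sequence[float],
                     concepts: List[Concept],
                     min_log_density: float = -4.0) -> List[str]:
    """Return every label whose distribution assigns x a non-negligible density.

    The threshold is a hypothetical knob, not a value from the patent.
    """
    return [c.label for c in concepts if c.log_density(x) >= min_log_density]
```

With a single point per concept, an embedding could only be "nearest" to one label. Here an embedding that lies in the overlap of several clusters can be associated with all of them, which is the behavior the abstract describes for "apple", "fruit", and "tree".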
Opening claim text (preview).
What is claimed is:
1. A method implemented by a computing device to annotate images with determined text labels to describe content of the images, the method comprising: generating an embedding space representing both images and text labels of a text vocabulary, including: computing distributions representing semantic concepts in the embedding space rather than representing the semantic concepts as single points, the semantic concepts for which the distributions are computed being described by respective text labels of the text vocabulary and capable of being depicted in image content; determining semantic relationships between meanings of the text labels of the text vocabulary; positioning the distributions in the embedding space based on the semantic relationships determined for the respective text labels; and mapping representative images to the distributions of the embedding space, wherein the image content depicted by the representative images exemplifies corresponding semantic concepts of the distributions; determining a set of semantically meaningful image regions of a query image, the set of semantically meaningful image regions of the query image being mappable to the text labels in the embedding space; processing the set of semantically meaningful image regions of the query image to discard semantically meaningful image regions of the query image that fail to meet at least one predefined criterion and to obtain a subset of the semantically meaningful image regions of the query image that meet the at least one predefined criterion; determining, using the embedding space, at least one of the text labels of the embedding space describing at least one depicted semantic concept in criteria-meeting image regions of the query image; and annotating the query image by associating the determined text labels with the query image.
2. A method as described in claim 1, wherein the distributions are computed as Gaussian distributions representing the semantic concepts.
3. A method as described in claim 1, wherein the distributions are computed as Gaussian mixtures representing the semantic concepts.
4. A method as described in claim 1, wherein generating the embedding space further includes: processing a plurality of training images, each training image having multiple text labels, said processing including generating sets of image regions that correspond to respective labels of the multiple text labels; and setting the sets of image regions as the representative images for the mapping to the distributions of the embedding space.
5. A method as described in claim 4, wherein processing the plurality of training images includes, for each training image: determining candidate image regions for a respective set of image regions of the training image; and reducing a number of the determined candidate image regions using at least one post-processing technique.
6. A method as described in claim 5, wherein the candidate image regions are determined using geodesic object proposal.
7. A method as described in claim 5, wherein the at least one post-processing technique involves enforcing a size criterion by discarding candidate image regions having less than a threshold size.
8. A method as described in claim 5, wherein the at least one post-processing technique involves enforcing an aspect ratio criterion by discarding candidate image regions having aspect ratios outside predefined allowable aspect ratios.
9. A method as described in claim 5, wherein the at least one post-processing technique includes assigning a single candidate image region to each respective label of the multiple text labels of the training image based on a single-label embedding model.
10. A method as described in claim 1, wherein determining the at least one text label includes computing distances in the embedding space between embeddings of the semantically meaningful image regions of the query image and the distributions.
11. A method as described in claim 10, wherein the distances are computed using vectors that represent respective semantically meaningful image regions of the query image, the vectors extracted from the semantically meaningful image regions of the query image with a Convolutional Neural Network (CNN).
12. A method as described in claim 10, further comprising selecting the at least one text label for association with the query image based on the distances.
13. A method as described in claim 1, further comprising presenting indications of the criteria-meeting image regions of the query image that correspond to the at least one text label.
14. A method as described in claim 1, wherein the query image is annotated in conjunction with indexing the query image for search.
15. A method as described in claim 1, further comprising presenting the criteria-meeting image regions of the query image, the presented criteria-meeting image regions of the query image changed visually to appear different from other portions of the query image.
16. A system to annotate images with determined text labels to describe content of the image, the system comprising: one or more processors; and computer-readable storage media having stored thereon instructions that are executable by the one or more processors to perform operations comprising: processing a training image having multiple text labels, said processing including generating a set of image regions that correspond to respective labels of the multiple text labels; embedding the set of image regions within an embedding space representing semantic concepts as distributions rather than representing the semantic concepts as single points, the semantic concepts represented being described by text labels of a text vocabulary and capable of being depicted in image content, the set of image regions embedded with distributions representing the semantic concepts depicted in the image content of the set of image regions, the distributions in the embedding space positioned based on semantic relationships determined for the text labels, and determination of the semantic relationships being based on meanings of the text labels of the text vocabulary; determining a set of semantically meaningful image regions of a query image, the set of semantically meaningful image regions of the query image being mappable to the text labels in the embedding space; processing the set of semantically meaningful image regions of the query image to discard the semantically meaningful image regions of the query image that fail to meet at least one predefined criterion and to obtain a subset of semantically meaningful image regions of the query image that meet the at least one predefined criterion; determining the text labels that describe depicted semantic concepts of the query image by mapping criteria-meeting image regions of the query image to the distributions of the embedding space; and annotating the query image with at least two of the determined text labels.
17. A system as described in claim 16, further comprising presenting the image regions of the query image that correspond to the at least two determined text labels.
18. A system as described in claim 16, wherein the query image is annotated in conjunction with indexing the query image for search.
19. One or more computer-readable storage media comprising instructions stored thereon that, responsive to execution by a computing device, perform operations comprising: generating an embedding space representing both images and text labels of a text vocabulary, including
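The query-time steps recited in claims 1 and 10 to 12 (discard regions that fail a predefined criterion, compute distances between region embeddings and the distributions, select labels based on those distances) can be sketched as follows. This is a hedged illustration under several assumptions: the size and aspect-ratio criteria that claims 7 and 8 recite for training regions are reused here as example criteria for query regions, the distance is a Mahalanobis distance to a diagonal Gaussian (the claims do not name a distance), and every threshold, type name, and function name is hypothetical. The `embedding` field stands in for a CNN-derived vector as in claim 11.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    width: int
    height: int
    embedding: Sequence[float]  # stand-in for a CNN feature vector in the space


@dataclass(frozen=True)
class GaussianConcept:
    label: str
    mean: Sequence[float]
    var: Sequence[float]  # diagonal covariance (assumption for this sketch)


def meets_criteria(r: Region, min_area: int = 400,
                   min_ratio: float = 0.25, max_ratio: float = 4.0) -> bool:
    """Example criteria: minimum size and allowable aspect ratio (values assumed)."""
    if r.width <= 0 or r.height <= 0:
        return False
    ratio = r.width / r.height
    return r.width * r.height >= min_area and min_ratio <= ratio <= max_ratio


def mahalanobis(x: Sequence[float], c: GaussianConcept) -> float:
    """Distance from an embedding to a diagonal-Gaussian concept."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, c.mean, c.var)))


def annotate(regions: List[Region], concepts: List[GaussianConcept],
             max_distance: float = 2.0) -> List[Tuple[str, float]]:
    """Return (label, best distance) pairs for criteria-meeting regions, nearest first."""
    kept = [r for r in regions if meets_criteria(r)]
    best: Dict[str, float] = {}
    for r in kept:
        for c in concepts:
            d = mahalanobis(r.embedding, c)
            # Keep a label only if some region is close enough to its distribution.
            if d <= max_distance and d < best.get(c.label, math.inf):
                best[c.label] = d
    return sorted(best.items(), key=lambda kv: kv[1])
```

Because each region is scored against every distribution independently, a single query image can receive several labels, one per concept that some surviving region falls close to. Replacing the diagonal Gaussian with a Gaussian mixture (claim 3) would only change the distance function in this sketch.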
Machine learning · CPC title
Combinations of networks · CPC title
Smoothing the distance, e.g. radial basis function networks [RBFN] · CPC title
using information manually generated, e.g. tags, keywords, comments, manually generated location and time information · CPC title
Convolutional networks [CNN, ConvNet] · CPC title