Generating labeled training images for use in training a computational neural network for object or action recognition
US-11551079-B2 · Jan 10, 2023 · US
US12131365B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12131365-B2 |
| Application number | US-202016828776-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 24, 2020 |
| Priority date | Mar 25, 2019 |
| Publication date | Oct 29, 2024 |
| Grant date | Oct 29, 2024 |
Official abstract text for this publication.
A search engine server includes a communication interface through which to receive a multi-modal query from a browser of a client device, the multi-modal query including at least a first image of an item. A processing device, coupled to the communication interface, is to: execute a neural network (NN) regressor model on the first image to identify a plurality of second items that are similar to and compatible with the item depicted in the first image, wherein a set of images correspond to the plurality of second items; generate structured text that explains, within one of a phrase or a sentence, why the set of images are relevant to the item; and return, to the browser of the client device via the communication interface, a set of search results comprising the set of images and the structured text.
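The explanation step mentioned in the abstract is spelled out in claim 5 of this publication: take the most associated result items, pass each item's per-term scores through a sigmoid activation, average the activations per characteristic term, and use the highest-scoring terms in the structured text. The following is a minimal sketch of that averaging step only, under stated assumptions. The function names (`top_terms`, `explain`), the toy lexicon, the logit values, and the sentence template are all hypothetical and do not come from the patent.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def top_terms(item_logits, lexicon, n_terms=3):
    """Return the n_terms lexicon terms with the highest average activation.

    item_logits holds one list of raw scores per result item, with one
    score per lexicon term (hypothetical data layout).
    """
    n_items = len(item_logits)
    averages = []
    for j, term in enumerate(lexicon):
        avg = sum(sigmoid(logits[j]) for logits in item_logits) / n_items
        averages.append((avg, term))
    # Highest average activation first.
    averages.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in averages[:n_terms]]


def explain(terms):
    """Fill a hypothetical sentence template with the selected terms."""
    return "Recommended because these items are " + ", ".join(terms) + "."


# Toy inputs: four lexicon terms, two result items (illustrative values).
lexicon = ["denim", "blue", "floral", "leather"]
item_logits = [
    [2.0, 1.0, -3.0, -1.0],
    [1.5, 0.5, -2.0, -2.0],
]

terms = top_terms(item_logits, lexicon, n_terms=2)
explanation = explain(terms)
print(explanation)
```

Averaging over several top items, rather than reading the terms off a single item, favors characteristics shared across the result set, which is what an explanation of the whole set needs.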
Opening claim text (preview).
What is claimed is:

1. A search engine server comprising: a communication interface to receive a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item; and a processing device coupled to the communication interface, the processing device to: train a neural network (NN) regressor model by iteratively performing operations comprising: computing a visual semantic embedding for a training image categorized similar to the first image, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space; executing, on the visual semantic embedding, one or more sets of fully connected NN layers and rectifier linear unit layers to generate intermediate NN vector outputs; executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers; executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon; and minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image; execute the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image; generate, using characteristic terms corresponding to highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and return, to the browser of the client device, a set of search results comprising a set of images, corresponding to the plurality of second items, and the structured text.

2. The search engine of claim 1, wherein the item and the plurality of second items are fashion items and the NN regressor model is trained to recognize the fashion items via use of a text lexicon of fashion characteristics, the fashion characteristics categorized within a group comprising type, color, material, style, shape, pattern, trim, and brand.

3. The search engine of claim 1, wherein the multi-modal query further comprises one or more word that describes the first image, and wherein the processing device is further to: execute the trained NN regressor model on the first image and the one or more word to identify the plurality of second items; and generate the structured text that explains why the set of images are relevant to the item, as described by the one or more word.

4. The search engine server of claim 1, wherein to train the NN regressor model, the processing device is further to calculate a mean square loss between the predicted values within the complete text vector and the predicted individual term values for corresponding individual terms generated using the individual term predictor to determine which of the characteristic terms are most descriptive of the training image.

5. The search engine server of claim 1, wherein the multi-modal query further comprises one or more word that describes the first image, and wherein, to generate the structured text, the processing device is further to: determine a number of most associated items of the plurality of second items; compute, using a sigmoid activation layer, an average activation score for each characteristic term, in a text lexicon, associated with each respective second item of the number of most associated items; determine a subset of the characteristic terms that have a highest average activation score; and employ the subset of the characteristic terms within the structured text that forms explanations for delivery of the set of images in the set of search results.

6. The search engine server of claim 5, wherein, to determine the number of most associated items, the processing device is further to: compute a visual semantic embedding of each second item of the plurality of second items, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space; compute association scores for the one or more word of the multi-modal query via comparison of the one or more word to each respective visual semantic embedding; and return a subset of the second items that has a highest association score of the computed association scores as the number of most associated items of the plurality of second items.

7. The search engine of claim 1, wherein executing the complete vector predictor and executing the individual term predictor are performed in parallel.

8. A method comprising: receiving, via a communication interface, a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item; training, using a processing device, a neural network (NN) regressor model by iteratively performing operations comprising: computing a visual semantic embedding for a training image categorized similar to the first image, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space; executing, on the visual semantic embedding, one or more sets of fully connected NN layers and rectifier linear unit layers to generate intermediate NN vector outputs; executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers; executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon; and minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image; executing the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image; generating, by the processing device and using characteristic terms corresponding to highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and returning, to the browser of the client device, a set of search results comprising a set of images, corresponding to the plurality of second items, and the structured text.

9. The method of claim 8, wherein the item and the plurality of second items are fashion items and the NN regressor model is trained to recognize the fashion items via use of a text lexicon of fashion characteristics, the fashion characteristics categorized within a group comprising t
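The training operations recited in claims 1, 4, and 8 can be illustrated with a small forward pass: a fully connected layer followed by a rectifier (ReLU) produces intermediate outputs, a complete vector predictor maps them to the whole text vector as a group, an individual term predictor uses one neuron per lexicon term, and a mean square loss compares the two sets of predictions. The sketch below is a toy illustration in plain Python, not the patented implementation. All weights, dimensions, parameter names, and the target vector are hypothetical, and adding supervised mean square terms against a target vector is an assumption, since the claims do not state how the loss terms are combined.

```python
def relu(v):
    """Rectifier linear unit applied elementwise."""
    return [max(0.0, x) for x in v]


def dense(weights, bias, v):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def mse(a, b):
    """Mean square difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def forward(embedding, params):
    """Run the shared layers, then both predictor heads."""
    hidden = relu(dense(params["w_hidden"], params["b_hidden"], embedding))
    # Complete vector predictor: all term values predicted as a group.
    complete = dense(params["w_complete"], params["b_complete"], hidden)
    # Individual term predictor: one separate neuron per lexicon term.
    individual = [
        sum(w * h for w, h in zip(neuron_w, hidden)) + neuron_b
        for neuron_w, neuron_b in zip(params["w_terms"], params["b_terms"])
    ]
    return complete, individual


def combined_loss(complete, individual, target):
    """Agreement term (claim 4) plus supervised terms (an assumption)."""
    agreement = mse(complete, individual)
    supervised = mse(complete, target) + mse(individual, target)
    return agreement + supervised


# Hypothetical toy parameters: 2-d embedding, 2 hidden units, 2 terms.
params = {
    "w_hidden": [[1.0, 0.0], [0.0, -1.0]], "b_hidden": [0.0, 0.0],
    "w_complete": [[0.5, 0.0], [1.0, 0.0]], "b_complete": [0.0, 0.0],
    "w_terms": [[0.5, 0.0], [0.0, 1.0]], "b_terms": [0.0, 0.0],
}
embedding = [1.0, 2.0]   # stand-in for a visual semantic embedding
target = [1.0, 0.0]      # stand-in for the labeled text vector

complete, individual = forward(embedding, params)
loss = combined_loss(complete, individual, target)
print(complete, individual, loss)
```

In a real training loop the loss would be minimized by gradient descent over many training images. Because the two heads share the same intermediate outputs, they can be evaluated in parallel, as claim 7 notes.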
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Classification techniques · CPC title
Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title