Search engine use of neural network regressor for multi-modal item recommendations based on visual semantic embeddings

US12131365B2 · US · B2

Patent metadata
Publication number: US-12131365-B2
Application number: US-202016828776-A
Country: US
Kind code: B2
Filing date: Mar 24, 2020
Priority date: Mar 25, 2019
Publication date: Oct 29, 2024
Grant date: Oct 29, 2024

Abstract

Official abstract text for this publication.

A search engine server includes a communication interface through which to receive a multi-modal query from a browser of a client device, the multi-modal query including at least a first image of an item. A processing device, coupled to the communication interface, is to: execute a neural network (NN) regressor model on the first image to identify a plurality of second items that are similar to and compatible with the item depicted in the first image, wherein a set of images correspond to the plurality of second items; generate structured text that explains, within one of a phrase or a sentence, why the set of images are relevant to the item; and return, to the browser of the client device via the communication interface, a set of search results comprising the set of images and the structured text.
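The explanation step in the abstract can be illustrated with a small sketch. Claim 5 describes averaging a sigmoid activation score per characteristic term across the most associated items and keeping the terms with the highest averages. The code below is only one possible reading of that step: the function names `top_terms` and `explain`, the four-word lexicon, the raw scores, and the sentence template are all hypothetical and do not come from the patent.

```python
import math

def sigmoid(x):
    """Logistic activation, mapping a raw term score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def top_terms(term_scores_per_item, lexicon, k=3):
    """Average the sigmoid activation of each lexicon term over all items
    and return the k terms with the highest average activation."""
    n_items = len(term_scores_per_item)
    averages = []
    for j, term in enumerate(lexicon):
        avg = sum(sigmoid(scores[j]) for scores in term_scores_per_item) / n_items
        averages.append((avg, term))
    averages.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in averages[:k]]

def explain(terms):
    """Hypothetical sentence template for the structured text."""
    return "Recommended because these items are " + ", ".join(terms) + "."

# Assumed toy data: raw term scores for two retrieved items.
lexicon = ["floral", "denim", "striped", "leather"]
scores = [
    [2.0, -1.0, 0.5, -3.0],
    [1.5, -0.5, 1.0, -2.0],
]
best = top_terms(scores, lexicon, k=2)
print(best)           # ['floral', 'striped']
print(explain(best))  # Recommended because these items are floral, striped.
```

Averaging over several items, rather than reading the scores of a single item, favors terms that the retrieved set shares, which is what an explanation of the whole result set needs.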

First claim

Opening claim text (preview).

What is claimed is:

1. A search engine server comprising:
  a communication interface to receive a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item; and
  a processing device coupled to the communication interface, the processing device to:
    train a neural network (NN) regressor model by iteratively performing operations comprising:
      computing a visual semantic embedding for a training image categorized similar to the first image, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space;
      executing, on the visual semantic embedding, one or more sets of fully connected NN layers and rectifier linear unit layers to generate intermediate NN vector outputs;
      executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers;
      executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon; and
      minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image;
    execute the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image;
    generate, using characteristic terms corresponding to highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and
    return, to the browser of the client device, a set of search results comprising a set of images, corresponding to the plurality of second items, and the structured text.

2. The search engine of claim 1, wherein the item and the plurality of second items are fashion items and the NN regressor model is trained to recognize the fashion items via use of a text lexicon of fashion characteristics, the fashion characteristics categorized within a group comprising type, color, material, style, shape, pattern, trim, and brand.

3. The search engine of claim 1, wherein the multi-modal query further comprises one or more word that describes the first image, and wherein the processing device is further to:
  execute the trained NN regressor model on the first image and the one or more word to identify the plurality of second items; and
  generate the structured text that explains why the set of images are relevant to the item, as described by the one or more word.

4. The search engine server of claim 1, wherein to train the NN regressor model, the processing device is further to calculate a mean square loss between the predicted values within the complete text vector and the predicted individual term values for corresponding individual terms generated using the individual term predictor to determine which of the characteristic terms are most descriptive of the training image.

5. The search engine server of claim 1, wherein the multi-modal query further comprises one or more word that describes the first image, and wherein, to generate the structured text, the processing device is further to:
  determine a number of most associated items of the plurality of second items;
  compute, using a sigmoid activation layer, an average activation score for each characteristic term, in a text lexicon, associated with each respective second item of the number of most associated items;
  determine a subset of the characteristic terms that have a highest average activation score; and
  employ the subset of the characteristic terms within the structured text that forms explanations for delivery of the set of images in the set of search results.

6. The search engine server of claim 5, wherein, to determine the number of most associated items, the processing device is further to:
  compute a visual semantic embedding of each second item of the plurality of second items, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space;
  compute association scores for the one or more word of the multi-modal query via comparison of the one or more word to each respective visual semantic embedding; and
  return a subset of the second items that has a highest association score of the computed association scores as the number of most associated items of the plurality of second items.

7. The search engine of claim 1, wherein executing the complete vector predictor and executing the individual term predictor are performed in parallel.

8. A method comprising:
  receiving, via a communication interface, a multi-modal query from a browser of a client device, the multi-modal query comprising at least a first image of an item;
  training, using a processing device, a neural network (NN) regressor model by iteratively performing operations comprising:
    computing a visual semantic embedding for a training image categorized similar to the first image, the visual semantic embedding comprising feature vectors mapped within a type-specific feature space;
    executing, on the visual semantic embedding, one or more sets of fully connected NN layers and rectifier linear unit layers to generate intermediate NN vector outputs;
    executing a complete vector predictor on the intermediate NN vector outputs to predict values of a complete text vector corresponding to the training image, the complete text vector comprising a subset of characteristic terms of a text lexicon of item characteristics that are predicted as a group with one or more NN layers;
    executing an individual term predictor on the intermediate NN vector outputs to separately predict individual term values using corresponding individual NN neurons, wherein the individual term values are separately related to respective characteristic terms of the text lexicon; and
    minimizing a loss function using the predicted values of the complete text vector and the predicted individual term values generated by the individual term predictor to generate a final subset of predicted term values that trains the NN regressor model to generate a string of the characteristic terms most associated with the training image;
  executing the trained NN regressor model on the first image to identify a plurality of second items that are one of similar to or compatible with the item depicted in the first image;
  generating, by the processing device and using characteristic terms corresponding to highest values of the final subset of predicted term values over multiple training iterations, structured text that explains, within one of a phrase or a sentence, why the plurality of second items are relevant to the item; and
  returning, to the browser of the client device, a set of search results comprising a set of images, corresponding to the plurality of second items, and the structured text.

9. The method of claim 8, wherein the item and the plurality of second items are fashion items and the NN regressor model is trained to recognize the fashion items via use of a text lexicon of fashion characteristics, the fashion characteristics categorized within a group comprising t
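The training operations recited in claims 1 and 4 can be pictured with a minimal forward-pass sketch: an embedding passes through a fully connected layer with ReLU, and the resulting intermediate vector feeds both a complete vector predictor (one layer predicting all terms as a group) and an individual term predictor (one neuron per term). Everything concrete here is an assumption made for illustration: the class name `TwoHeadRegressor`, the layer sizes, the random initialization, and especially the way `combined_loss` sums the three mean square terms. The claims state only that a loss using both sets of predictions is minimized. Gradient updates are omitted.

```python
import random

def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]

def linear(weights, bias, v):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (assumed initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class TwoHeadRegressor:
    """Hypothetical two-head regressor over a visual semantic embedding."""

    def __init__(self, embed_dim, hidden_dim, lexicon_size, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(hidden_dim, embed_dim, rng)
        # Complete vector predictor: all lexicon terms predicted as a group.
        self.complete_head = init_layer(lexicon_size, hidden_dim, rng)
        # Individual term predictor: one separate neuron per lexicon term.
        self.term_neurons = [init_layer(1, hidden_dim, rng) for _ in range(lexicon_size)]

    def forward(self, embedding):
        hidden = relu(linear(*self.trunk, embedding))  # intermediate NN vector outputs
        complete = linear(*self.complete_head, hidden)
        individual = [linear(w, b, hidden)[0] for w, b in self.term_neurons]
        return complete, individual

def mse(a, b):
    """Mean square error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def combined_loss(complete, individual, target):
    """Assumed combination: each head against the target text vector,
    plus the mean square loss between the two heads (compare claim 4)."""
    return mse(complete, target) + mse(individual, target) + mse(complete, individual)

# Toy usage with an assumed 4-dimensional embedding and 5-term lexicon.
model = TwoHeadRegressor(embed_dim=4, hidden_dim=3, lexicon_size=5)
complete, individual = model.forward([0.1, 0.2, 0.3, 0.4])
loss = combined_loss(complete, individual, [1.0, 0.0, 0.0, 1.0, 0.0])
print(len(complete), len(individual), loss >= 0.0)
```

In this sketch the two heads are evaluated one after the other on the same intermediate vector; claim 7 notes that they may instead be executed in parallel, which is possible because neither head depends on the other's output.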

Assignees

Board Of Trustees Of The Univ Of Illinois, Univ Illinois

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet]

  • Supervised learning

  • Classification techniques

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12131365B2 cover?
A search engine server includes a communication interface through which to receive a multi-modal query from a browser of a client device, the multi-modal query including at least a first image of an item. A processing device, coupled to the communication interface, is to: execute a neural network (NN) regressor model on the first image to identify a plurality of second items that are similar to…
Who is the assignee on this patent?
Board Of Trustees Of The Univ Of Illinois, Univ Illinois
What technology area does this patent fall under?
Primary CPC classification G06Q30/0627. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 29, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).