Multimodal Image Classifier using Textual and Visual Embeddings
US-2021264203-A1 · Aug 26, 2021 · US
US12062081B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12062081-B2 |
| Application number | US-202318103862-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 31, 2023 |
| Priority date | Jan 31, 2020 |
| Publication date | Aug 13, 2024 |
| Grant date | Aug 13, 2024 |
Official abstract text for this publication.
A system including one or more processors and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, cause the one or more processors to perform functions including: receiving a respective item description and at least one respective attribute value for each item of a set of items; generating at least one respective text embedding; generating a graph of the set of items based on at least co-view data to create pairs of items that are co-viewed by joining respective pairs of items; training the text embedding model and a machine learning model using a neural loss function based on the graph; and automatically determining, using the machine learning model, as trained, a label for each item of the set of items. Other embodiments are disclosed.
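The graph-building step in the abstract can be illustrated with a minimal sketch. It assumes co-view data arrives as a list of browsing sessions, each a list of item IDs, and that a pair is counted at most once per session; the publication only says that an edge weight comprises a co-view count. The function name `build_coview_graph` and the sample item IDs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_coview_graph(sessions):
    """Return {(item_a, item_b): co_view_count}, with item_a < item_b."""
    edge_weights = Counter()
    for viewed_items in sessions:
        # Assumption: a pair counts at most once per session, so repeated
        # views of the same item within a session are deduplicated first.
        unique_items = sorted(set(viewed_items))
        for item_a, item_b in combinations(unique_items, 2):
            edge_weights[(item_a, item_b)] += 1
    return dict(edge_weights)


# Hypothetical sessions for illustration only.
sessions = [
    ["shirt_1", "shirt_2", "jeans_7"],
    ["shirt_1", "shirt_2"],
    ["jeans_7", "shirt_2", "shirt_2"],
]
graph = build_coview_graph(sessions)
```

Each key is an undirected edge joining two co-viewed items, and each value is the number of sessions in which both items appeared.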
Opening claim text (preview).
What is claimed:

1. A system comprising: one or more processors; and one or more non-transitory computer-readable media storing computing instructions that, when executed on the one or more processors, cause the one or more processors to perform functions comprising: receiving a respective item description and at least one respective attribute value for each item of a set of items; generating at least one respective text embedding using a text embedding model for each item of the set of items; generating a graph of the set of items based on at least co-view data to create pairs of items that are co-viewed by joining respective pairs of items that are connected by a set of edges, wherein each pair of items joined by a respective edge of the set of edges in the graph has been viewed together in one or more respective sessions, and the respective edge comprises a respective weight comprising a co-view count of a respective pair of items; training the text embedding model and a machine learning model using a neural loss function based on the graph; and automatically determining, using the machine learning model, as trained, a label for each item of the set of items.
2. The system of claim 1, wherein: the text embedding model is a Bidirectional Encoder Representations from Transformers (“BERT”); and an output from the text embedding model comprises a vector representation.
3. The system of claim 1, wherein the set of edges comprises (a) one or more unlabeled-unlabeled edges, (b) one or more labeled-unlabeled edges, and (c) one or more labeled-labeled edges.
4. The system of claim 1, wherein training the text embedding model and the machine learning model using the neural loss function based on the graph further comprises: training the machine learning model with the neural loss function based on first distances between first text embeddings for first pairs of nodes connected by one or more labeled-labeled edges, second distances between second text embeddings for second pairs of nodes connected by one or more labeled-unlabeled edges, third distances between third text embeddings for third pairs of nodes connected by one or more unlabeled-unlabeled edges, and a softmax loss cost function for fourth text embeddings of nodes of the graph that are labeled.
5. The system of claim 1, wherein the computing instructions, when executed on the one or more processors, further cause the one or more processors to perform a function comprising: determining, based on an image embedding model, as trained, a label for each second item of the set of items that does not meet a predetermined threshold.
6. The system of claim 5, wherein the predetermined threshold is 5.
7. The system of claim 5, wherein the computing instructions, when executed on the one or more processors, further cause the one or more processors to perform a function comprising: transforming an image into a vector representing the image using a residual neural network (“ResNet”).
8. The system of claim 1, wherein the computing instructions when executed on the one or more processors, further cause the one or more processors, to perform a function comprising: training an image embedding model based on images of items from an item catalog database using loss equations to minimize a distance between text representations and image representations for the items.
9. The system of claim 8, wherein the images of the items from the item catalog database depict items of clothing.
10. The system of claim 1, wherein the at least one respective attribute value comprises a gender classification.
11. A method being implemented via execution of computing instructions configured to run on one or more processors and stored at one or more non-transitory computer-readable media, the method comprising: receiving a respective item description and at least one respective attribute value for each item of a set of items; generating at least one respective text embedding using a text embedding model for each item of the set of items; generating a graph of the set of items based on at least co-view data to create pairs of items that are co-viewed by joining respective pairs of items that are connected by a set of edges, wherein each pair of items joined by a respective edge of the set of edges in the graph has been viewed together in one or more respective sessions, and the respective edge comprises a respective weight comprising a co-view count of a respective pair of items; training the text embedding model and a machine learning model using a neural loss function based on the graph; and automatically determining, using the machine learning model, as trained, a label for each item of the set of items.
12. The method of claim 11, wherein: the text embedding model is a Bidirectional Encoder Representations from Transformers (“BERT”); and an output from the text embedding model comprises a vector representation.
13. The method of claim 11, wherein the set of edges comprises (a) one or more unlabeled-unlabeled edges, (b) one or more labeled-unlabeled edges, and (c) one or more labeled-labeled edges.
14. The method of claim 11, wherein training the text embedding model and the machine learning model using the neural loss function based on the graph further comprises: training the machine learning model with the neural loss function based on first distances between first text embeddings for first pairs of nodes connected by one or more labeled-labeled edges, second distances between second text embeddings for second pairs of nodes connected by one or more labeled-unlabeled edges, third distances between third text embeddings for third pairs of nodes connected by one or more unlabeled-unlabeled edges, and a softmax loss cost function for fourth text embeddings of nodes of the graph that are labeled.
15. The method of claim 11 further comprising: determining, based on an image embedding model, as trained, a label for each second item of the set of items that does not meet a predetermined threshold.
16. The method of claim 15, wherein the predetermined threshold is 5.
17. The method of claim 15 further comprising: transforming an image into a vector representing the image using a residual neural network (“ResNet”).
18. The method of claim 11 further comprising: training an image embedding model based on images of items from an item catalog database using loss equations to minimize a distance between text representations and image representations for the items.
19. The method of claim 18, wherein the images of the items from the item catalog database depict items of clothing.
20. The method of claim 11, wherein the at least one respective attribute value comprises a gender classification.
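Claims 4 and 14 describe a loss that combines embedding distances over labeled-labeled, labeled-unlabeled, and unlabeled-unlabeled edges with a softmax loss on labeled nodes. The sketch below is one possible reading, not the patented formula: it assumes squared Euclidean distance, assumes each pairwise term is scaled by the edge's co-view weight, and introduces hypothetical per-edge-type coefficients `alphas`. None of these choices is stated in the claims.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def softmax_cross_entropy(logits, label_index):
    """Softmax loss for one labeled node, computed in a numerically stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label_index]


def graph_regularized_loss(embeddings, logits, labels, edges, alphas=(1.0, 1.0, 1.0)):
    """Softmax loss on labeled nodes plus distance penalties over graph edges.

    embeddings: {item: vector}, text embeddings for every node
    logits:     {item: class scores}, needed for labeled nodes only
    labels:     {item: class index}, labeled nodes only
    edges:      {(item_a, item_b): co-view weight}
    alphas:     hypothetical coefficients for (LL, LU, UU) edges
    """
    alpha_ll, alpha_lu, alpha_uu = alphas
    supervised = sum(softmax_cross_entropy(logits[i], y) for i, y in labels.items())
    pairwise = 0.0
    for (a, b), weight in edges.items():
        # Count how many endpoints are labeled: 0 -> UU, 1 -> LU, 2 -> LL.
        n_labeled = (a in labels) + (b in labels)
        alpha = (alpha_uu, alpha_lu, alpha_ll)[n_labeled]
        pairwise += alpha * weight * squared_distance(embeddings[a], embeddings[b])
    return supervised + pairwise


# Hypothetical toy data: "a" and "b" are labeled, "c" is unlabeled.
embeddings = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
labels = {"a": 0, "b": 1}
logits = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
edges = {("a", "b"): 2, ("b", "c"): 1}
loss = graph_regularized_loss(embeddings, logits, labels, edges)
```

In a real training loop the embeddings and logits would come from the text embedding model and the classifier, and the loss would be minimized with an automatic-differentiation framework; plain Python lists are used here only to keep the arithmetic visible.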
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Activation functions · CPC title
Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title