Pretraining framework for neural networks
US-2023019211-A1 · Jan 19, 2023 · US
US11928854B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11928854-B2 |
| Application number | US-202318144045-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 5, 2023 |
| Priority date | May 6, 2022 |
| Publication date | Mar 12, 2024 |
| Grant date | Mar 12, 2024 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for object detection. In one aspect, a method comprises: obtaining: (i) an image, and (ii) a set of one or more query embeddings, wherein each query embedding represents a respective category of object; processing the image and the set of query embeddings using an object detection neural network to generate object detection data for the image, comprising: processing the image using an image encoding subnetwork of the object detection neural network to generate a set of object embeddings; processing each object embedding using a localization subnetwork to generate localization data defining a corresponding region of the image; and processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork to generate, for each object embedding, a respective classification score distribution over the set of query embeddings.
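The classification step in the abstract can be illustrated with a minimal sketch. It assumes the object embeddings have already been projected into the same space as the query embeddings, uses an inner product as the similarity measure, and normalizes the scores with a softmax. The softmax, the helper names (`softmax`, `dot`, `classify`), and the toy vectors are illustrative assumptions, not details taken from the patent text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify(object_embeddings, query_embeddings):
    """Return, for each object embedding, a score distribution over the queries.

    Assumption: the embeddings already live in a shared latent space.
    """
    return [
        softmax([dot(obj, query) for query in query_embeddings])
        for obj in object_embeddings
    ]


# Toy data (hypothetical): two detected regions, two category queries.
object_embeddings = [[2.0, 0.1], [0.0, 1.5]]
query_embeddings = [[1.0, 0.0], [0.0, 1.0]]

distributions = classify(object_embeddings, query_embeddings)

# Index of the highest-scoring query for each object embedding.
best = [max(range(len(d)), key=d.__getitem__) for d in distributions]
```

Because the categories enter only as query embeddings, the set of categories can be changed at inference time by supplying a different list of queries; nothing in `classify` depends on a fixed label set.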
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: obtaining: (i) an image, and (ii) a set of one or more query embeddings, wherein each query embedding represents a respective category of object; processing the image and the set of query embeddings using an object detection neural network to generate object detection data for the image, comprising: processing the image using an image encoding subnetwork of the object detection neural network to generate a set of object embeddings, wherein the image encoding subnetwork comprises one or more self-attention neural network layers; processing each object embedding using a localization subnetwork of the object detection neural network to generate localization data defining a corresponding region of the image; and processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork of the object detection neural network to generate, for each object embedding, a respective classification score distribution over the set of query embeddings, wherein the respective classification score distribution for each of the object embeddings defines, for each query embedding, a likelihood that the region of the image corresponding to the object embedding depicts an object that is included in the category represented by the query embedding.
2. The method of claim 1, wherein for one or more of the query embeddings, obtaining the query embedding comprises: obtaining a text sequence that describes a category of object; and processing the text sequence using a text encoding subnetwork of the object detection neural network to generate the query embedding; wherein the image encoding subnetwork and the text encoding subnetwork are pre-trained, wherein the pre-training includes repeatedly performing operations comprising: obtaining: (i) a training image, (ii) a positive text sequence, wherein the positive text sequence characterizes the training image, and (iii) one or more negative text sequences, wherein the negative text sequences do not characterize the training image; generating an embedding of the training image using the image encoding subnetwork, comprising: processing the training image using the image encoding subnetwork to generate a set of object embeddings for the training image; and processing the object embeddings using an embedding neural network to generate the embedding of the training image; generating respective embeddings of the positive text sequence and each of the negative text sequences using the text encoding subnetwork; and jointly training the image encoding subnetwork and the text encoding subnetwork to encourage: (i) greater similarity between the embedding of the training image and the embedding of the positive text sequence, (ii) lesser similarity between the embedding of the training image and the embeddings of the negative text sequences, comprising: jointly training the image encoding subnetwork and the text encoding subnetwork to optimize an objective function that includes a contrastive loss term.
3. The method of claim 2, wherein the embedding neural network is jointly trained along with the image encoding subnetwork and the text encoding subnetwork.
4. The method of claim 2, wherein after the pre-training of the image encoding subnetwork and the text encoding subnetwork, the object detection neural network is trained to optimize an objective function that measures performance of the object detection neural network on a task of object detection in images.
5. The method of claim 4, wherein the objective function that measures performance of the object detection neural network on the task of object detection in images comprises a bipartite matching loss term.
6. The method of claim 2, wherein processing: (i) the set of object embeddings, and (ii) the set of query embeddings, using a classification subnetwork of the object detection neural network to generate, for each object embedding, a respective classification score distribution over the set of query embeddings, comprises: processing each object embedding using one or more neural network layers of the classification neural network to generate a corresponding classification embedding; and generating, for each object embedding, the classification score distribution over the set of query embeddings using: (i) the classification embedding corresponding to the object embedding, and (ii) the query embeddings, comprising: generating a respective measure of similarity between the classification embedding and each query embedding, wherein the measure of similarity between the classification embedding and a query embedding defines a likelihood that the region of the image corresponding to the object embedding depicts an object that is included in the category represented by the query embedding.
7. The method of claim 6, wherein processing each object embedding using one or more neural network layers of the classification neural network to generate a corresponding classification embedding comprises: generating each classification embedding by projecting the corresponding object embedding into a latent space that includes the query embeddings.
8. The method of claim 6, wherein generating the respective measure of similarity between the classification embedding and each query embedding comprises, for each query embedding: computing an inner product between the classification embedding and the query embedding.
9. The method of claim 2, wherein for each object embedding, processing the object embedding using the localization subnetwork to generate localization data defining the corresponding region of the image comprises: processing the object embedding using the localization subnetwork to generate localization data defining a bounding box in the image.
10. The method of claim 1, wherein processing the image using the image encoding subnetwork to generate the set of object embeddings comprises: generating a set of initial object embeddings by an embedding layer of the image encoding subnetwork, wherein each initial object embedding is derived at least in part from a corresponding patch in the image; and processing the set of initial object embeddings by a plurality of neural network layers, including the one or more self-attention neural network layers of the image encoding subnetwork, to generate a set of final object embeddings.
11. The method of claim 10, wherein processing an object embedding using the localization subnetwork to generate localization data defining the corresponding region of the image comprises: generating a set of offset coordinates, wherein the offset coordinates define an offset of the corresponding region of the image from a location of the image patch corresponding to the object embedding.
12. The method of claim 2, wherein the text encoding subnetwork comprises one or more self-attention neural network layers.
13. The method of claim 2, further comprising, for one or more of the object embeddings: determining that the region of the image corresponding to the object embedding depicts an object that is included in the category represented by a query embedding based on the classification score distribution for the object embedding.
14. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: obtaining: (i) an image, and (ii) a set of one
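The pre-training objective in claim 2 names only "a contrastive loss term" and does not fix its form. As a hedged illustration, the sketch below uses one common choice: a softmax cross-entropy over similarities with the positive text as the target (an InfoNCE-style loss). The cosine similarity, the `temperature` parameter, the function names, and the toy vectors are all assumptions made for this example and are not taken from the patent.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)


def contrastive_loss(image_emb, positive_emb, negative_embs, temperature=0.1):
    """InfoNCE-style loss: negative log-probability of the positive text.

    Assumption: similarity is cosine similarity scaled by a temperature.
    """
    logits = [cosine(image_emb, positive_emb) / temperature]
    logits += [cosine(image_emb, neg) / temperature for neg in negative_embs]
    # Stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (hypothetical): one positive caption, one negative caption.
positive_text = [1.0, 0.0]
negative_texts = [[0.0, 1.0]]

# Image embedding close to the positive text: low loss.
aligned = contrastive_loss([1.0, 0.0], positive_text, negative_texts)

# Image embedding close to the negative text: high loss.
misaligned = contrastive_loss([0.0, 1.0], positive_text, negative_texts)
```

Minimizing a loss of this shape pushes the image embedding toward the positive text embedding and away from the negatives, which matches the two similarity goals stated in the claim.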
Artificial neural networks [ANN] · CPC title
Architecture, e.g. interconnection topology · CPC title
using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching · CPC title
Validation; Performance evaluation · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title