Generating natural language descriptions of images

US10417557B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10417557-B2
Application number: US-201715856453-A
Country: US
Kind code: B2
Filing date: Dec 28, 2017
Priority date: Nov 14, 2014
Publication date: Sep 17, 2019
Grant date: Sep 17, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating descriptions of input images. One of the methods includes obtaining an input image; processing the input image using a first neural network to generate an alternative representation for the input image; and processing the alternative representation for the input image using a second neural network to generate a sequence of a plurality of words in a target natural language that describes the input image.
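The two-stage flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: `describe_image`, `toy_encoder`, and `toy_decoder` are hypothetical names, and the two toy functions are hand-written stand-ins for the first and second neural networks so that the example runs without any trained model.

```python
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]   # toy grayscale image: rows of pixel values in [0, 1]
Representation = List[float]        # stand-in for the "alternative representation"


def describe_image(
    image: Image,
    encoder: Callable[[Image], Representation],
    decoder: Callable[[Representation], List[str]],
) -> str:
    """Two-stage pipeline: image -> alternative representation -> word sequence."""
    representation = encoder(image)   # role of the first neural network
    words = decoder(representation)   # role of the second neural network
    return " ".join(words)


# Hypothetical stand-ins; real systems would use learned networks here.
def toy_encoder(image: Image) -> Representation:
    # Summarize the image as [mean brightness, pixel count].
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), float(len(pixels))]


def toy_decoder(representation: Representation) -> List[str]:
    # Pick an adjective from the mean brightness only.
    brightness = representation[0]
    return ["a", "bright" if brightness > 0.5 else "dark", "image"]


if __name__ == "__main__":
    print(describe_image([[0.9, 0.8], [0.7, 1.0]], toy_encoder, toy_decoder))
```

The point of the sketch is the interface: the decoder never sees the pixels, only the representation produced by the encoder.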

First claim

Opening claim text (preview).

What is claimed is:

1. One or more non-transitory computer-storage media having instructions encoded thereon that, when executed by one or more computers, cause the one or more computers to perform operations comprising: obtaining an input image; processing the input image using a first neural network to generate an alternative representation for the input image; and processing, with a long-short term memory (LSTM) neural network, the alternative representation for the input image to generate an output sequence of words in a target natural language that describes the input image, wherein: the words in the output sequence are arranged according to an output order, and processing the alternative representation for the input image comprises, for each position in the output order after an initial position: (i) identifying a word that was selected for the output sequence at a preceding position in the output order that precedes the current position; (ii) processing, with the LSTM neural network, data representing the word that was selected for the output sequence at the preceding position in the output order to generate respective word scores for words in a pre-defined set of possible words, and (iii) selecting, from the pre-defined set of possible words and based on the respective word scores, a particular word for the output sequence for the current position in the output order.

2. The one or more computer-storage media of claim 1, wherein processing the alternative representation for the input image further comprises, for the initial position in the output order: processing, with the LSTM neural network, a special start word to generate respective word scores for words in the pre-defined set of possible words; and selecting, from the pre-defined set of possible words and based on the respective word scores, a particular word for the initial position in the output order of the output sequence of words.

3. The one or more computer-storage media of claim 1, wherein processing the alternative representation for the input image further comprises: using a left-to-right beam search decoding to generate a plurality of possible sequences and a respective sequence score for each of the possible sequences; and selecting one or more highest-scoring possible sequences as descriptions of the input image.

4. The one or more computer-storage media of claim 1, wherein the first neural network and the LSTM neural network are jointly trained.

5. The one or more computer-storage media of claim 1, wherein the first neural network is a deep convolutional neural network.

6. The one or more computer-storage media of claim 5, wherein: the deep convolutional neural network comprises a plurality of core neural network layers each having a respective set of parameters; processing the input image using the first neural network comprises processing the input image through each of the core neural network layers of the deep convolutional neural network; and the alternative representation for the input image is the output generated by a last core neural network layer in the plurality of core neural network layers.

7. The one or more computer-storage media of claim 6, wherein: current values of the respective sets of parameters are determined by training a third neural network on a plurality of training images; and the third neural network includes the plurality of core neural network layers and an output layer configured to, for each training image, receive the output generated by the last core neural network layer for the training image and generate a respective score for each of a plurality of object categories, the respective score for each of the plurality of object categories representing a predicted likelihood that the training image contains an image of an object from the object category.

8. The one or more computer-storage media of claim 1, wherein the pre-defined set of possible words includes a vocabulary of words in the target natural language and a special stop word.

9. A computer-implemented method, comprising: obtaining an input image; processing the input image using a first neural network to generate an alternative representation for the input image; and processing, with a long-short term memory (LSTM) neural network, the alternative representation for the input image to generate an output sequence of words in a target natural language that describes the input image, wherein: the words in the output sequence are arranged according to an output order, and processing the alternative representation for the input image comprises, for each position in the output order after an initial position: (i) identifying a word that was selected for the output sequence at a preceding position in the output order that precedes the current position; (ii) processing, with the LSTM neural network, data representing the word that was selected for the output sequence at the preceding position in the output order to generate respective word scores for words in a pre-defined set of possible words, and (iii) selecting, from the pre-defined set of possible words and based on the respective word scores, a particular word for the output sequence for the current position in the output order.

10. The computer-implemented method of claim 9, wherein processing the alternative representation for the input image further comprises, for the initial position in the output order: processing, with the LSTM neural network, a special start word to generate respective word scores for words in the pre-defined set of possible words; and selecting, from the pre-defined set of possible words and based on the respective word scores, a particular word for the initial position in the output order of the output sequence of words.

11. The computer-implemented method of claim 9, wherein processing the alternative representation for the input image further comprises: using a left-to-right beam search decoding to generate a plurality of possible sequences and a respective sequence score for each of the possible sequences; and selecting one or more highest-scoring possible sequences as descriptions of the input image.

12. The computer-implemented method of claim 9, wherein the first neural network and the LSTM neural network are jointly trained.

13. The computer-implemented method of claim 9, wherein the first neural network is a deep convolutional neural network.

14. The computer-implemented method of claim 13, wherein: the deep convolutional neural network comprises a plurality of core neural network layers each having a respective set of parameters; processing the input image using the first neural network comprises processing the input image through each of the core neural network layers of the deep convolutional neural network; and the alternative representation for the input image is the output generated by a last core neural network layer in the plurality of core neural network layers.

15. The computer-implemented method of claim 14, wherein: current values of the respective sets of parameters are determined by training a third neural network on a plurality of training images; and the third neural network includes the plurality of core neural network layers and an output layer configured to, for each training image, receive the output generated by the last core neural network layer for the training image and generate a respective score for each of a plurality of object categories, the respective score for each of the plurality of object categories representing a predicted likelihood that the training image contains an image
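The decoding steps recited in the claims (a special start word, per-position word scores, word selection, a special stop word, and left-to-right beam search) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: `toy_step` and `TOY_TABLE` are hypothetical stand-ins for a trained LSTM, the `<start>` and `<stop>` token spellings are invented, and the image's alternative representation is assumed to enter only through the initial state passed to the decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

START, STOP = "<start>", "<stop>"   # special start / stop words (spellings are assumptions)
State = Tuple[str, ...]             # toy stand-in for the LSTM's internal state
StepFn = Callable[[State, str], Tuple[State, Dict[str, float]]]

# Hypothetical next-word probabilities standing in for a trained LSTM.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    START: {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.4, STOP: 0.1},
    "the": {"dog": 0.9, STOP: 0.1},
    "dog": {STOP: 1.0},
    "cat": {STOP: 1.0},
}


def toy_step(state: State, prev_word: str) -> Tuple[State, Dict[str, float]]:
    """Consume the previous word; return (new state, log-probability word scores)."""
    scores = {w: math.log(p) for w, p in TOY_TABLE[prev_word].items()}
    return state + (prev_word,), scores


def greedy_decode(step: StepFn, init_state: State, max_len: int = 20) -> List[str]:
    """At each position, feed the previously selected word and keep the top-scoring word."""
    state, prev, output = init_state, START, []
    for _ in range(max_len):
        state, scores = step(state, prev)
        prev = max(scores, key=scores.get)
        if prev == STOP:
            break
        output.append(prev)
    return output


def beam_search(
    step: StepFn, init_state: State, beam_width: int = 2, max_len: int = 20
) -> List[Tuple[List[str], float]]:
    """Left-to-right beam search; returns finished (words, score) pairs, best first."""
    beams = [(0.0, [], init_state, START)]   # (score, words, state, previous word)
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for score, words, state, prev in beams:
            new_state, scores = step(state, prev)
            for word, s in scores.items():
                if word == STOP:
                    finished.append((words, score + s))
                else:
                    candidates.append((score + s, words + [word], new_state, word))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]       # keep only the best partial sequences
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished


if __name__ == "__main__":
    print(greedy_decode(toy_step, ()))
    print(beam_search(toy_step, ())[0])
```

With this toy table, greedy selection commits to "a" at the first position and ends with "a dog" (probability 0.30), while beam search with width 2 also keeps "the" alive and ranks "the dog" (probability 0.36) first. That gap is one reason a decoder might score several candidate sequences rather than a single one.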

Assignees

Inventors

Classifications

  • G06N3/045 · Primary

    Combinations of networks · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • G06N3/0472 · Primary

    Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10417557B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating descriptions of input images. One of the methods includes obtaining an input image; processing the input image using a first neural network to generate an alternative representation for the input image; and processing the alternative representation for the input image using a second ne…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/045. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 17, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).