Method, apparatus, device and medium for generating captioning information of multimedia data
US-2022014807-A1 · Jan 13, 2022 · US
US12387388B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12387388-B2 |
| Application number | US-202318149542-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 3, 2023 |
| Priority date | Jan 3, 2023 |
| Publication date | Aug 12, 2025 |
| Grant date | Aug 12, 2025 |
Official abstract text for this publication.
In one embodiment, a method includes accessing a text input and a scene input corresponding to the text input, wherein the scene input comprises semantic segmentations, generating text tokens for the text input and scene tokens for the scene input by machine-learning models, generating predicted image tokens based on the text tokens and the scene tokens by the machine-learning models, and generating an image corresponding to the text input and the scene input based on the predicted image tokens by the machine-learning models.
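The data flow described in the abstract can be sketched as four stages chained together. This is only an illustrative sketch: the `Pipeline` class, its field names, and the toy stand-in functions are hypothetical and are not taken from the patent, which does not disclose concrete interfaces here. Real stages would be trained machine-learning models.

```python
from dataclasses import dataclass
from typing import Callable, List

Tokens = List[int]
Grid = List[List[int]]


@dataclass
class Pipeline:
    """Hypothetical container for the four stages named in the abstract."""
    text_encoder: Callable[[str], Tokens]            # text input -> text tokens
    scene_encoder: Callable[[Grid], Tokens]          # segmentation map -> scene tokens
    transformer: Callable[[Tokens, Tokens], Tokens]  # -> predicted image tokens
    image_decoder: Callable[[Tokens], Grid]          # image tokens -> image

    def generate(self, text: str, scene: Grid) -> Grid:
        text_tokens = self.text_encoder(text)
        scene_tokens = self.scene_encoder(scene)
        image_tokens = self.transformer(text_tokens, scene_tokens)
        return self.image_decoder(image_tokens)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_text_encoder(text: str) -> Tokens:
    # One token per word: the word length.
    return [len(word) for word in text.split()]


def toy_scene_encoder(scene: Grid) -> Tokens:
    # Flatten the per-pixel category labels row by row.
    return [label for row in scene for label in row]


def toy_transformer(text_tokens: Tokens, scene_tokens: Tokens) -> Tokens:
    # Shift every scene token by the sum of the text tokens, modulo 256.
    offset = sum(text_tokens)
    return [(s + offset) % 256 for s in scene_tokens]


def toy_image_decoder(image_tokens: Tokens) -> Grid:
    # Reshape the token sequence into rows of width 2.
    return [image_tokens[i:i + 2] for i in range(0, len(image_tokens), 2)]


pipeline = Pipeline(toy_text_encoder, toy_scene_encoder,
                    toy_transformer, toy_image_decoder)
image = pipeline.generate("a cat", [[0, 1], [1, 2]])
print(image)
```

The point of the sketch is the ordering: text and scene inputs are tokenized independently, the predicted image tokens depend on both token sets, and the image is produced only from the predicted image tokens.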
Opening claim text (preview).
What is claimed is:

1. A method comprising, by one or more computing systems: accessing a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generating, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generating, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generating, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
2. The method of claim 1, wherein the one or more machine-learning models comprise one or more of a text encoder, a scene encoder, an image encoder, a transformer neural network model, or an image decoder.
3. The method of claim 1, wherein the one or more semantic segmentations are associated with one or more categories based on one or more of panoptic, human, or face.
4. The method of claim 3, wherein the one or more scene tokens are based on a plurality of channels, and wherein a number of the plurality of channels is based on a number of categories based on panoptic, a number of categories based on human, a number of categories based on face, and an edge channel corresponding to a map of edges separating the one or more semantic segmentations.
5. The method of claim 1, wherein the one or more text tokens are associated with a conditional token stream conditioned on the text input, wherein the method further comprises: generating an unconditional token stream conditioned on an empty text stream initialized with padding tokens, and wherein the generating the one or more predicted image tokens is based on the conditional token stream and the unconditional token stream.
6. The method of claim 5, wherein the one or more machine-learning models comprise a transformer neural network model, wherein the method further comprises: determining, by the transformer neural network model, a plurality of first probabilities associated with a plurality of predicted image tokens based on the conditional token stream and the one or more scene tokens; determining, by the transformer neural network model, a plurality of second probabilities associated with the plurality of predicted image tokens based on the unconditional token stream and the one or more scene tokens; and determining the one or more predicted image tokens from the plurality of predicted image tokens based on the plurality of first probabilities and the plurality of second probabilities.
7. The method of claim 5, further comprising: calculating one or more conditional logit scores based on the conditional token stream; calculating one or more unconditional logit scores based on the unconditional token stream; and calculating one or more guided logit scores based on the one or more conditional logit scores and the one or more unconditional logit scores, and wherein the generating the one or more predicted image tokens is based on the one or more guided logic scores.
8. The method of claim 1, further comprising: generating the one or more semantic segmentations from an existing image.
9. The method of claim 8, wherein the generating the one or more semantic segmentations is based on a segmentation model, and wherein the segmentation model is trained based on one or more of a weighted binary cross-entropy face loss applied over segmentation face parts categories or a semantic segmentation edge map comprising face parts edges.
10. The method of claim 8, wherein the one or more semantic segmentations are associated with one or more labeled categories, and wherein the method further comprises: receiving one or more edits of one or more of the labeled categories; and updating the image based on the one or more edits, wherein the updated image depicts a scene based on the edits of the one or more of the labeled categories.
11. The method of claim 8, further comprising: receiving one or more edits of the text input; and generating, based on the one or more edits of the text input and the scene input, one or more newly interpreted images for the existing image.
12. The method of claim 1, wherein the one or more semantic segmentations are created by a user.
13. The method of claim 1, wherein the generating the one or more predicted image tokens is based on an image encoder, wherein the generating the image corresponding to the text input and the scene input is based on an image decoder, and wherein the image encoder or the image decoder is trained based on a feature-matching loss over activations of a pre-trained face-embedding network comparing between reconstructed face crops and ground-truth face crops.
14. The method of claim 1, wherein the generating the one or more predicted image tokens is based on an image encoder, wherein the generating the image corresponding to the text input and the scene input is based on an image decoder, and wherein the image encoder or the image decoder is trained based on a feature-matching loss over activations of a pre-trained object-recognition network comparing between reconstructed object crops and ground-truth object crops.
15. The method of claim 1, wherein the text input comprises a description of an unusual scene not existing in reality, and wherein the generated image depicts the unusual scene not existing in reality.
16. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generate, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generate, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generate, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
17. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: access a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generate, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generate, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generate, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
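Claims 5 to 7 describe combining conditional and unconditional logit scores into guided logit scores, but the claim text does not state how the two are combined. The sketch below assumes the linear extrapolation commonly used for classifier-free guidance, `guided = uncond + scale * (cond - uncond)`. That formula, the `scale` parameter, the function names, and the example numbers are all assumptions made for illustration, not details taken from the claims.

```python
import math
from typing import List


def guided_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Combine per-token logits from the two streams.

    Assumed form (not stated in the claims): move away from the
    unconditional logits in the direction of the conditional ones.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(cond: List[float], uncond: List[float], scale: float) -> int:
    """Return the index of the most probable image token after guidance."""
    probs = softmax(guided_logits(cond, uncond, scale))
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits over a three-token image vocabulary.
cond_logits = [2.0, 1.0, 0.5]    # from the stream conditioned on the text input
uncond_logits = [2.5, 0.0, 0.5]  # from the stream of padding tokens

print(pick_token(cond_logits, uncond_logits, scale=1.0))  # scale 1 reduces to the conditional logits
print(pick_token(cond_logits, uncond_logits, scale=3.0))  # larger scale favors tokens the text supports
```

With these example numbers, a scale of 1 reproduces the conditional logits and selects token 0, while a scale of 3 selects token 1, the token whose score rises most when the text condition is present.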
using tickets or tokens, e.g. Kerberos (network architectures or network communication protocols for entities authentication using tickets in a packet data network H04L63/0807) · CPC title
Face · CPC title
Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title
using neural networks · CPC title
Edge detection · CPC title