Natural language image editing annotation framework
US-2019278844-A1 · Sep 12, 2019 · US
US11195048B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11195048-B2 |
| Application number | US-202016750478-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 23, 2020 |
| Priority date | Jan 23, 2020 |
| Publication date | Dec 7, 2021 |
| Grant date | Dec 7, 2021 |
Official abstract text for this publication.
In implementations of generating descriptions of image relationships, a computing device implements a description system which receives a source digital image and a target digital image. The description system generates a source feature sequence from the source digital image and a target feature sequence from the target digital image. A visual relationship between the source digital image and the target digital image is determined by using cross-attention between the source feature sequence and the target feature sequence. The system generates a description of a visual transformation between the source digital image and the target digital image based on the visual relationship.
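The cross-attention step in the abstract can be illustrated with a minimal sketch. It assumes scaled dot-product attention and tiny hand-written feature vectors. The abstract does not specify the attention formula, so the scoring function, the helper names (`softmax`, `dot`, `cross_attention`), and the example data are all hypothetical and not taken from the patent.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys):
    """For each query feature, return an attention-weighted average of the key features.

    Assumption: scaled dot-product scoring; the patent abstract does not name a formula.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        attended.append(context)
    return attended

# Hypothetical feature sequences: each inner list stands for one image portion.
source_seq = [[1.0, 0.0], [0.0, 1.0]]
target_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each source feature is re-expressed as a mixture of the target features it matches best.
src_to_tgt = cross_attention(source_seq, target_seq)
```

In this sketch the output has one vector per source feature, so a downstream decoder could compare each source portion with the target content it aligns to.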
Opening claim text (preview).
What is claimed is:
1. In a digital medium environment to generate a description of a visual transformation between a source digital image and a target digital image, a method implemented by a computing device, the method comprising: receiving, by the computing device, the source digital image and the target digital image; generating, by the computing device, a source feature sequence from the source digital image and a target feature sequence from the target digital image, features of the source feature sequence each represent a portion of the source digital image and features of the target feature sequence each represent a portion of the target digital image; determining, by the computing device, a visual relationship between the source digital image and the target digital image using cross-attention between the features of the source feature sequence and the features of the target feature sequence; and generating, by the computing device for display in a user interface, the description of the visual transformation based on the visual relationship as including a difference between an environment scene at a first point in time and the environment scene at a second point in time.
2. The method as described in claim 1, wherein the description of the visual transformation is generated as text.
3. The method as described in claim 1, wherein the visual transformation includes an image editing operation.
4. The method as described in claim 1, further comprising captioning the source digital image and the target digital image with the description of the visual transformation.
5. The method as described in claim 1, wherein the description of the visual transformation includes a natural language image editing instruction.
6. The method as described in claim 1, wherein the description of the visual transformation includes a description of objects depicted in the source digital image or the target digital image.
7. The method as described in claim 1, wherein the target digital image includes an object that is excluded from the source digital image.
8. The method as described in claim 1, wherein the target digital image excludes an object that is included in the source digital image.
9. The method as described in claim 1, wherein determining the visual relationship between the source digital image and the target digital image includes concatenating the source feature sequence and the target feature sequence into a single feature sequence.
10. In a digital medium environment to generate a description of a visual transformation between a source digital image and a target digital image, a system comprising: a feature module implemented at least partially in hardware of a computing device to: receive the source digital image and the target digital image; and generate a source feature sequence from the source digital image and a target feature sequence from the target digital image, features of the source feature sequence each represent a portion of the source digital image and features of the target feature sequence each represent a portion of the target digital image; a relationship module implemented at least partially in the hardware of the computing device to determine a visual relationship between the source digital image and the target digital image using cross-attention between the features of the source feature sequence and the features of the target feature sequence; and a rendering module implemented at least partially in the hardware of the computing device to generate, for display in a user interface of a display device, the description of the visual transformation based on the visual relationship as including a difference between an environment scene at a first point in time and the environment scene at a second point in time.
11. The system as described in claim 10, wherein the visual transformation includes an image editing operation.
12. The system as described in claim 10, wherein the description of the visual transformation includes a natural language image editing instruction.
13. The system as described in claim 10, wherein the description of the visual transformation includes a description of an object depicted in the source digital image or the target digital image.
14. The system as described in claim 10, wherein the relationship module includes a Long Short-Term Memory (LSTM) decoder.
15. One or more non-transitory computer-readable storage media comprising instructions stored thereon that, responsive to execution by a computing device in a digital medium environment to generate a description of a visual transformation between a source digital image and a target digital image, cause operations of the computing device including: generating a source feature sequence from a source feature map extracted from the source digital image and generating a target feature sequence from a target feature map extracted from the target digital image, features of the source feature sequence each represent a portion of the source digital image and features of the target feature sequence each represent a portion of the target digital image; determining a visual relationship between the source digital image and the target digital image using cross-attention between the features of the source feature sequence and the features of the target feature sequence and cross-attention between the features of the target feature sequence and the features of the source feature sequence; and generating, for display in a user interface of a display device, the description of the visual transformation based on the visual relationship as including a difference between an environment scene at a first point in time and the environment scene at a second point in time.
16. The one or more non-transitory computer-readable storage media as described in claim 15, the operations of the computing device further including captioning the source digital image and the target digital image with the description of the visual transformation.
17. The one or more non-transitory computer-readable storage media as described in claim 15, wherein the description of the visual transformation includes an image editing instruction.
18. The one or more non-transitory computer-readable storage media as described in claim 15, wherein the description of the visual transformation includes a description of an object depicted in the source digital image or the target digital image.
19. The one or more non-transitory computer-readable storage media as described in claim 15, wherein the target digital image includes an object that is excluded from the source digital image.
20. The one or more non-transitory computer-readable storage media as described in claim 15, wherein the target digital image excludes an object that is included in the source digital image.
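Claims 9 and 15 mention three operations that a short sketch may make more concrete: turning a feature map into a feature sequence, applying cross-attention in both directions, and concatenating the two sequences into one. The code below is only an illustration under stated assumptions: the row-by-row flattening order, the scaled dot-product scoring, the function names (`flatten_feature_map`, `attend`), and the 2 x 2 example maps are hypothetical and are not drawn from the claims.

```python
import math

def flatten_feature_map(feature_map):
    """Turn an H x W grid of feature vectors into a sequence of H*W features.

    Assumption: row-by-row order; the claims do not state an ordering.
    """
    return [cell for row in feature_map for cell in row]

def attend(queries, keys):
    """Scaled dot-product attention (an assumed formula): each query
    receives a softmax-weighted average of the key features."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * k[i] for e, k in zip(exps, keys)) for i in range(dim)])
    return out

# Hypothetical 2 x 2 feature maps with 2-dimensional features per grid cell.
source_map = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 1.0], [0.0, 0.0]]]
target_map = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 1.0], [1.0, 1.0]]]

source_seq = flatten_feature_map(source_map)
target_seq = flatten_feature_map(target_map)

# Claim 15 recites attention in both directions.
src_attends_tgt = attend(source_seq, target_seq)
tgt_attends_src = attend(target_seq, source_seq)

# Claim 9 recites concatenating both sequences into a single feature sequence.
combined_seq = source_seq + target_seq
```

Each grid cell becomes one sequence element here, which matches the claim language that each feature represents a portion of its image.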
Natural language generation · CPC title
in augmented reality scenes · CPC title
using neural networks · CPC title
Matching criteria, e.g. proximity measures · CPC title
Text processing (natural language analysis G06F40/20; semantic analysis G06F40/30; processing or translation of natural language G06F40/40) · CPC title