Image generation using one or more neural networks
US-2022012568-A1 · Jan 13, 2022 · US
US11854203B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11854203-B1 |
| Application number | US-202017127399-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 18, 2020 |
| Priority date | Dec 18, 2020 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
In one embodiment, a method includes receiving a first image depicting a context including one or more persons having one or more respective poses, receiving a second image depicting a target person having an original pose, where the target person is to be inserted into the context depicted in the first image, generating a target segmentation mask specifying a new pose for the target person in the context of the first image based on the first image, generating a third image depicting the target person having the new pose based on the second image and the target segmentation mask, and generating an output image based on the first image and the third image, the output image depicting the one or more persons having the one or more respective poses and the target person having the new pose.
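The data flow in the abstract can be sketched as three stages chained together. This is only an illustrative skeleton: the function name `insert_person` and the three callables passed into it are hypothetical stand-ins for the models the patent describes, not anything named in the publication, and the stub lambdas simply trace which input reaches which stage.

```python
from typing import Any, Callable


def insert_person(
    context_image: Any,
    target_image: Any,
    predict_target_mask: Callable[[Any], Any],   # hypothetical: first image -> target segmentation mask
    render_person: Callable[[Any, Any], Any],    # hypothetical: (second image, mask) -> third image
    compose: Callable[[Any, Any], Any],          # hypothetical: (first image, third image) -> output image
) -> Any:
    """Chain the three stages named in the abstract, in order."""
    # Stage 1: the new pose is derived from the context (first) image alone.
    target_mask = predict_target_mask(context_image)
    # Stage 2: the target person is re-rendered in that pose from the second image.
    third_image = render_person(target_image, target_mask)
    # Stage 3: the re-posed person is combined with the original context.
    return compose(context_image, third_image)


# Stub stages that only record the call structure (assumption: strings stand in for images).
result = insert_person(
    "context",
    "target",
    predict_target_mask=lambda ctx: f"mask({ctx})",
    render_person=lambda tgt, mask: f"render({tgt}, {mask})",
    compose=lambda ctx, person: f"compose({ctx}, {person})",
)
print(result)  # compose(context, render(target, mask(context)))
```

Passing the stages in as callables keeps the sketch independent of any particular model architecture, which the abstract itself does not fix.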
Opening claim text (preview).
What is claimed is:
1. A method comprising, by a computing device: receiving a first image depicting a context comprising one or more persons having one or more respective poses; receiving a second image depicting a target person having an original pose, wherein the target person is to be inserted into the context depicted in the first image; generating, based on the first image, a target segmentation mask specifying a new pose for the target person in the context of the first image; generating, based on the second image and the target segmentation mask, a third image depicting the target person having the new pose; and generating an output image based on the first image and the third image, the output image depicting the one or more persons having the one or more respective poses and the target person having the new pose.
2. The method of claim 1, wherein generating the target segmentation mask comprises: generating a source segmentation mask specifying the one or more respective poses of the one or more persons using one or more pre-trained machine-learning models; and processing the source segmentation mask with a first machine-learning model.
3. The method of claim 2, wherein a segmentation mask comprises a semantic pose map channel and a face channel.
4. The method of claim 3, wherein the semantic pose map channel comprises n labels corresponding to n segment groups, wherein n segment groups comprise background, hair, face, torso, upper limbs, upper-body wear, lower-body wear, lower limbs, shoes, or any other suitable segment group.
5. The method of claim 3, wherein the face channel is extracted based on convex hulls over detected facial key-points for faces in an image, and wherein the face channel is a binary representation.
6. The method of claim 2, wherein information regarding a bounding box is also provided to the first machine-learning model, wherein the bounding box indicates an area in the first image to which the target person is to be added, and wherein the bounding box is determined by a user.
7. The method of claim 2, wherein the first machine-learning model is trained with a set of training data, wherein each training data comprises a training source image and a training ground truth image.
8. The method of claim 7, wherein the set of training data is prepared by: collecting a plurality of training ground truth images, each training ground truth image comprising two or more persons; and generating, for each training ground truth image, a training source image by removing one of the two or more persons.
9. The method of claim 8, wherein, during a training process of the first machine-learning model, trainable variables of the first machine-learning model are updated based on a comparison of a first target segmentation mask generated by the first machine-learning model based on a training source image and a second target segmentation mask computed from a corresponding training ground truth image.
10. The method of claim 1, wherein generating the third image comprises: segmenting the target person having the original pose in the second image into k segment classes such that each segment class is captured in a sub-image; generating a latent representation by processing the k sub-images with an encoder of a second machine-learning model; and generating the third image by processing the latent representation and the target segmentation mask by a decoder of the second machine-learning model.
11. The method of claim 10, wherein k segment classes comprise hair, face, upper-body wear, lower-body-wear, skin, shoes, or any other suitable segment class.
12. The method of claim 10, wherein the decoder of the second machine-learning model comprises a plurality of up-sample layers with interleaving segmentation mask input layers, and wherein each of the segmentation mask input layers takes the target segmentation mask as an input.
13. The method of claim 12, wherein the interleaving segmentation mask input layers are SPADE blocks.
14. The method of claim 10, wherein the decoder of the second machine-learning model also produces a first blending mask, wherein the first blending mask is a binary representation indicating an area in the output image that is to be filled by the target person in the third image.
15. The method of claim 14, wherein generating the output image comprises compositing the first image multiplied by an inverse of the first blending mask and the third image multiplied by the first blending mask.
16. The method of claim 1, further comprising: generating a first encoding vector corresponding to a face of the target person having an expression in the context of the first image by processing a face crop of the target person from the output image with an encoder of a third machine-learning model; generating a second encoding vector representing face features of the target person by processing the second image with a pre-trained machine-learning model; generating a temporary image comprising a refined face of the target person by processing the first encoding vector and the second encoding vector with a decoder of the third machine-learning model; and blending the generated refined face into the output image.
17. The method of claim 16, wherein the refined face has the face features of the target person in the second image and the expression of the face of the target person in the output image.
18. The method of claim 16, wherein the decoder of the third machine-learning model also produces a second blending mask, wherein the second blending mask represents a blending weight to be applied to the temporary image at each pixel of the output image, and wherein blending the generated refined face into the output image comprises: multiplying an inverse of the second blending mask to the output image; and projecting the temporary image multiplied by the second blending mask to the output image.
19. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: receive a first image depicting a context comprising one or more persons having one or more respective poses; receive a second image depicting a target person having an original pose, wherein the target person is to be inserted into the context depicted in the first image; generate, based on the first image, a target segmentation mask specifying a new pose for the target person in the context of the first image; generate, based on the second image and the target segmentation mask, a third image depicting the target person having the new pose; and generate an output image based on the first image and the third image, the output image depicting the one or more persons having the one or more respective poses and the target person having the new pose.
20. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive a first image depicting a context comprising one or more persons having one or more respective poses; receive a second image depicting a target person having an original pose, wherein the target person is to be inserted into the context depicted in the first image; generate, based on the first image, a target segmentation mask specifying a new pose for the target person in the context of the first image; generate, based on the second image and the target segmentation mask, a third image de
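Claims 15 and 18 both describe a per-pixel blend of the form `background * (1 - mask) + foreground * mask`, once with a binary mask and once with a per-pixel blending weight. The following is a minimal sketch of that arithmetic, reading "inverse of the mask" as `1 - mask` (an assumption; the claims do not spell out the formula). The `composite` function, the single-channel nested-list image type, and the sample pixel values are all illustrative and not taken from the patent.

```python
from typing import List

# Assumption: a single-channel image stored as an H x W grid of floats in [0, 1].
Image = List[List[float]]


def composite(background: Image, foreground: Image, mask: Image) -> Image:
    """Blend per pixel: background * (1 - mask) + foreground * mask."""
    if not (len(background) == len(foreground) == len(mask)):
        raise ValueError("all inputs must have the same height")
    blended: Image = []
    for bg_row, fg_row, m_row in zip(background, foreground, mask):
        if not (len(bg_row) == len(fg_row) == len(m_row)):
            raise ValueError("all inputs must have the same width")
        blended.append(
            [bg * (1.0 - m) + fg * m for bg, fg, m in zip(bg_row, fg_row, m_row)]
        )
    return blended


# Claim 15 style: a binary mask selects where the re-posed person replaces the context.
first_image = [[0.2, 0.2], [0.2, 0.2]]   # context
third_image = [[0.9, 0.9], [0.9, 0.9]]   # target person in the new pose
binary_mask = [[0.0, 1.0], [0.0, 1.0]]   # 1.0 marks pixels filled by the person
output_image = composite(first_image, third_image, binary_mask)

# Claim 18 style: a soft mask gives a per-pixel weight for the refined face.
temporary_image = [[0.5, 0.5], [0.5, 0.5]]  # refined face
soft_mask = [[0.0, 0.5], [1.0, 0.25]]       # fractional blending weights
refined_output = composite(output_image, temporary_image, soft_mask)
```

The same function serves both claims because a binary mask is just the special case where every weight is 0 or 1. A production implementation would more likely use an array library and multi-channel images; nested lists are used here only to keep the sketch dependency-free.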