Method and system for latent-space facial feature editing in deep learning based face swapping
US-12277738-B2 · Apr 15, 2025 · US
US12406487B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12406487-B2 |
| Application number | US-202018006078-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 3, 2020 |
| Priority date | Aug 3, 2020 |
| Publication date | Sep 2, 2025 |
| Grant date | Sep 2, 2025 |
Official abstract text for this publication.
Systems and methods of the present disclosure are directed to a method for training a machine-learned visual attention model. The method can include obtaining image data that depicts a head of a person and an additional entity. The method can include processing the image data with an encoder portion of the visual attention model to obtain latent head and entity encodings. The method can include processing the latent encodings with the visual attention model to obtain a visual attention value and processing the latent encodings with a machine-learned visual location model to obtain a visual location estimation. The method can include training the models by evaluating a loss function that evaluates differences between the visual location estimation and a pseudo visual location label derived from the image data and between the visual attention value and a ground truth visual attention label.
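The two-term training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the form of either term, so everything concrete here is an assumption: binary cross-entropy for the attention term, squared error for the location term, and the hypothetical `location_weight` parameter that balances them. All function names are illustrative and are not taken from the patent.

```python
import math


def binary_cross_entropy(predicted, label, eps=1e-7):
    """Assumed attention term: BCE between the predicted value and the ground truth label."""
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def squared_error(estimate, pseudo_label):
    """Assumed location term: squared distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(estimate, pseudo_label))


def combined_loss(attention_value, attention_label,
                  location_estimate, pseudo_location_label,
                  location_weight=1.0):
    """Sum of the attention term and the weighted location term.

    location_weight is a hypothetical hyperparameter, not named in the source.
    """
    attention_term = binary_cross_entropy(attention_value, attention_label)
    location_term = squared_error(location_estimate, pseudo_location_label)
    return attention_term + location_weight * location_term


if __name__ == "__main__":
    loss = combined_loss(
        attention_value=0.8,
        attention_label=1.0,
        location_estimate=(0.1, 0.0, 2.1),
        pseudo_location_label=(0.0, 0.0, 2.0),
    )
    print(f"combined loss: {loss:.4f}")
```

In a real training loop the parameters of both models would be adjusted from the gradient of this scalar; the sketch only shows how the two differences named in the abstract could be combined into one value.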
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for training a machine-learned visual attention model, the method comprising: obtaining, by a computing system comprising one or more computing devices, image data and an associated ground truth visual attention label, wherein the image data depicts at least a head of a person and an additional entity; processing, by the computing system, the image data with an encoder portion of the machine-learned visual attention model to obtain a latent head encoding and a latent entity encoding; processing, by the computing system, the latent head encoding and the latent entity encoding with the machine-learned visual attention model to obtain a visual attention value indicative of whether a visual attention of the person is focused on the additional entity; processing, by the computing system, the latent head encoding and the latent entity encoding with a machine-learned three-dimensional visual location model to obtain a three-dimensional visual location estimation, wherein the three-dimensional visual location estimation comprises an estimated three-dimensional spatial location of the visual attention of the person; evaluating, by the computing system, a loss function that evaluates a difference between the three-dimensional visual location estimation and a pseudo visual location label derived from the image data and a difference between the visual attention value and the ground truth visual attention label; and respectively adjusting, by the computing system, one or more parameters of the machine-learned visual attention model and the machine-learned three-dimensional visual location model based at least in part on the loss function.
2. The computer-implemented method of claim 1, wherein: the head of the person and the additional entity are respectively defined within the image data by a head bounding box and an entity bounding box; obtaining, by the computing system, the image data further comprises generating, by the computing system, a spatial encoding feature vector based at least in part on a plurality of image data characteristics of the image data, wherein the spatial encoding feature vector comprises a two-dimensional spatial encoding and a three-dimensional spatial encoding; and the spatial encoding feature vector is input alongside the latent space head encoding and the latent space entity encoding to the machine-learned visual attention model to obtain the visual attention value.
3. The computer-implemented method of claim 2, wherein: the two-dimensional spatial encoding describes one or more of the plurality of image data characteristics; and the plurality of image data characteristics comprise: respective two-dimensional location coordinates within the image data for each of the head bounding box and the entity bounding box; and a height value and a width value of the image data.
4. The computer-implemented method of claim 2, wherein: the plurality of image data characteristics comprise: respective two-dimensional location coordinates within the image data for each of the head bounding box and the entity bounding box; an estimated camera focal length corresponding to the image data, wherein the estimated camera focal length is based at least in part on a height value and a width value of the image data; respective depth estimates for each of the head of the person and the entity, wherein the respective estimated depths are based at least in part on the estimated camera focal length; and the three-dimensional spatial encoding describes a pseudo three-dimensional relative position of both the head of the person and the additional entity.
5. The computer-implemented method of claim 4, wherein the pseudo visual location label is based at least in part on the three-dimensional spatial encoding.
6. The computer-implemented method of claim 1, wherein the additional entity comprises at least a portion of: an object; a person; a direction; a machine-readable visual encoding; a surface; or a space.
7. The computer-implemented method of claim 1, wherein: the additional entity comprises a head of a second person; and the visual attention value is indicative of whether both the visual attention of the person is focused on the head of the second person and a visual attention of the second person is focused on the head of the person.
8. The computer-implemented method of claim 7, wherein the three-dimensional visual location estimation comprises the estimated three-dimensional spatial location of the visual attention of the person and an estimated three-dimensional spatial location of the visual attention of the second person.
9. The computer-implemented method of claim 1, wherein the visual attention value is a binary value.
10. The computer-implemented method of claim 1, wherein at least one of the machine-learned visual attention model or the machine-learned three-dimensional visual location model comprises one or more convolutional neural networks.
11. The computer-implemented method of claim 1, wherein: the additional entity comprises a head of a second person; and the method further comprises: obtaining, by the computing system, second image data depicting at least a third head of a third person and a fourth head of a fourth person; processing, by the computing system, the second image data with the machine-learned visual attention model to obtain a second visual attention value, wherein the second visual attention value is indicative of whether both a visual attention of the third person is focused on the fourth person and a visual attention of the fourth person is focused on the third person; and determining, by the computing system based at least in part on the visual attention value, that the third person and the fourth person are looking at each other.
12. A computing system for visual attention tasks, comprising: one or more processors; a machine-learned visual attention model, the machine-learned visual attention model configured to: receive image data depicting at least a head of a person and an additional entity; and generate, based on the image data, a visual attention value, wherein the visual attention value is indicative of whether a visual attention of the person is focused on the additional entity; and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the one or more processors to perform operations, the operations comprising: obtaining image data that depicts the at least the head of the person and the additional entity; processing the image data with the machine-learned visual attention model to obtain the visual attention value indicative of whether the visual attention of the person is focused on the additional entity, wherein the machine-learned visual attention model is trained based at least in part on an output of a machine-learned three-dimensional visual location model, wherein the output of the machine-learned three-dimensional visual location model comprises an estimated three-dimensional spatial location of the visual attention of at least the person; and determining, based at least in part on the visual attention value, whether the person is looking at the additional entity.
13. The computing system of claim 12, wherein the additional entity comprises: an object; a person; a direction; a machine-readable visual encoding a surface; or a space.
14. The computing system of claim 12, wherein determining, based at least in part on the visual attention value, whether the person is
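Claims 4 and 5 describe deriving a pseudo three-dimensional relative position from bounding boxes, an estimated focal length, and depth estimates, without giving formulas. The sketch below shows one way such a quantity could be computed, and every formula in it is an assumption rather than the claimed method: the focal length is approximated by the larger image side, depth follows a pinhole-camera relation with a hypothetical assumed physical size per box, and the principal point is taken to be the image center. The names `Box`, `estimate_focal_length`, `estimate_depth`, `back_project`, and `pseudo_relative_position` are illustrative only.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def estimate_focal_length(width: float, height: float) -> float:
    # Assumption: focal length in pixels is approximated by the larger image side.
    return float(max(width, height))


def estimate_depth(box: Box, focal_length: float, assumed_size: float) -> float:
    # Assumption: pinhole relation, depth = focal_length * physical_size / pixel_size.
    pixel_size = max(box.y_max - box.y_min, 1e-6)
    return focal_length * assumed_size / pixel_size


def back_project(box: Box, depth: float, focal_length: float,
                 width: float, height: float) -> Tuple[float, float, float]:
    # Back-project the box center, assuming the principal point is the image center.
    u = (box.x_min + box.x_max) / 2.0
    v = (box.y_min + box.y_max) / 2.0
    x = (u - width / 2.0) * depth / focal_length
    y = (v - height / 2.0) * depth / focal_length
    return (x, y, depth)


def pseudo_relative_position(head_box: Box, entity_box: Box,
                             width: float, height: float,
                             head_size: float = 0.2,
                             entity_size: float = 0.2) -> Tuple[float, ...]:
    """Entity position minus head position, in the assumed camera frame.

    head_size and entity_size are hypothetical physical heights in meters.
    """
    f = estimate_focal_length(width, height)
    head = back_project(head_box, estimate_depth(head_box, f, head_size), f, width, height)
    entity = back_project(entity_box, estimate_depth(entity_box, f, entity_size), f, width, height)
    return tuple(e - h for e, h in zip(entity, head))


if __name__ == "__main__":
    offset = pseudo_relative_position(
        Box(400, 200, 500, 300), Box(600, 200, 700, 300), width=1000, height=500
    )
    print(offset)
```

A vector of this kind could serve as the three-dimensional spatial encoding of claim 4 and, per claim 5, as the basis for a pseudo visual location label, though the patent's actual derivation may differ.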
Validation; Performance evaluation · CPC title
Classification techniques · CPC title
Salient features, e.g. scale invariant feature transforms [SIFT] · CPC title
using neural networks · CPC title
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title