What technology area does this patent fall under?

Primary CPC classification G06V10/82. Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 2, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Systems and methods for training machine-learned visual attention models

US12406487B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12406487-B2
Application number	US-202018006078-A
Country	US
Kind code	B2
Filing date	Aug 3, 2020
Priority date	Aug 3, 2020
Publication date	Sep 2, 2025
Grant date	Sep 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods of the present disclosure are directed to a method for training a machine-learned visual attention model. The method can include obtaining image data that depicts a head of a person and an additional entity. The method can include processing the image data with an encoder portion of the visual attention model to obtain latent head and entity encodings. The method can include processing the latent encodings with the visual attention model to obtain a visual attention value and processing the latent encodings with a machine-learned visual location model to obtain a visual location estimation. The method can include training the models by evaluating a loss function that evaluates differences between the visual location estimation and a pseudo visual location label derived from the image data and between the visual attention value and a ground truth visual attention label.
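The two-term loss described in the abstract can be illustrated with a minimal sketch. The abstract only says the loss evaluates two differences; the specific choices below (binary cross-entropy for the attention term, mean squared error for the location term, and a `location_weight` factor to balance them) are assumptions made for illustration, and every function name is hypothetical rather than taken from the patent.

```python
import math
from typing import Sequence


def attention_loss(attention_value: float, attention_label: int,
                   eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability and a 0/1 label.

    Assumption: the patent does not name the loss form; BCE is one common choice.
    """
    p = min(max(attention_value, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(attention_label * math.log(p)
             + (1 - attention_label) * math.log(1.0 - p))


def location_loss(estimate: Sequence[float],
                  pseudo_label: Sequence[float]) -> float:
    """Mean squared error between a 3D location estimate and a pseudo label.

    Assumption: MSE stands in for the unspecified "difference".
    """
    if len(estimate) != len(pseudo_label):
        raise ValueError("estimate and pseudo_label must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimate, pseudo_label)) / len(estimate)


def combined_loss(attention_value: float, attention_label: int,
                  location_estimate: Sequence[float],
                  pseudo_location_label: Sequence[float],
                  location_weight: float = 1.0) -> float:
    """Sum of the two terms; location_weight is a hypothetical balancing factor."""
    return (attention_loss(attention_value, attention_label)
            + location_weight * location_loss(location_estimate,
                                              pseudo_location_label))
```

In a real training loop both models' parameters would be updated from the gradient of this combined value, which is how the location term can act as an auxiliary signal for the attention model.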

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for training a machine-learned visual attention model, the method comprising: obtaining, by a computing system comprising one or more computing devices, image data and an associated ground truth visual attention label, wherein the image data depicts at least a head of a person and an additional entity; processing, by the computing system, the image data with an encoder portion of the machine-learned visual attention model to obtain a latent head encoding and a latent entity encoding; processing, by the computing system, the latent head encoding and the latent entity encoding with the machine-learned visual attention model to obtain a visual attention value indicative of whether a visual attention of the person is focused on the additional entity; processing, by the computing system, the latent head encoding and the latent entity encoding with a machine-learned three-dimensional visual location model to obtain a three-dimensional visual location estimation, wherein the three-dimensional visual location estimation comprises an estimated three-dimensional spatial location of the visual attention of the person; evaluating, by the computing system, a loss function that evaluates a difference between the three-dimensional visual location estimation and a pseudo visual location label derived from the image data and a difference between the visual attention value and the ground truth visual attention label; and respectively adjusting, by the computing system, one or more parameters of the machine-learned visual attention model and the machine-learned three-dimensional visual location model based at least in part on the loss function.

2. The computer-implemented method of claim 1, wherein: the head of the person and the additional entity are respectively defined within the image data by a head bounding box and an entity bounding box; obtaining, by the computing system, the image data further comprises generating, by the computing system, a spatial encoding feature vector based at least in part on a plurality of image data characteristics of the image data, wherein the spatial encoding feature vector comprises a two-dimensional spatial encoding and a three-dimensional spatial encoding; and the spatial encoding feature vector is input alongside the latent space head encoding and the latent space entity encoding to the machine-learned visual attention model to obtain the visual attention value.

3. The computer-implemented method of claim 2, wherein: the two-dimensional spatial encoding describes one or more of the plurality of image data characteristics; and the plurality of image data characteristics comprise: respective two-dimensional location coordinates within the image data for each of the head bounding box and the entity bounding box; and a height value and a width value of the image data.

4. The computer-implemented method of claim 2, wherein: the plurality of image data characteristics comprise: respective two-dimensional location coordinates within the image data for each of the head bounding box and the entity bounding box; an estimated camera focal length corresponding to the image data, wherein the estimated camera focal length is based at least in part on a height value and a width value of the image data; respective depth estimates for each of the head of the person and the entity, wherein the respective estimated depths are based at least in part on the estimated camera focal length; and the three-dimensional spatial encoding describes a pseudo three-dimensional relative position of both the head of the person and the additional entity.

5. The computer-implemented method of claim 4, wherein the pseudo visual location label is based at least in part on the three-dimensional spatial encoding.

6. The computer-implemented method of claim 1, wherein the additional entity comprises at least a portion of: an object; a person; a direction; a machine-readable visual encoding; a surface; or a space.

7. The computer-implemented method of claim 1, wherein: the additional entity comprises a head of a second person; and the visual attention value is indicative of whether both the visual attention of the person is focused on the head of the second person and a visual attention of the second person is focused on the head of the person.

8. The computer-implemented method of claim 7, wherein the three-dimensional visual location estimation comprises the estimated three-dimensional spatial location of the visual attention of the person and an estimated three-dimensional spatial location of the visual attention of the second person.

9. The computer-implemented method of claim 1, wherein the visual attention value is a binary value.

10. The computer-implemented method of claim 1, wherein at least one of the machine-learned visual attention model or the machine-learned three-dimensional visual location model comprises one or more convolutional neural networks.

11. The computer-implemented method of claim 1, wherein: the additional entity comprises a head of a second person; and the method further comprises: obtaining, by the computing system, second image data depicting at least a third head of a third person and a fourth head of a fourth person; processing, by the computing system, the second image data with the machine-learned visual attention model to obtain a second visual attention value, wherein the second visual attention value is indicative of whether both a visual attention of the third person is focused on the fourth person and a visual attention of the fourth person is focused on the third person; and determining, by the computing system based at least in part on the visual attention value, that the third person and the fourth person are looking at each other.

12. A computing system for visual attention tasks, comprising: one or more processors; a machine-learned visual attention model, the machine-learned visual attention model configured to: receive image data depicting at least a head of a person and an additional entity; and generate, based on the image data, a visual attention value, wherein the visual attention value is indicative of whether a visual attention of the person is focused on the additional entity; and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by the one or more processors cause the one or more processors to perform operations, the operations comprising: obtaining image data that depicts the at least the head of the person and the additional entity; processing the image data with the machine-learned visual attention model to obtain the visual attention value indicative of whether the visual attention of the person is focused on the additional entity, wherein the machine-learned visual attention model is trained based at least in part on an output of a machine-learned three-dimensional visual location model, wherein the output of the machine-learned three-dimensional visual location model comprises an estimated three-dimensional spatial location of the visual attention of at least the person; and determining, based at least in part on the visual attention value, whether the person is looking at the additional entity.

13. The computing system of claim 12, wherein the additional entity comprises: an object; a person; a direction; a machine-readable visual encoding a surface; or a space.

14. The computing system of claim 12, wherein determining, based at least in part on the visual attention value, whether the person is
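Claims 3 to 5 describe deriving a pseudo three-dimensional relative position from bounding boxes, image size, an estimated focal length, and depth estimates. The sketch below shows one way such a computation could look under a standard pinhole camera model. The claims do not give formulas, so everything here is an assumption: the focal length heuristic (larger image side, in pixels), the assumed physical sizes used to infer depth, and all names (`Box`, `estimate_focal_length`, `pinhole_depth`, `back_project`, `relative_position_3d`) are hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def estimate_focal_length(width: float, height: float) -> float:
    # Assumption: a rough heuristic, focal length in pixels ~ larger image side.
    return float(max(width, height))


def pinhole_depth(focal_px: float, assumed_size_m: float, size_px: float) -> float:
    # Pinhole model: an object of known real size appears smaller when farther away.
    return focal_px * assumed_size_m / size_px


def back_project(box: Box, depth: float, focal_px: float,
                 width: float, height: float) -> Tuple[float, float, float]:
    # Map the box centre to camera coordinates, taking the image centre
    # as the principal point (an assumption).
    u = (box.x_min + box.x_max) / 2.0
    v = (box.y_min + box.y_max) / 2.0
    cx, cy = width / 2.0, height / 2.0
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)


def relative_position_3d(head: Box, entity: Box, width: float, height: float,
                         head_size_m: float = 0.25,
                         entity_size_m: float = 0.25) -> Tuple[float, float, float]:
    """Pseudo 3D offset from the head to the entity, in metres.

    The default physical sizes are illustrative placeholders, not patent values.
    """
    f = estimate_focal_length(width, height)
    head_depth = pinhole_depth(f, head_size_m, head.y_max - head.y_min)
    entity_depth = pinhole_depth(f, entity_size_m, entity.y_max - entity.y_min)
    h = back_project(head, head_depth, f, width, height)
    e = back_project(entity, entity_depth, f, width, height)
    return (e[0] - h[0], e[1] - h[1], e[2] - h[2])
```

A vector like the one returned by `relative_position_3d` is the kind of quantity that could serve as a pseudo visual location label when no true 3D annotation exists, which matches the role claim 5 gives to the three-dimensional spatial encoding.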

Assignees

Google LLC

Inventors

Classifications

G06V10/776: Validation; Performance evaluation
G06F18/24: Classification techniques
G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
G06V10/82 (primary): using neural networks
G06V10/25 (primary): Determination of region of interest [ROI] or a volume of interest [VOI]

Patent family

Related publications grouped by family.

View patent family 72148227

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12406487B2 cover? Systems and methods of the present disclosure are directed to a method for training a machine-learned visual attention model. The method can include obtaining image data that depicts a head of a person and an additional entity. The method can include processing the image data with an encoder portion of the visual attention model to obtain latent head and entity encodings. The method can include…
Who is the assignee on this patent? Google LLC