Spatio-temporal action and actor localization

US10896342B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10896342-B2
Application number	US-201816189974-A
Country	US
Kind code	B2
Filing date	Nov 13, 2018
Priority date	Nov 14, 2017
Publication date	Jan 19, 2021
Grant date	Jan 19, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of pixel-wise localization of an actor and an action in a sequence of frames includes receiving a natural language query describing the action and the actor. The method also includes receiving the sequence of frames. The method further includes localizing the action and the actor in the sequence of frames based on the natural language query.

First claim

Opening claim text (preview).

What is claimed is:

1. A method of pixel-wise localization of an actor and an action in a sequence of frames, comprising: generating a first set of filters based on a natural language query describing the action and the actor; generating a visual representation for each frame of the sequence of frames; generating a response map for each frame of the sequence of frames based on a convolution of the first set of filters and the visual representation of each frame; generating a second set of dynamic filters based on the natural language query and the response map; labeling pixels in each frame of the sequence of frames based on a convolution of the second set of dynamic filters and an up-sampled visual representation of each frame; and localizing the action and the actor in the sequence of frames based on the labeled pixels.

2. The method of claim 1, further comprising: up-sampling a resolution of the visual representation; and convolving the up-sampled visual representation with the second set of dynamic filters.

3. The method of claim 2, further comprising repeating the up-sampling and the convolving for a set of resolutions of the visual representation.

4. The method of claim 2, in which the first set of filters and the second set of dynamic filters are two-dimensional or three-dimensional filters.

5. The method of claim 1, further comprising controlling an apparatus based on the localized action and actor.

6. An apparatus for pixel-wise localization of an actor and an action in a sequence of frames, the apparatus comprising: means for generating a first set of filters based on a natural language query describing the action and the actor; means for generating a visual representation for each frame of the sequence of frames; means for generating a response map for each frame of the sequence of frames based on a convolution of the first set of filters and the visual representation of each frame; means for generating a second set of dynamic filters based on the natural language query and the response map; means for labeling pixels in each frame of the sequence of frames based on a convolution of the second set of dynamic filters and an up-sampled visual representation of each frame; and means for localizing the action and the actor in the sequence of frames based on the labeled pixels.

7. The apparatus of claim 6, further comprising: means for up-sampling a resolution of the visual representation; and means for convolving the up-sampled visual representation with the second set of dynamic filters.

8. The apparatus of claim 7, further comprising means for repeating up-sampling and convolving for a set of resolutions of the visual representation.

9. The apparatus of claim 7, in which the first set of filters and the second set of dynamic filters are two-dimensional or three-dimensional filters.

10. The apparatus of claim 6, further comprising means for controlling the apparatus based on the localized action and actor.

11. An apparatus for pixel-wise localization of an actor and an action in a sequence of frames, the apparatus comprising: a memory; and at least one processor coupled to the memory, the at least one processor configured: to generate a first set of filters based on a natural language query describing the action and the actor; to generate a visual representation for each frame of the sequence of frames; to generate a response map for each frame of the sequence of frames based on a convolution of the first set of filters and the visual representation of each frame; to generate a second set of dynamic filters based on the natural language query and the response map; to label pixels in each frame of the sequence of frames based on a convolution of the second set of dynamic filters and an up-sampled visual representation of each frame; and to localize the action and the actor in the sequence of frames based on the labeled pixels.

12. The apparatus of claim 11, in which the at least one processor is further configured to: up-sample a resolution of the visual representation; and convolve the up-sampled visual representation with the second set of dynamic filters.

13. The apparatus of claim 12, in which the at least one processor is further configured to up-sample and convolve for a set of resolutions of the visual representation.

14. The apparatus of claim 12, in which the first set of filters and the second set of dynamic filters are two-dimensional or three-dimensional filters.

15. The apparatus of claim 11, in which the at least one processor is further configured to control the apparatus based on the localized action and actor.

16. A non-transitory computer-readable medium having program code recorded thereon for pixel-wise localization of an actor and an action in a sequence of frames, the program code executed by a processor and comprising: program code to generate a first set of filters based on a natural language query describing the action and the actor; program code to generate a visual representation for each frame of the sequence of frames; program code to generate a response map for each frame of the sequence of frames based on a convolution of the first set of filters and the visual representation of each frame; program code to generate a second set of dynamic filters based on the natural language query and the response map; program code to label pixels in each frame of the sequence of frames based on a convolution of the second set of dynamic filters and an up-sampled visual representation of each frame; and program code to localize the action and the actor in the sequence of frames based on the labeled pixels.

17. The non-transitory computer-readable medium of claim 16, in which the program code further comprises: program code to up-sample a resolution of the visual representation; and program code to convolve the up-sampled visual representation with the second set of dynamic filters.

18. The non-transitory computer-readable medium of claim 17, in which the program code further comprises program code to up-sample and program code to convolve for a set of resolutions of the visual representation.

19. The non-transitory computer-readable medium of claim 17, in which the first set of filters and the second set of dynamic filters are two-dimensional or three-dimensional filters.

20. The non-transitory computer-readable medium of claim 16, in which the program code further comprises program code to control an apparatus based on the localized action and actor.
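
The data flow recited in claim 1 can be sketched in a few lines of NumPy. This is a toy illustration under loudly stated assumptions, not the patented implementation: the page does not disclose any encoder, weights, filter sizes, or thresholds, so `embed_query`, `encode_frame`, the random projection matrices, the 1x1 filter shape, and the zero threshold below are all hypothetical stand-ins chosen only to make the sequence of steps concrete.

```python
import numpy as np


def embed_query(query: str, dim: int = 16) -> np.ndarray:
    """Hypothetical text encoder: a normalized bag-of-words hash embedding."""
    vec = np.zeros(dim)
    for word in query.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)


def encode_frame(frame: np.ndarray, stride: int) -> np.ndarray:
    """Hypothetical visual encoder: average-pool an (H, W, C) frame by `stride`.

    H and W must be divisible by `stride`.
    """
    h, w, c = frame.shape
    return frame.reshape(h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))


def conv1x1(features: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """1x1 convolution: (h, w, C) features with (C, K) filters -> (h, w, K)."""
    return features @ filters


def upsample(features: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour up-sampling of an (h, w, C) map."""
    return features.repeat(factor, axis=0).repeat(factor, axis=1)


def localize(frames: np.ndarray, query: str, stride: int = 4, seed: int = 0) -> np.ndarray:
    """Return one boolean pixel mask per frame for a (T, H, W, C) clip."""
    rng = np.random.default_rng(seed)  # random, untrained weights (assumption)
    q = embed_query(query)
    channels = frames.shape[-1]
    w1 = rng.standard_normal((q.size, channels))      # query -> first filters
    w2 = rng.standard_normal((q.size + 1, channels))  # (query, response) -> second filters

    first_filters = (q @ w1)[:, None]                 # first set of filters, shape (C, 1)
    masks = []
    for frame in frames:
        visual = encode_frame(frame, stride)                   # visual representation
        response = conv1x1(visual, first_filters)[..., 0]      # response map, shape (h, w)
        context = np.append(q, response.mean())                # query plus response summary
        second_filters = (context @ w2)[:, None]               # second set of dynamic filters
        logits = conv1x1(upsample(visual, stride), second_filters)[..., 0]
        masks.append(logits > 0)                               # pixel labels
    return np.stack(masks)                                     # localization as masks


frames = np.random.default_rng(1).random((3, 8, 8, 3))  # synthetic 3-frame clip
masks = localize(frames, "person running left")
print(masks.shape)
```

In this sketch the up-sampling and convolution happen once, at full frame resolution. The repetition over a set of resolutions recited in claims 3, 8, 13, and 18 would correspond to calling `upsample` and `conv1x1` in a loop over several factors; how the per-resolution results are combined is not stated on this page, so it is left out here.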

Assignees

Qualcomm Inc

Inventors

Classifications

G06T7/215 (primary) · Motion-based segmentation
G06V40/171 · Local features and components; Facial parts (eye characteristics G06V40/18); Occluding parts, e.g. glasses; Geometrical relationships
G06V40/166 · using acquisition arrangements
G06F16/73 · Querying
G11B27/00 · Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
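
A CPC symbol such as G06T7/215 encodes a hierarchy: section, class, subclass, main group, and subgroup. The short sketch below splits a symbol into those parts and looks up the section title; section G is "Physics" in the CPC scheme, which is likely where the "Physics" technology area mentioned on this page comes from. The regular expression and the names `CpcSymbol` and `parse_cpc` are illustrative assumptions, not part of any official tool, and how this site actually maps codes to technology areas is not documented here.

```python
import re
from typing import NamedTuple

# Titles of CPC sections A to H (section Y is omitted from this sketch).
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


class CpcSymbol(NamedTuple):
    section: str      # e.g. "G"
    cpc_class: str    # e.g. "G06"
    subclass: str     # e.g. "G06T"
    main_group: str   # e.g. "7"
    subgroup: str     # e.g. "215"


# Assumed pattern: section letter, two digits, subclass letter, group/subgroup.
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol into its hierarchical parts."""
    match = _CPC_PATTERN.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognizable CPC symbol: {symbol!r}")
    section, digits, letter, group, subgroup = match.groups()
    return CpcSymbol(section, section + digits, section + digits + letter, group, subgroup)


primary = parse_cpc("G06T7/215")
print(primary, CPC_SECTIONS.get(primary.section, "unknown"))
```

All five codes listed for this patent begin with G, so each resolves to the same section under this lookup.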

Patent family

Related publications grouped by family.

Patent family ID: 66431299

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10896342B2 cover?: A method of pixel-wise localization of an actor and an action in a sequence of frames includes receiving a natural language query describing the action and the actor. The method also includes receiving the sequence of frames. The method further includes localizing the action and the actor in the sequence of frames based on the natural language query.
Who is the assignee on this patent?: Qualcomm Inc
What technology area does this patent fall under?: Primary CPC classification G06T7/215. Mapped technology areas include Physics.
When was this patent published?: Publication date Jan 19, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Natural language object tracking

Category prediction from semantic image clustering

Enhancing user queries using implicit indicators

Action localization in sequential data with attention proposals from a recurrent network

Semantic object tagging through name annotation
