What technology area does this patent fall under?

Primary CPC classification: G06V40/20 (movements or behaviour, e.g. gesture recognition). Mapped technology area: Physics (CPC section G).

When was this patent published?

Publication date: Jun 1, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Automatic recognition of visual and audio-visual cues

US2023169795A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2023169795-A1
Application number	US-202117539652-A
Country	US
Kind code	A1
Filing date	Dec 1, 2021
Priority date	Dec 1, 2021
Publication date	Jun 1, 2023
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for detecting a cue (e.g., a visual cue or a visual cue combined with an audible cue) occurring together in an input video includes: presenting a user interface to record an example video of a user performing an act including the cue; determining a part of the example video where the cue occurs; applying a feature of the part to a neural network to generate a positive embedding; dividing the input video into a plurality of chunks and applying a feature of each chunk to the neural network to generate a plurality of negative embeddings; applying a feature of a given one of the chunks to the neural network to output a query embedding; and determining whether the cue occurs in the input video from the query embedding, the positive embedding, and the negative embeddings.
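The abstract's last step, deciding from the query, positive, and negative embeddings, is spelled out in claim 11 on this page: average the negative embeddings, measure two distances, turn them into a probability, and compare it with a threshold. The Python sketch below illustrates that flow only and is not the patent's implementation. It assumes the embeddings are already plain lists of floats from some network. The Euclidean distance, the softmax over negated distances, the fixed-size chunking, the 0.5 threshold, and every function name are assumptions, since the publication text shown here names none of them.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Divide a sequence of frames into consecutive chunks (fixed size is an assumption)."""
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance; the choice of metric is an assumption."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cue_probability(query: Vector, positive: Vector, negatives: Sequence[Vector]) -> float:
    """Probability that the query chunk contains the cue.

    Follows the order of claim 11: average the negatives, compute the two
    distances, then map them to a probability. The softmax over negated
    distances is an assumed mapping.
    """
    negative_average = mean_vector(negatives)
    d_pos = euclidean(query, positive)
    d_neg = euclidean(query, negative_average)
    shift = max(-d_pos, -d_neg)  # subtract the max for numerical stability
    e_pos = math.exp(-d_pos - shift)
    e_neg = math.exp(-d_neg - shift)
    return e_pos / (e_pos + e_neg)


def cue_detected(query: Vector, positive: Vector, negatives: Sequence[Vector],
                 threshold: float = 0.5) -> bool:
    """True when the probability exceeds the (assumed) threshold."""
    return cue_probability(query, positive, negatives) > threshold
```

In this sketch a query that sits closer to the positive embedding than to the averaged negatives scores above 0.5, and a query equidistant from both scores exactly 0.5, so it is not reported as a detection.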

First claim

Opening claim text (preview).

What is claimed is:

1. A method for detecting a cue in an input video, the method comprising: presenting a user interface to record an example video of a user performing an act including the cue; determining a part of the example video where the cue occurs; applying a feature of the part to a neural network to generate a positive embedding; dividing the input video into a plurality of chunks and applying a feature of each chunk to a neural network to output a plurality of negative embeddings; applying a feature of a given one of the chunks to the neural network to output a query embedding; and determining whether the cue occurs in the input video from the query embedding, the positive embedding, and the negative embeddings.

2. The method of claim 1, wherein the cue includes a visual cue, and the neural network includes a visual embedding network configured to operate on video frames of audiovisual signals to generate visual embeddings.

3. The method of claim 2, wherein the cue further includes an audible cue, and the neural network includes an audio embedding network configured to operate on a spectrogram of an audio signal of the audiovisual signals to generate audio embeddings.

4. The method of claim 1, the determining of the part of the example video where the cue occurs comprises presenting a user interface to enable the user mark begin and end points within the example video where the corresponding cue occurs.

5. The method of claim 1, further comprising appending metadata to the example video including the begin and end points.

6. The method of claim 1, further comprises: extracting gestures from a first dataset of labeled gestures; extracting sounds from a second dataset of labeled sounds; combining a random one of the extracted gestures and a random one of the extracted sounds to generate an audio-visual class; repeating the combining until a plurality of audio-visual classes have been generated; and training the neural network to output a numerical vector for each of the plurality of audio-visual classes.

7. The method of claim 6, further comprises: randomly selecting a set of the audio-visual classes; choosing a subset of samples within each class of the set as a support set and the remaining samples as queries; applying the subset to the neural network to output training embeddings; applying the queries to the neural network to output query embeddings; and adjusting parameters of the neural network based on the training and query embeddings.

8. The method of claim 1, wherein the applying of the feature of the given one chunk to the neural network to output the query embedding comprises: extracting a current frame of the input video; and applying audio-visual features of the current frame to the neural network.

9. The method of claim 1, wherein the applying of the feature of the part to the neural network to generate the positive embedding comprises: applying audio features of the feature to a few-shot learning model trained to operate on audio features to output an audio embedding; and applying visual features of the feature to a few-shot learning model trained to operate on visual features to output a video embedding.

10. The method of claim 1, wherein the applying of the feature of the part to the neural network to generate the positive embedding comprises applying audio-visual features of the feature to a few-shot learning model trained to operate on a combination of audio and video features to output an audio-visual embedding.

11. The method of claim 1, wherein the determining whether the cue occurs in the input video comprises: averaging the negative embeddings to generate an average; determining a first distance from the query embedding to the positive embedding; determining a second distance from the query embedding to the average; determining a probability from the distances; and determining that the cue occurs in the input video when the probability exceeds a threshold.

12. A system configured to enable a user to create a cue that causes an action to be performed, the system comprising: a client device comprising a user interface configured to enable a user to identify a function to be performed when the cue is recognized and record an example video of the user performing the cue, and a computer program configured to record an input video of the user, wherein the client device outputs the example and input videos across a computer network; and a server configured to receive the example and input videos from the computer network, apply a feature of the example video to a few-shot learning model to output a positive vector, apply features of the entire input video to the few-shot learning model to output a negative vector, apply a feature of a part of the input video to the few-shot learning model to output a query vector, determine whether the cue has been detected in the input video based on the query vector, the positive vector, and the negative vector, and output information across the network to the client device when the cue has been detected, wherein the computer program is configured to perform the function upon receiving the information.

13. The system of claim 12, wherein the function causes presentation of motion graphics on a display device.

14. The system of claim 11, wherein the user interface is configured to enable the user to mark a start time and an end time of the example video where the cue occurs and the client device outputs information indicating the start and end times across the network to the server.

15. The system of claim 14, wherein the server applies audio-visual features of the example video between the start and end times to the few-shot learning model to generate the positive vector, and applies audio-visual features of the entire input video to the few-shot learning model to generate the negative vector.

16. A method for detecting a gesture and a sound occurring in an input video, the method comprising: presenting a user interface to record an example video of a user performing the gesture and making the sound; determining a first part of the example video where the sound occurs; determining a second part of the example video where the gesture occurs; applying an audio feature of the first part to a first neural network to generate a positive audio embedding; applying a video feature of the second part to a second neural network to generate a positive visual embedding; applying an audio feature of a part of the input video to the first neural network to output a query audio embedding; applying a visual feature of the part to the second neural network to output a query visual embedding; and determining whether the gesture and the sound occur in the input video from the query audio embedding, the query visual embedding, the positive audio embedding, the positive video embedding, and negative embeddings determined from the entire input video.

17. The method of claim 16, wherein the determining of whether the gesture and the sound occur comprises: dividing the entire input video into a plurality of chunks; applying an audio feature of each chunk to the first neural network to generate a plurality of negative audio embeddings among the negative embeddings; applying an audio feature of each chunk to the second neural network to generate a plurality of negative visual embeddings among the negative embeddings; and determining whether the gesture and the sound occur together in the input video from the query audio e
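The data preparation in claims 6 and 7 (pair random gestures with random sounds to form audio-visual classes, then split each selected class into a support set and queries) can be illustrated with a short Python sketch. It is a hedged example under stated assumptions, not the patent's training code. Labels stand in for the actual video and audio samples, the function names are hypothetical, duplicate pairs are skipped so that each class is distinct (the claim does not say this), and the embedding and parameter-update steps are left out because the claims do not specify a network or loss.

```python
import random
from typing import Dict, List, Sequence, Tuple

AVClass = Tuple[str, str]  # (gesture label, sound label)


def make_audio_visual_classes(gestures: Sequence[str], sounds: Sequence[str],
                              num_classes: int, rng: random.Random) -> List[AVClass]:
    """Combine a random gesture with a random sound until enough classes exist.

    Keeping only distinct pairs is an assumption made for this sketch.
    """
    max_classes = len(set(gestures)) * len(set(sounds))
    if num_classes > max_classes:
        raise ValueError("not enough distinct gesture/sound pairs")
    classes = set()
    while len(classes) < num_classes:
        classes.add((rng.choice(gestures), rng.choice(sounds)))
    return sorted(classes)


def sample_episode(samples_by_class: Dict[AVClass, List[str]], num_classes: int,
                   support_size: int, rng: random.Random
                   ) -> Dict[AVClass, Tuple[List[str], List[str]]]:
    """Randomly select classes and split each one's samples into support and queries."""
    chosen = rng.sample(sorted(samples_by_class), num_classes)
    episode = {}
    for label in chosen:
        samples = list(samples_by_class[label])
        rng.shuffle(samples)
        # The first `support_size` samples form the support set, the rest are queries.
        episode[label] = (samples[:support_size], samples[support_size:])
    return episode
```

A training loop would embed each support set and each query with the network and adjust the network's parameters from those embeddings, as claim 7 states; that step is omitted here.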

Assignees

Adobe Inc

Inventors

Classifications

G06N3/08
Learning methods · CPC title
G06V10/82
using neural networks · CPC title
G06N3/045
Combinations of networks · CPC title
G06V20/41
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
G06V40/20 (Primary)
Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title

Patent family

Related publications grouped by family.

View patent family 86500515

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023169795A1 cover?: A method for detecting a cue (e.g., a visual cue or a visual cue combined with an audible cue) occurring together in an input video includes: presenting a user interface to record an example video of a user performing an act including the cue; determining a part of the example video where the cue occurs; applying a feature of the part to a neural network to generate a positive embedding; dividi…
Who is the assignee on this patent?: Adobe Inc