What technology area does this patent fall under?

Primary CPC classification H04N19/30. Mapped technology areas include Electricity.

When was this patent published?

Publication date: Jan 16, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Hierarchical video encoders

US11876986B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11876986-B2
Application number	US-202218070556-A
Country	US
Kind code	B2
Filing date	Nov 29, 2022
Priority date	Jan 29, 2021
Publication date	Jan 16, 2024
Grant date	Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames, determining a plurality of segment representations representative of a plurality of video segments including one or more of the plurality of frames, the plurality of segment representations based at least in part on the plurality of frame representations, processing the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations, determining a video representation based at least in part on the plurality of contextualized segment representations, and providing the video representation as an output.
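The data flow in the abstract (frames, then frame representations, then segment representations, then contextualized segment representations, then one video representation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the names `encode_frame`, `contextualize`, and `encode_video` are hypothetical, segments are fixed-length, pooling is a plain mean, and both "encoders" are trivial stand-ins for the machine-learned models.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def encode_frame(frame: Sequence[float]) -> Vector:
    """Stand-in for the frame-level encoder (assumption: identity on frame features)."""
    return list(frame)


def contextualize(segments: Sequence[Vector]) -> List[Vector]:
    """Stand-in for the segment-level encoder.

    Assumption: each segment is mixed 50/50 with the mean of all segments,
    a crude substitute for attention across segments.
    """
    context = mean_pool(segments)
    return [[0.5 * s + 0.5 * c for s, c in zip(seg, context)] for seg in segments]


def encode_video(frames: Sequence[Sequence[float]], segment_len: int) -> Vector:
    """Frames -> frame reps -> segment reps -> contextualized reps -> video rep."""
    frame_reps = [encode_frame(f) for f in frames]
    # Assumption: segments are consecutive, fixed-length groups of frames.
    segment_reps = [
        mean_pool(frame_reps[i:i + segment_len])
        for i in range(0, len(frame_reps), segment_len)
    ]
    contextual = contextualize(segment_reps)
    # Assumption: the video representation is the mean of contextualized segments.
    return mean_pool(contextual)


frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
video_rep = encode_video(frames, segment_len=2)
```

The point of the hierarchy is that the segment-level stage only sees one vector per segment, so its cost scales with the number of segments rather than the number of frames.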

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for moment localization in a video corpus comprising a plurality of videos, the computer-implemented method comprising: obtaining, by a computing system comprising one or more computing devices, a user query, the user query comprising text; identifying, by the computing system, one or more highest likelihood videos of the plurality of videos, each highest likelihood video of the one or more highest likelihood videos identified based at least in part on a video-query compatibility score between the user query and a video representation of the highest likelihood video that is output by a machine-learned hierarchical video encoder model; and determining, by the computing system, a moment localization within a matching video of the one or more highest likelihood videos, the moment localization comprising a moment beginning and a moment end; wherein the moment beginning comprises a beginning frame of the matching video, the beginning frame having a frame representation that is classified as representing a beginning of a moment described by the user query; and wherein the moment end comprises an end frame of the matching video, the end frame having a frame representation that is classified as representing an end of the moment described by the user query.

2. The method of claim 1, wherein the machine-learned hierarchical video encoder model comprises: a frame-level encoder model configured to receive a plurality of frames of a video as input and provide, in response to receipt of the plurality of frames as input, a plurality of frame representations of the plurality of frames as output; and a segment-level encoder model configured to receive a plurality of segment representations as input and provide, in response to receipt of the plurality of segment representations as input, a plurality of contextualized segment representations as output.

3. The method of claim 2, wherein at least one of at least one of the frame-level encoder model or the segment-level encoder model is a multimodal encoder configured to produce a plurality of representations based at least in part on associated text, wherein the associated text comprises the user query.

4. The method of claim 2, wherein the video representation is based at least in part on the plurality of contextualized segment representations.

5. The method of claim 1, wherein the video representation of the highest likelihood video comprises a highest scoring segment representation of a plurality of segment representations of the highest likelihood video.

6. The method of claim 1, wherein the one or more highest likelihood videos are selected based at least in part on a negative log-likelihood of the one or more highest likelihood videos containing the moment described by the user query.

7. The method of claim 1, wherein the moment beginning and the moment end are identified by classifying each frame of the matching video as one of a beginning frame, an end frame, or an other frame.

8. The method of claim 1, further comprising providing, by the computing system, the moment localization for display to a user.

9. The method of claim 1, wherein a loss during training of the machine-learned hierarchical video encoder model comprises a contrastive loss between a compatibility score of positive video-query pairs and negative video-query pairs.

10. The method of claim 1, wherein a loss during training of the machine-learned hierarchical video encoder model comprises a cross-entropy loss between a predicted classification of each frame and a true label of each frame.

11. A computing system, the system comprising: one or more processors; one or more non-transitory computer readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a user query, the user query comprising text; identifying one or more highest likelihood videos of a plurality of videos, each highest likelihood video of the one or more highest likelihood videos identified based at least in part on a video-query compatibility score between the user query and a video representation of the highest likelihood video that is output by a machine-learned hierarchical video encoder model; and determining a moment localization within a matching video of the one or more highest likelihood videos, the moment localization comprising a moment beginning and a moment end; wherein the moment beginning comprises a beginning frame of the matching video, the beginning frame having a frame representation that is classified as representing a beginning of a moment described by the user query; and wherein the moment end comprises an end frame of the matching video, the end frame having a frame representation that is classified as representing an end of the moment described by the user query.

12. The system of claim 11, wherein one or more tokens of the user query are masked during training of the machine-learned hierarchical video encoder model.

13. The system of claim 11, wherein the beginning frame comprises a first temporal slice of the matching video, and wherein the end frame comprises a second temporal slice of the matching video.

14. The system of claim 11, wherein the moment localization comprises a plurality of sequential frames from the beginning frame to the end frame.

15. The system of claim 11, wherein the machine-learned hierarchical video encoder model comprises one or more cross-attentional transformer models.

16. One or more non-transitory computer readable media that collectively store instructions that, when executed by one or more processors, cause a computing system to perform operations, the operations comprising: obtaining a user query, the user query comprising text; identifying one or more highest likelihood videos of a plurality of videos, each highest likelihood video of the one or more highest likelihood videos identified based at least in part on a video-query compatibility score between the user query and a video representation of the highest likelihood video that is output by a machine-learned hierarchical video encoder model; and determining a moment localization within a matching video of the one or more highest likelihood videos, the moment localization comprising a moment beginning and a moment end; wherein the moment beginning comprises a beginning frame of the matching video, the beginning frame having a frame representation that is classified as representing a beginning of a moment described by the user query; and wherein the moment end comprises an end frame of the matching video, the end frame having a frame representation that is classified as representing an end of the moment described by the user query.

17. The one or more non-transitory computer readable media of claim 16, wherein the video representation is generated based on one or more segment representations, wherein the one or more segment representations are determined based on a plurality of frames of a segment associated with the one or more segment representations.

18. The one or more non-transitory computer readable media of claim 17, wherein the one or more segment representations are generated based at least in part on self-attention for the plurality of frames and cross-attention for the plurality of frames.

19. The one or more non-transitory computer readable media of claim 16, wherein the video representation is generated based at least in part on self-atte
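The two stages recited in the independent claims (rank videos by a video-query compatibility score, then pick a beginning and end frame in a matching video) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the dot product as the compatibility score, the function names `top_videos` and `localize`, and the rule of choosing the span that maximizes begin score plus end score are all hypothetical choices. The per-frame scores follow the begin / end / other labelling described in claim 7.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as an assumed video-query compatibility score."""
    return sum(x * y for x, y in zip(a, b))


def top_videos(query_vec: Vector, video_reps: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k videos with the highest compatibility score."""
    ranked = sorted(
        video_reps,
        key=lambda vid: dot(query_vec, video_reps[vid]),
        reverse=True,
    )
    return ranked[:k]


def localize(frame_scores: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Pick (begin_index, end_index) with begin <= end.

    frame_scores[i] holds assumed classifier scores for frame i in the
    order (begin, end, other). The span maximizing begin score plus end
    score is returned; this selection rule is an assumption.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    best_score, best_begin, best_end = None, 0, 0
    for b, begin_row in enumerate(frame_scores):
        for e in range(b, len(frame_scores)):
            score = begin_row[0] + frame_scores[e][1]
            if best_score is None or score > best_score:
                best_score, best_begin, best_end = score, b, e
    return best_begin, best_end


# Hypothetical toy data: three video representations and one query vector.
video_reps = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
candidates = top_videos([0.2, 0.9], video_reps, k=2)

# Hypothetical (begin, end, other) scores for four frames of the matching video.
scores = [[0.1, 0.0, 0.9], [0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.0, 0.9, 0.1]]
moment = localize(scores)
```

The nested loop is quadratic in the number of frames, which is acceptable for a sketch; a production system would more likely restrict the search to a maximum moment length.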

Assignees

Google LLC

Inventors

Classifications

G06N3/09
Supervised learning · CPC title
H04N19/30 · Primary
using hierarchical techniques, e.g. scalability (H04N19/63 takes precedence) · CPC title
G06N20/00
Machine learning · CPC title
H04N19/172 · Primary
the region being a picture, frame or field · CPC title
H04N19/177
the unit being a group of pictures [GOP] · CPC title

Patent family

Related publications grouped by family.

View patent family 82704221

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11876986B2 cover?: A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame represe…
Who is the assignee on this patent?: Google LLC