What technology area does this patent fall under?

Primary CPC classification H04N19/172. Mapped technology areas include Electricity.

When was this patent published?

Publication date: May 13, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Hierarchical video encoders

US12301847B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12301847-B2
Application number	US-202318529173-A
Country	US
Kind code	B2
Filing date	Dec 5, 2023
Priority date	Jan 29, 2021
Publication date	May 13, 2025
Grant date	May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames, determining a plurality of segment representations representative of a plurality of video segments including one or more of the plurality of frames, the plurality of segment representations based at least in part on the plurality of frame representations, processing the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations, determining a video representation based at least in part on the plurality of contextualized segment representations, and providing the video representation as an output.
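
The pipeline in the abstract (frames, then frame representations, then segment representations, then contextualized segment representations, then a single video representation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the L2-normalising frame encoder, mean pooling into fixed-size segments, a weight-free dot-product self-attention standing in for the machine-learned segment-level encoder, and all function names.

```python
import math
from typing import List

Vector = List[float]


def encode_frame(frame: Vector) -> Vector:
    """Toy stand-in for the frame-level encoder (assumption): L2-normalise features."""
    norm = math.sqrt(sum(x * x for x in frame)) or 1.0
    return [x / norm for x in frame]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def segment_representations(frame_reprs: List[Vector], frames_per_segment: int) -> List[Vector]:
    """Group consecutive frames into segments and pool each group (assumed pooling rule)."""
    return [
        mean_pool(frame_reprs[i:i + frames_per_segment])
        for i in range(0, len(frame_reprs), frames_per_segment)
    ]


def contextualize(segments: List[Vector]) -> List[Vector]:
    """Toy segment-level encoder (assumption): one pass of scaled dot-product
    self-attention with no learned weights, so each segment mixes in the others."""
    dim = len(segments[0])
    contextualized = []
    for query in segments:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in segments]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        contextualized.append(
            [sum(w / total * seg[d] for w, seg in zip(weights, segments)) for d in range(dim)]
        )
    return contextualized


def encode_video(frames: List[Vector], frames_per_segment: int = 2) -> Vector:
    """Run the full hierarchy and pool contextualized segments into one vector."""
    frame_reprs = [encode_frame(f) for f in frames]
    segments = segment_representations(frame_reprs, frames_per_segment)
    return mean_pool(contextualize(segments))


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    print(encode_video(toy_frames, frames_per_segment=2))
```

A real system would replace each toy function with a trained model; the sketch only shows how the two levels of the hierarchy feed into each other.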

First claim

Opening claim text (preview).

What is claimed is:
1. A computing system, the system comprising: one or more processors; one or more non-transitory computer readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a training dataset, wherein the training dataset comprises a search query, a ground-truth video, and a negative video-query pair, wherein the ground-truth video is responsive to the search query; processing the ground-truth video with a machine-learned hierarchical video encoder model to generate a plurality of contextualized segment representations, wherein each contextualized segment representation of the plurality of contextualized segment representations comprise segment-level semantic information for a respective video segment; determining a first video-query compatibility score based on the search query and the plurality of contextualized segment representations; determining a second video-query compatibility score based on a respective video representation and respective query of the negative video-query pair; evaluating a loss function that evaluates a difference between the first video-query compatibility score and the second video-query compatibility score; and adjusting one or more parameters of the machine-learned hierarchical video encoder model based at least in part on the loss function.
2. The system of claim 1, wherein the loss function comprises a negative log-likelihood loss.
3. The system of claim 1, wherein the operations comprise: obtaining a user query; and determining a moment localization that is responsive to the user query based on processing the user query and one or more videos with the machine-learned hierarchical video encoder model.
4. The system of claim 3, wherein the moment localization comprises a plurality of sequential frames from a beginning frame to an end frame, wherein the beginning frame comprises a first temporal slice of a matching video, and wherein the end frame comprises a second temporal slice of the matching video.
5. The system of claim 1, wherein the machine-learned hierarchical video encoder model comprises one or more cross-attentional transformer models.
6. One or more non-transitory computer readable media that collectively store instructions that, when executed by one or more processors, cause a computing system to perform operations, the operations comprising: obtaining a training dataset, wherein the training dataset comprises a search query, a ground-truth video, and a ground-truth moment localization, wherein the ground-truth video and the ground-truth moment localization are responsive to the search query, wherein the ground-truth video comprises a plurality of frames; processing the plurality of frames with a hierarchical video encoder model to generate a plurality of contextualized segment representations, wherein each contextualized segment representation of the plurality of contextualized segment representations comprise segment-level semantic information for a respective video segment of a plurality of video segments for the ground-truth video; processing the search query and the plurality of contextualized segment representations to determine a video-query compatibility score and a predicted moment localization, wherein the predicted moment localization comprises a particular video segment from the plurality of video segments; evaluating a loss function that evaluates a difference between the predicted moment localization and the ground-truth moment localization; and adjusting one or more parameters of the hierarchical video encoder model based at least in part on the loss function.
7. The one or more non-transitory computer readable media of claim 6, wherein the hierarchical video encoder model comprises a machine-learned frame-level encoder model and a machine-learned segment-level encoder model.
8. The one or more non-transitory computer readable media of claim 7, wherein the operations further comprise: obtaining a video, wherein the video comprises a plurality of frames; processing each of the plurality of frames with the machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames; determining a plurality of segment representations representative of a plurality of video segments, wherein each of the plurality of video segments comprise a subset of the plurality of frames that comprise a temporally linear sequence of frames; processing the plurality of segment representations with the machine-learned segment-level encoder model to generate a plurality of contextualized segment representations; determining a video representation based at least in part on the plurality of contextualized segment representations.
9. The one or more non-transitory computer readable media of claim 8, wherein the video representation is generated based on one or more segment representations, wherein the one or more segment representations are determined based on a plurality of frames of a segment associated with the one or more segment representations, wherein the one or more segment representations are generated based at least in part on self-attention for the plurality of frames and cross-attention for the plurality of frames, and, wherein each contextualized segment representation of the plurality of contextualized segment representations comprise segment-level semantic information for a respective video segment.
10. The one or more non-transitory computer readable media of claim 6, wherein the video representation is generated based at least in part on self-attention for a plurality of video segments of the highest likelihood video and cross-attention for the plurality of video segments, and wherein the plurality of segment representations are based at least in part on the plurality of frame representations.
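
As a rough illustration of the training signal in claims 1 and 2, and of the segment-based moment localization in claims 4 and 6, the sketch below computes a compatibility score for a positive and a negative video-query pair, then a two-way negative log-likelihood over those scores. The scoring rule (best dot product between the query vector and any contextualized segment), the fixed number of frames per segment, and every function name are hypothetical choices made for this example; the claims do not specify them.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def segment_scores(query: Sequence[float], segments: List[Sequence[float]]) -> List[float]:
    """Score every contextualized segment against the query (assumed: dot product)."""
    return [dot(query, seg) for seg in segments]


def compatibility_score(query: Sequence[float], segments: List[Sequence[float]]) -> float:
    """Hypothetical video-query compatibility: the best-matching segment's score."""
    return max(segment_scores(query, segments))


def pairwise_nll(pos_score: float, neg_score: float) -> float:
    """Negative log-likelihood of the positive pair under a softmax over
    {positive, negative}: log(1 + exp(neg - pos)), computed stably."""
    diff = neg_score - pos_score
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def localize_moment(
    query: Sequence[float], segments: List[Sequence[float]], frames_per_segment: int
) -> Tuple[int, int]:
    """Return (beginning frame, end frame) of the best-scoring segment,
    assuming equal-length, non-overlapping segments."""
    scores = segment_scores(query, segments)
    best = max(range(len(scores)), key=scores.__getitem__)
    start = best * frames_per_segment
    return start, start + frames_per_segment - 1


if __name__ == "__main__":
    query = [1.0, 0.0]
    ground_truth_segments = [[0.2, 0.9], [0.8, 0.1]]
    negative_query = [0.0, 1.0]                 # the negative pair carries its own query
    negative_segments = [[0.9, 0.1], [0.7, 0.3]]

    pos = compatibility_score(query, ground_truth_segments)
    neg = compatibility_score(negative_query, negative_segments)
    print("loss:", pairwise_nll(pos, neg))
    print("moment:", localize_moment(query, ground_truth_segments, frames_per_segment=4))
```

The loss shrinks as the positive score pulls ahead of the negative score, which is the difference the claimed loss function evaluates; in training, its gradient would drive the parameter updates of the encoder.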

Assignees

Google LLC

Inventors

Classifications

G06N3/09
Supervised learning · CPC title
H04N19/172 · Primary
the region being a picture, frame or field · CPC title
G06N20/00
Machine learning · CPC title
G06N3/045
Combinations of networks · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title

Patent family

Related publications grouped by family.

View patent family 82704221

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12301847B2 cover?: A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame represe…
Who is the assignee on this patent?: Google LLC