Hierarchical video encoders

US11533495B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11533495-B2
Application number	US-202117162150-A
Country	US
Kind code	B2
Filing date	Jan 29, 2021
Priority date	Jan 29, 2021
Publication date	Dec 20, 2022
Grant date	Dec 20, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames, determining a plurality of segment representations representative of a plurality of video segments including one or more of the plurality of frames, the plurality of segment representations based at least in part on the plurality of frame representations, processing the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations, determining a video representation based at least in part on the plurality of contextualized segment representations, and providing the video representation as an output.
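The data flow in the abstract (frames, then segments, then one video-level vector) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `encode_frame` and `contextualize` are hypothetical stand-ins for the machine-learned frame-level and segment-level encoder models, mean pooling is an assumed way to combine representations, and none of the function names come from the patent.

```python
from statistics import fmean

Vector = list[float]


def encode_frame(frame: list[float]) -> Vector:
    """Hypothetical stand-in for the machine-learned frame-level encoder."""
    # Toy features only: mean and peak pixel value of the frame.
    return [fmean(frame), max(frame)]


def mean_pool(vectors: list[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    return [fmean(column) for column in zip(*vectors)]


def segment_representations(frame_reps: list[Vector], segment_len: int) -> list[Vector]:
    """Pool consecutive frame representations into one vector per segment."""
    return [
        mean_pool(frame_reps[i:i + segment_len])
        for i in range(0, len(frame_reps), segment_len)
    ]


def contextualize(segment_reps: list[Vector], mix: float = 0.5) -> list[Vector]:
    """Hypothetical stand-in for the segment-level encoder.

    Blends each segment with the mean of all segments, so every output
    depends on the whole video rather than on one segment alone.
    """
    context = mean_pool(segment_reps)
    return [
        [(1 - mix) * s + mix * c for s, c in zip(segment, context)]
        for segment in segment_reps
    ]


def encode_video(frames: list[list[float]], segment_len: int) -> tuple[Vector, list[Vector]]:
    """Run the pipeline: frames -> segments -> contextualized segments -> video."""
    frame_reps = [encode_frame(frame) for frame in frames]
    segment_reps = segment_representations(frame_reps, segment_len)
    contextualized = contextualize(segment_reps)
    video_rep = mean_pool(contextualized)
    return video_rep, contextualized


# Four toy "frames" of two pixel values each, grouped into segments of two frames.
frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
video_rep, contextualized = encode_video(frames, segment_len=2)
print(contextualized)  # [[3.0, 4.0], [5.0, 6.0]]
print(video_rep)       # [4.0, 5.0]
```

Returning both the video-level vector and the contextualized segment vectors mirrors the output recited in the first claim; a learned system would replace both stand-in functions with trained models.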

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating video representations utilizing a hierarchical video encoder, the method comprising: obtaining, by a computing system comprising one or more computing devices, a video, wherein the video comprises a plurality of frames; processing, by the computing system, each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames; determining, by the computing system, a plurality of segment representations representative of a plurality of video segments, wherein each of the plurality of video segments comprise a subset of the plurality of frames that comprise a temporally linear sequence of frames, the plurality of segment representations based at least in part on the plurality of frame representations; processing, by the computing system, the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations, wherein each contextualized segment representation of the plurality of contextualized segment representations comprise segment-level semantic information for a respective video segment; determining, by the computing system, a video representation based at least in part on the plurality of contextualized segment representations; and providing, by the computing system, the video representation and one or more of the plurality of contextualized segment representations as an output.

2. The computer-implemented method of claim 1, wherein at least one of the frame-level encoder model or the segment-level encoder model is a multimodal encoder configured to produce a plurality of representations based at least in part on associated text; and wherein the method further comprises: processing, by the computing system, the associated text with the machine-learned frame-level encoder model to produce the plurality of frame representations, wherein the plurality of frame representations are based at least in part on the associated text; and processing, by the computing system, the associated text with the machine-learned segment-level encoder model to produce the plurality of contextualized segment representations, wherein the plurality of contextualized segment representations are based at least in part on the associated text.

3. The computer-implemented method of claim 2, wherein the associated text comprises a user query.

4. The computer-implemented method of claim 2, wherein the associated text comprises captioning for the video.

5. The computer-implemented method of claim 2, wherein the associated text is encoded.

6. The computer-implemented method of claim 1, wherein the machine-learned frame-level encoder model and the machine-learned segment-level encoder model comprise one or more shared parameters.

7. The computer-implemented method of claim 1, wherein the plurality of segment representations comprise a context token.

8. The computer-implemented method of claim 1, wherein the plurality of video segments are nonoverlapping.

9. The computer-implemented method of claim 1, wherein the plurality of video segments have about equal length.

10. A computing system, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining text data, wherein the text data is descriptive of a search query; obtaining a video, wherein the video comprises a plurality of frames; processing the text data and each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames; determining a plurality of segment representations representative of a plurality of video segments, wherein each of the plurality of video segments comprise a subset of the plurality of frames that comprise a temporally linear sequence of frames, the plurality of segment representations based at least in part on the plurality of frame representations; processing the text data and the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations; determining a video representation based at least in part on the plurality of contextualized segment representations; and providing the video representation and one or more of the plurality of contextualized segment representations as an output.

11. The computing system of claim 10, wherein each of the plurality of frame representations comprise frame-level semantic information for a respective frame.

12. The computing system of claim 10, wherein each of the plurality of contextualized segment representations comprise segment-level semantic information for a respective video segment.

13. The computing system of claim 10, wherein the video representation comprises video-level semantic information.

14. The computing system of claim 10, wherein one or more of the plurality of contextualized segment representations comprise coarse-grained semantic information and fine-grained semantic information descriptive of a respective video segment.

15. The computing system of claim 10, wherein the operations further comprise: determining a starting frame of a video segment based on the plurality of frame representations.

16. The computing system of claim 15, wherein the operations further comprise: determining an ending frame of the video segment based on the plurality of frame representations.

17. The computing system of claim 16, wherein one or more of the plurality of segment representations are generated based in part on the starting frame and the ending frame.

18. The computing system of claim 16, wherein the operations further comprise: providing the starting frame and the ending frame with the video representation and the one or more of the plurality of contextualized segment representations.

19. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: obtaining text data, wherein the text data is descriptive of a search query; obtaining a video based on the text data, wherein the video comprises a plurality of frames; processing the text data and each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame representations respective to the plurality of frames; determining a plurality of segment representations representative of a plurality of video segments, wherein each of the plurality of video segments comprise a subset of the plurality of frames that comprise a temporally linear sequence of frames, the plurality of segment representations based at least in part on the plurality of frame representations; processing the text data and the plurality of segment representations with a machine-learned segment-level encoder model to generate a plurality of contextualized segment representations; det…
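Claims 8 and 9 recite nonoverlapping video segments of about equal length, and claims 15 to 18 refer to a starting frame and an ending frame for a segment. A minimal sketch of one way to produce such boundaries follows. It assumes segments are chosen purely by frame index, whereas the claims describe determining the starting and ending frames based on the frame representations; the function name `split_into_segments` is hypothetical and does not appear in the patent.

```python
def split_into_segments(num_frames: int, num_segments: int) -> list[tuple[int, int]]:
    """Return (start, end) frame indices, end inclusive, for contiguous segments.

    Assumption: boundaries depend only on frame indices. Segments do not
    overlap, together cover every frame, and differ in length by at most one.
    """
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    base, extra = divmod(num_frames, num_segments)
    bounds = []
    start = 0
    for i in range(num_segments):
        # The first `extra` segments absorb the leftover frames, one each.
        length = base + (1 if i < extra else 0)
        bounds.append((start, start + length - 1))
        start += length
    return bounds


print(split_into_segments(10, 3))  # [(0, 3), (4, 6), (7, 9)]
```

Each `(start, end)` pair marks a temporally contiguous run of frames, so the frame representations inside it can be pooled into a single segment representation.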

Assignees

Google LLC

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title
H04N19/172 · Primary
the region being a picture, frame or field · CPC title
H04N19/136
Incoming video signal characteristics or properties · CPC title
G06N20/00
Machine learning · CPC title

Patent family

Related publications grouped by family.

View patent family 82704221

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11533495B2 cover?: A computer-implemented method for generating video representations utilizing a hierarchical video encoder includes obtaining a video, wherein the video includes a plurality of frames, processing each of the plurality of frames with a machine-learned frame-level encoder model to respectively generate a plurality of frame representations for the plurality of frames, the plurality of frame represe…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification H04N19/172. Mapped technology areas include Electricity.
When was this patent published?: Publication date Dec 20, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).