Generative video compression with a transformer-based discriminator

US12530588B2 · US · B2

Patent metadata
Field                 Value
Publication number    US-12530588-B2
Application number    US-202217971546-A
Country               US
Kind code             B2
Filing date           Oct 21, 2022
Priority date         Oct 21, 2022
Publication date      Jan 20, 2026
Grant date            Jan 20, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, an apparatus, and a non-transitory computer-readable storage medium for video compression using a generative adversarial network (GAN) are provided. The method includes obtaining, by a generator of the GAN, a reconstructed target frame based on a reference frame and a raw target frame to be reconstructed; concatenating, by a transformer-based discriminator of the GAN, the reference frame, the raw target frame and the reconstructed target frame to obtain a paired data; determining, by the transformer-based discriminator of the GAN, whether the paired data is real or fake to guide reconstruction of the raw target frame; and determining a generator loss and a transformer-based discriminator loss, and performing gradient back propagation and updating network parameters of the GAN based on the generator loss and the transformer-based discriminator loss.
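The adversarial part of the abstract can be illustrated with a minimal sketch. It assumes the standard binary cross-entropy GAN objective, which the abstract does not spell out, and it treats the paired data as a "real" stack (reference plus raw target) and a "fake" stack (reference plus reconstructed target). The helper names (`concat_frames`, `bce`, `discriminator_loss`, `generator_adversarial_loss`), the toy frames, and the probabilities are all hypothetical and not taken from the patent.

```python
import math

def concat_frames(*frames):
    """Stack frames along the channel axis (each frame is a list of channels)."""
    return [channel for frame in frames for channel in frame]

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for a single discriminator probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def discriminator_loss(p_real, p_fake):
    """Assumed form: push real pairs toward 1 and fake pairs toward 0."""
    return bce(p_real, 1.0) + bce(p_fake, 0.0)

def generator_adversarial_loss(p_fake):
    """Assumed non-saturating form: the generator wants fakes scored as real."""
    return bce(p_fake, 1.0)

# Toy frames: one channel, two pixels each (illustrative values only).
reference = [[0.5, 0.5]]
raw_target = [[0.6, 0.4]]
reconstructed = [[0.58, 0.43]]

real_input = concat_frames(reference, raw_target)     # "true" pair
fake_input = concat_frames(reference, reconstructed)  # "fake" pair

# Made-up discriminator outputs standing in for the transformer-based network.
p_real, p_fake = 0.9, 0.2
d_loss = discriminator_loss(p_real, p_fake)
g_loss = generator_adversarial_loss(p_fake)
print(f"d_loss={d_loss:.4f} g_loss={g_loss:.4f}")
```

In a real training loop both losses would be back-propagated to update the discriminator and generator parameters; here the probabilities are fixed numbers so that only the loss arithmetic is shown.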

First claim

Opening claim text (preview).

What is claimed is:

1. A method for video compression, performed by a terminal using a generative adversarial network (GAN), comprising: obtaining, by a generator of the GAN, a reconstructed target frame based on a reference frame and a raw target frame to be reconstructed; concatenating, by a transformer-based discriminator of the GAN, the reference frame, the raw target frame and the reconstructed target frame to obtain a paired data, wherein the transformer-based discriminator is configured to model long-distance dependencies across the reference frame, the raw target frame and the reconstructed target frame; determining, by the transformer-based discriminator of the GAN, whether the paired data is real or fake to guide reconstruction of the raw target frame, wherein the reconstruction of the raw target frame comprises encoding and decoding of target frames; determining a generator loss and a transformer-based discriminator loss, and performing gradient back propagation and updating network parameters of the GAN based on the generator loss and the transformer-based discriminator loss to obtain a trained GAN for the terminal; and compressing, by the trained GAN of the terminal, a video stream for communication, storage, or processing.

2. The method for video compression of claim 1, wherein obtaining the reconstructed target frame based on the reference frame and the raw target frame to be reconstructed further comprises: obtaining the reference frame and the raw target frame to be reconstructed; obtaining, by a motion estimation network of the generator, an estimated motion based on the reference frame and the raw target frame; encoding, by a motion encoder network of the generator, the estimated motion to obtain an encoded motion, quantizing the encoded motion into a quantized encoded motion, and converting the quantized encoded motion into a bit stream with entropy encoding; decoding, by a motion decoder network of the generator, the bit stream with entropy decoding, dequantizing and decoding the bit stream to obtain a decoded motion, and warping the decoded motion with the reference frame to obtain a warped target frame; and concatenating the warped target frame, the reference frame and a reconstructed motion together as a tensor, and obtaining a predicted target frame by feeding the tensor into a motion compensation convolutional neural network.

3. The method for video compression of claim 2, further comprising: subtracting the predicted target frame from the raw target frame to obtain a residue; obtaining, by a residue encoder network of the generator, an encoded residue by feeding the residue into the residue encoder network, quantizing the encoded residue into a quantized encoded residue, and converting the quantized encoded residue into a residual bit stream with entropy encoding; and decoding and dequantizing, by a residue decoder network of the generator, the residual bit stream to obtain a reconstructed residue, and adding the reconstructed residue to the predicted target frame to obtain a reconstructed target frame.

4. The method for video compression of claim 1, wherein the raw target frame to be reconstructed comprises a plurality of raw target frames to be constructed, and the plurality of raw target frames to be constructed are generated sequentially.

5. The method for video compression of claim 2, wherein obtaining the reference frame and the raw target frame to be reconstructed comprises: obtaining a compressed intra I frame and a plurality of raw frames as inputs to the generator; setting the compressed intra I frame as a first reference frame to generate a first reconstructed target P frame; and setting the first reconstructed target P frame as a second reference frame to generate a second reconstructed target P frame.

6. The method for video compression of claim 1, wherein concatenating the reference frame, the raw target frame and the reconstructed target frame to obtain a paired data further comprises: concatenating, by the transformer-based discriminator, a quantized encoded motion and a quantized encoded residue together to feed into a Spatial Feature Extractor (SFE) to obtain an extracted feature, and concatenating the extracted feature and an estimated flow to form a condition; and concatenating the raw target frame, the raw reference frame and the condition as a true data, and concatenating the generated target frame, the generated reference frame and the condition as a fake data; and obtaining the paired data comprising the true data and the fake data.

7. The method for video compression of claim 1, wherein determining whether the paired data is real or fake further comprises: feeding the paired data into a feature extraction convolutional neural network, and flattening the extracted feature to obtain a flattened feature; obtaining a transformed feature by feeding the flattened feature into a transformer block; and determining whether the transformed feature is real or fake by feeding the transformed feature into a multi-layer perceptron head and a sigmoid activation function.

8. The method for video compression of claim 1, wherein determining the generator loss and the transformer-based discriminator loss further comprises: determining the generator loss for reconstructing decoded frames by determining five terms, wherein the five terms comprise an adversarial loss term, a distortion loss term, a feature matching loss term, an entropy loss term, and a perceptual loss term; and determining the discriminator loss based on last-layer discriminator probability obtained from both a reconstructed target frame and a raw target frame.

9. The method for video compression of claim 8, further comprising: determining the adversarial loss term based on the reconstructed target frame; determining the distortion loss term based on a mean squared error (MSE) between the raw target frame and the reconstructed target frame; determining the feature matching loss term based on a MSE between discriminator features extracted from three scales of the reconstructed target frame and the raw target frame; determining the entropy loss term based on an estimated entropy of a quantized encoded motion and a residue; and determining the perceptual loss term based on a summed MSE between a true data feature and a fake data feature extracted from five different layers.

10. An apparatus for video compression, for use in a terminal, comprising: one or more processors; and a memory configured to store a generative adversarial network (GAN) comprising a generator and a transformer-based discriminator, the GAN being executable by the one or more processors, wherein the one or more processors, upon execution of the instructions, are configured to: obtain a reconstructed target frame based on a reference frame and a raw target frame to be reconstructed; concatenate the reference frame, the raw target frame and the reconstructed target frame to obtain a paired data, wherein the transformer-based discriminator is configured to model long-distance dependencies across the reference frame, the raw target frame and the reconstructed target frame; determine whether the paired data is real or fake to guide reconstruction of the raw target frame, wherein the reconstruction of the raw target frame comprises encoding and decoding of target frames; determine a generator loss and a transformer-based discriminator loss, and perform gradient back propagation and update network parameters of the GAN based on the generator loss and the transformer-based discriminator loss to obtain a trained GAN for the terminal; and compressing, by the trained GAN of the termi
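
The predict-then-code-the-residue flow of claims 2 and 3, together with the reference chaining of claim 5, can be sketched as follows. This is a simplified illustration under stated assumptions: frames are flat lists of pixel values, the learned motion and residue networks are replaced by a zero-motion prediction and a uniform rounding quantizer, and entropy coding is omitted. The function names and the `step` parameter are hypothetical and do not come from the patent.

```python
def quantize(values, step):
    """Uniform scalar quantization: map each value to an integer index."""
    return [round(v / step) for v in values]

def dequantize(indices, step):
    """Inverse of quantize, up to the rounding error."""
    return [i * step for i in indices]

def reconstruct_p_frame(reference, raw_target, step=0.1):
    """Code one P frame against a reference frame.

    Assumption: the prediction is just a copy of the reference (zero motion),
    standing in for motion estimation, warping and motion compensation.
    """
    predicted = list(reference)
    residue = [r - p for r, p in zip(raw_target, predicted)]
    indices = quantize(residue, step)         # what would be entropy-coded
    rec_residue = dequantize(indices, step)   # what the decoder recovers
    return [p + d for p, d in zip(predicted, rec_residue)]

def compress_sequence(i_frame, raw_frames, step=0.1):
    """Chain references: each reconstructed P frame becomes the next reference."""
    reference = i_frame
    reconstructed = []
    for raw in raw_frames:
        rec = reconstruct_p_frame(reference, raw, step)
        reconstructed.append(rec)
        reference = rec  # the decoder only ever sees reconstructed frames
    return reconstructed

# Toy data: a 4-pixel I frame followed by two raw frames.
i_frame = [0.5, 0.5, 0.5, 0.5]
raw_frames = [[0.52, 0.47, 0.61, 0.50], [0.55, 0.40, 0.70, 0.45]]
decoded = compress_sequence(i_frame, raw_frames, step=0.1)
for raw, rec in zip(raw_frames, decoded):
    print([round(x, 2) for x in rec], "from", raw)
```

Using the reconstructed frame rather than the raw frame as the next reference keeps encoder and decoder in step, so in this sketch the per-pixel error stays within half a quantization step for every frame instead of accumulating along the sequence.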

Assignees

Univ Santa Clara, Kwai Inc, Beijing Dajia Internet Information Tech Co Ltd

Classifications

  • Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction · CPC title

  • for estimating the reliability of the determined motion vectors or motion vector field, e.g. for smoothing the motion vector field or for correcting motion vectors · CPC title

  • Entropy coding, e.g. variable length coding [VLC] or arithmetic coding · CPC title

  • Probabilistic or stochastic networks · CPC title

  • the region being a picture, frame or field · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12530588B2 cover?
A method, an apparatus, and a non-transitory computer-readable storage medium for video compression using a generative adversarial network (GAN) are provided. The method includes obtaining, by a generator of the GAN, a reconstructed target frame based on a reference frame and a raw target frame to be reconstructed; concatenating, by a transformer-based discriminator of the GAN, the reference fr…
Who is the assignee on this patent?
Univ Santa Clara, Kwai Inc, Beijing Dajia Internet Information Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 20, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).