Efficient CNN-based solution for video frame interpolation

US10896356B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10896356-B2
Application number: US-201916409142-A
Country: US
Kind code: B2
Filing date: May 10, 2019
Priority date: May 10, 2019
Publication date: Jan 19, 2021
Grant date: Jan 19, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system of convolutional neural networks (CNNs) that synthesize middle non-existing frames from pairs of input frames includes a coarse CNN that receives a pair of images acquired at consecutive points of time, a registration module, a refinement CNN, an adder, and a motion-compensated frame interpolation (MC-FI) module. The coarse CNN outputs from the pair of images a previous feature map, a next feature map, a coarse interpolated motion vector field (IMVF) and an occlusion map, the registration module uses the coarse IMVF to warp the previous and next feature maps to be aligned with pixel locations of the IMVF frame, and outputs registered previous and next feature maps, the refinement CNN uses the registered previous and next feature maps to correct the coarse IMVF, and the adder sums the coarse IMVF with the correction and outputs a final IMVF.
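The registration and adder steps in the abstract can be illustrated with a small NumPy sketch. It is not the patented implementation: the function names (`warp_nearest`, `register`, `final_imvf`), the nearest-neighbour sampling, and the symmetric-motion convention (a middle-frame pixel at p is looked up at p - v in the previous frame and at p + v in the next frame) are all assumptions made for illustration. The correction that the refinement CNN would produce is simply passed in as an array.

```python
import numpy as np


def warp_nearest(feat, flow, sign):
    """Backward-warp feat (H, W, C) by sign * flow (H, W, 2).

    flow[..., 0] is the horizontal (x) component, flow[..., 1] the vertical (y).
    Nearest-neighbour sampling with border clamping is an illustrative choice.
    """
    h, w = feat.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + sign * flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + sign * flow[..., 1]).astype(int), 0, h - 1)
    return feat[src_y, src_x]


def register(prev_feat, next_feat, coarse_imvf):
    """Align both feature maps with the pixel grid of the interpolated frame.

    Assumption: symmetric motion about the middle frame, so the previous map
    is sampled at p - v and the next map at p + v.
    """
    reg_prev = warp_nearest(prev_feat, coarse_imvf, -1.0)
    reg_next = warp_nearest(next_feat, coarse_imvf, +1.0)
    return reg_prev, reg_next


def final_imvf(coarse_imvf, correction):
    """The adder: final IMVF = coarse IMVF + correction from the refinement CNN."""
    return coarse_imvf + correction


# Tiny example: a 1x4 single-channel feature map and a uniform 1-pixel
# horizontal motion field.
feat = np.arange(4, dtype=float).reshape(1, 4, 1)
flow = np.zeros((1, 4, 2))
flow[..., 0] = 1.0
reg_prev, reg_next = register(feat, feat, flow)
refined = final_imvf(flow, np.full_like(flow, 0.5))
```

In a trained system the warp would normally be differentiable (for example bilinear) so that gradients can reach the coarse CNN; nearest-neighbour sampling is used here only to keep the sketch short.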

First claim

Opening claim text (preview).

What is claimed is:

1. A system that uses convolutional neural networks (CNNs) to synthesize middle non-existing frames from pairs of input frames in a given video, comprising: a coarse convolutional neural network (CNN) that receives a pair of images acquired at consecutive points of time, wherein the pair of images includes a previous image and a next image; a registration module connected to the coarse CNN; a refinement CNN connected to the registration module and the coarse CNN; an adder connected to the refinement CNN and the coarse CNN; and a motion-compensated frame interpolation (MC-FI) module connected to the adder and the coarse CNN, wherein the coarse CNN outputs a previous feature map and a next feature map from the previous image and the next image, a coarse interpolated motion vector field (IMVF) and an occlusion map from the pair of images, the registration module uses the coarse IMVF to warp the previous and next feature maps to be aligned with pixel locations of the IMVF frame, and outputs a registered previous feature map and a registered next feature map, the refinement CNN uses the registered previous feature map and a registered next feature map to correct the coarse IMVF, and the adder sums the coarse IMVF with the correction to the IMVF and outputs a final IMVF.

2. The system of claim 1, wherein the motion-compensated frame interpolation (MC-FI) module generates an interpolated frame corresponding to a time between the time points of the previous frame and the next frame by warping the previous image and the next image using the final IMVF and performing a weighted blending of the warped previous and next images using occlusion weights from the occlusion map.

3. The system of claim 1, wherein the coarse CNN receives the pair of images in a plurality of resolution levels, wherein the coarse CNN includes a feature extraction sub-network that generates a pair of feature maps that correspond to each image of the pair of images at each level of resolution, an encoder-decoder sub-network that concatenates the pair of feature maps at each level of resolution into a single feature map and processes the single feature map to produce a new feature map with downscaled spatial resolution, a fusion sub-network that merges the new single feature maps at each level of resolution into a single merged feature map by performing a weighted average of the feature maps for each level of resolution wherein the weights are learned in a training phase and differ for each pixel, and an estimation sub-network that outputs horizontal and vertical components of the coarse IMVF and an occlusion map, and wherein the feature extraction sub-network includes Siamese layers.

4. The system of claim 3, wherein the estimation sub-network includes a horizontal sub-module, a vertical sub-module and an occlusion map sub-module, wherein each sub-module receives the merged feature map output from the fusion sub-network, wherein the horizontal and vertical sub-modules respectively output a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherein each probability value represents a probability for a motion vector to be one of S displacement values for that pixel, wherein the horizontal and vertical sub-modules respectively calculate a first moment of the probability values for each pixel to determine expected horizontal and vertical components for each pixel, wherein the pairs of expected horizontal and vertical components for each pixel comprise the coarse IMVF.

5. The system of claim 4, wherein the occlusion map sub-module outputs the occlusion map, which comprises per-pixel weights for performing a weighted average between the previous image and the next image.

6. The system of claim 3, wherein the refinement CNN includes an encoder-decoder sub-network that concatenates the registered previous feature map and the registered next feature map and outputs a new set of feature maps with spatial resolution resized with respect to a full resolution of the previous image and the next image, and an estimation sub-network that estimates corrections to the horizontal and vertical components of the coarse IMVF for each block in the registered next and previous feature maps to output the corrected IMVF.

7. The system of claim 6, wherein the estimation sub-network includes a horizontal sub-module and a vertical sub-module, wherein the horizontal and vertical sub-modules respectively output a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherein each probability value represents a probability for a motion vector to be one of S displacement values for that pixel, wherein the horizontal and vertical sub-modules respectively calculate a first moment of the probability values for each pixel to determine expected horizontal and vertical components for each pixel, wherein the pairs of expected horizontal and vertical components for each pixel comprise the correction to the IMVF.

8. A method of using convolutional neural networks (CNNs) to synthesize middle non-existing frames from pairs of input frames in a given video, comprising the steps of: receiving a pyramid representation of a pair of consecutive input frames, wherein the pair of consecutive input frames includes a previous image and a next image, wherein the pyramid representation includes a plurality of pairs of input frames, each at a different spatial resolution level; generating a pair of feature maps from each resolution level of the pyramid representation and estimating a coarse interpolated motion vector field (IMVF) and an occlusion map from each pair of feature maps; registering pairs of feature maps at the same resolution level according to the coarse IMVF and the occlusion map by warping each feature map of the pair of feature maps to be aligned with pixel locations of the coarse IMVF and outputting a registered previous feature map and a registered next feature map; correcting the coarse IMVF using the registered previous feature map and the registered next feature map to generate a correction to the IMVF; adding the correction to the IMVF to the coarse IMVF to generate a refined IMVF; and producing a synthesized middle frame from the pair of consecutive input frames, the refined IMVF and the occlusion map.

9. The method of claim 8, wherein generating a pair of feature maps comprises generating a pair of features maps for each of the plurality of pairs of input frames at each spatial resolution, wherein each pair of features maps has a spatial resolution downscaled with respect to a resolution of the pair of input frames; concatenating the feature maps at each resolution level and processing the concatenated feature maps to generate a new set of feature maps with downscaled spatial resolution with respect to a resolution of the pair of consecutive input frames, merging the new set of feature maps for all spatial resolution levels into a single merged feature map by performing a weighted average of the feature maps for each level of resolution wherein the weights are learned in a training phase and differ for each pixel; and estimating for each block in the merged feature map horizontal and vertical components of the coarse IMVF, and an occlusion map, wherein the occlusion map comprises per-pixel weights for performing a weighted average between the previous image and the next image.

10. The method of claim 9, wherein estimating horizontal and vertical components of the coarse IMVF comprises: generating a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherei

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • G06T3/4007 (Primary)

    based on interpolation, e.g. bilinear interpolation (image demosaicing G06T3/4015; edge-driven or edge-based scaling G06T3/403) · CPC title

  • relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title

  • Classification techniques · CPC title

  • using neural networks · CPC title

  • Coarse or fine approaches, e.g. resolution of ambiguities or multiscale approaches · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10896356B2 cover?
A system of convolutional neural networks (CNNs) that synthesize middle non-existing frames from pairs of input frames includes a coarse CNN that receives a pair of images acquired at consecutive points of time, a registration module, a refinement CNN, an adder, and a motion-compensated frame interpolation (MC-FI) module. The coarse CNN outputs from the pair of images a previous feature map, a …
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T3/4007. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 19, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).