Frame interpolation via adaptive convolution and adaptive separable convolution
US-2020012940-A1 · Jan 9, 2020 · US
US10896356B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10896356-B2 |
| Application number | US-201916409142-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 10, 2019 |
| Priority date | May 10, 2019 |
| Publication date | Jan 19, 2021 |
| Grant date | Jan 19, 2021 |
Official abstract text for this publication.
A system of convolutional neural networks (CNNs) that synthesize middle non-existing frames from pairs of input frames includes a coarse CNN that receives a pair of images acquired at consecutive points of time, a registration module, a refinement CNN, an adder, and a motion-compensated frame interpolation (MC-FI) module. The coarse CNN outputs from the pair of images a previous feature map, a next feature map, a coarse interpolated motion vector field (IMVF) and an occlusion map, the registration module uses the coarse IMVF to warp the previous and next feature maps to be aligned with pixel locations of the IMVF frame, and outputs registered previous and next feature maps, the refinement CNN uses the registered previous and next feature maps to correct the coarse IMVF, and the adder sums the coarse IMVF with the correction and outputs a final IMVF.
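The last two stages named in the abstract, the adder and the MC-FI module, can be illustrated with a small sketch. It is not the patented implementation. It assumes a symmetric motion convention (the previous image is sampled at `p - v` and the next image at `p + v` for a middle-frame pixel `p` with motion vector `v`), nearest-neighbour sampling with border clamping, and an occlusion weight of 1.0 meaning "use only the previous image". None of these details are stated in the abstract, and the function names `add_fields`, `sample`, and `mc_fi` are hypothetical.

```python
def add_fields(coarse, correction):
    """Adder: element-wise sum of two motion fields stored as (dx, dy) pairs."""
    return [[(c[0] + d[0], c[1] + d[1]) for c, d in zip(crow, drow)]
            for crow, drow in zip(coarse, correction)]


def sample(img, x, y):
    """Nearest-neighbour lookup with border clamping (a simplifying assumption)."""
    h, w = len(img), len(img[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return img[yi][xi]


def mc_fi(prev, nxt, imvf, occlusion):
    """Warp both inputs toward the middle frame, then blend per pixel.

    Assumed convention: occlusion[y][x] is the weight of the warped previous
    image and (1 - weight) is the weight of the warped next image.
    """
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = imvf[y][x]
            p = sample(prev, x - dx, y - dy)   # previous image, moved forward
            n = sample(nxt, x + dx, y + dy)    # next image, moved backward
            a = occlusion[y][x]
            out[y][x] = a * p + (1.0 - a) * n
    return out


# Toy example: a bright pixel moves from x=1 to x=3 between the two frames.
prev = [[0, 10, 0, 0, 0]]
nxt = [[0, 0, 0, 10, 0]]
coarse = [[(0.75, 0.0)] * 5]        # hypothetical coarse IMVF
correction = [[(0.25, 0.0)] * 5]    # hypothetical refinement output
occlusion = [[0.5] * 5]             # equal trust in both inputs

final_imvf = add_fields(coarse, correction)
frame = mc_fi(prev, nxt, final_imvf, occlusion)
# frame == [[0.0, 0.0, 10.0, 0.0, 0.0]]: the pixel lands halfway, at x=2
```

A production system would more likely use bilinear sampling on tensors; nearest-neighbour lookup is used here only to keep the sketch dependency-free.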
Opening claim text (preview).
What is claimed is:
1. A system that uses convolutional neural networks (CNNs) to synthesize middle non-existing frames from pairs of input frames in a given video, comprising: a coarse convolutional neural network (CNN) that receives a pair of images acquired at consecutive points of time, wherein the pair of images includes a previous image and a next image; a registration module connected to the coarse CNN; a refinement CNN connected to the registration module and the coarse CNN; an adder connected to the refinement CNN and the coarse CNN; and a motion-compensated frame interpolation (MC-FI) module connected to the adder and the coarse CNN, wherein the coarse CNN outputs a previous feature map and a next feature map from the previous image and the next image, a coarse interpolated motion vector field (IMVF) and an occlusion map from the pair of images, the registration module uses the coarse IMVF to warp the previous and next feature maps to be aligned with pixel locations of the IMVF frame, and outputs a registered previous feature map and a registered next feature map, the refinement CNN uses the registered previous feature map and a registered next feature map to correct the coarse IMVF, and the adder sums the coarse IMVF with the correction to the IMVF and outputs a final IMVF.
2. The system of claim 1, wherein the motion-compensated frame interpolation (MC-FI) module generates an interpolated frame corresponding to a time between the time points of the previous frame and the next frame by warping the previous image and the next image using the final IMVF and performing a weighted blending of the warped previous and next images using occlusion weights from the occlusion map.
3. The system of claim 1, wherein the coarse CNN receives the pair of images in a plurality of resolution levels, wherein the coarse CNN includes a feature extraction sub-network that generates a pair of feature maps that correspond to each image of the pair of images at each level of resolution, an encoder-decoder sub-network that concatenates the pair of feature maps at each level of resolution into a single feature map and processes the single feature map to produce a new feature map with downscaled spatial resolution, a fusion sub-network that merges the new single feature maps at each level of resolution into a single merged feature map by performing a weighted average of the feature maps for each level of resolution wherein the weights are learned in a training phase and differ for each pixel, and an estimation sub-network that outputs horizontal and vertical components of the coarse IMVF and an occlusion map, and wherein the feature extraction sub-network includes Siamese layers.
4. The system of claim 3, wherein the estimation sub-network includes a horizontal sub-module, a vertical sub-module and an occlusion map sub-module, wherein each sub-module receives the merged feature map output from the fusion sub-network, wherein the horizontal and vertical sub-modules respectively output a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherein each probability value represents a probability for a motion vector to be one of S displacement values for that pixel, wherein the horizontal and vertical sub-modules respectively calculate a first moment of the probability values for each pixel to determine expected horizontal and vertical components for each pixel, wherein the pairs of expected horizontal and vertical components for each pixel comprise the coarse IMVF.
5. The system of claim 4, wherein the occlusion map sub-module outputs the occlusion map, which comprises per-pixel weights for performing a weighted average between the previous image and the next image.
6. The system of claim 3, wherein the refinement CNN includes an encoder-decoder sub-network that concatenates the registered previous feature map and the registered next feature map and outputs a new set of feature maps with spatial resolution resized with respect to a full resolution of the previous image and the next image, and an estimation sub-network that estimates corrections to the horizontal and vertical components of the coarse IMVF for each block in the registered next and previous feature maps to output the corrected IMVF.
7. The system of claim 6, wherein the estimation sub-network includes a horizontal sub-module and a vertical sub-module, wherein the horizontal and vertical sub-modules respectively output a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherein each probability value represents a probability for a motion vector to be one of S displacement values for that pixel, wherein the horizontal and vertical sub-modules respectively calculate a first moment of the probability values for each pixel to determine expected horizontal and vertical components for each pixel, wherein the pairs of expected horizontal and vertical components for each pixel comprise the correction to the IMVF.
8. A method of using convolutional neural networks (CNNs) to synthesize middle non-existing frames from pairs of input frames in a given video, comprising the steps of: receiving a pyramid representation of a pair of consecutive input frames, wherein the pair of consecutive input frames includes a previous image and a next image, wherein the pyramid representation includes a plurality of pairs of input frames, each at a different spatial resolution level; generating a pair of feature maps from each resolution level of the pyramid representation and estimating a coarse interpolated motion vector field (IMVF) and an occlusion map from each pair of feature maps; registering pairs of feature maps at the same resolution level according to the coarse IMVF and the occlusion map by warping each feature map of the pair of feature maps to be aligned with pixel locations of the coarse IMVF and outputting a registered previous feature map and a registered next feature map; correcting the coarse IMVF using the registered previous feature map and the registered next feature map to generate a correction to the IMVF; adding the correction to the IMVF to the coarse IMVF to generate a refined IMVF; and producing a synthesized middle frame from the pair of consecutive input frames, the refined IMVF and the occlusion map.
9. The method of claim 8, wherein generating a pair of feature maps comprises generating a pair of features maps for each of the plurality of pairs of input frames at each spatial resolution, wherein each pair of features maps has a spatial resolution downscaled with respect to a resolution of the pair of input frames; concatenating the feature maps at each resolution level and processing the concatenated feature maps to generate a new set of feature maps with downscaled spatial resolution with respect to a resolution of the pair of consecutive input frames, merging the new set of feature maps for all spatial resolution levels into a single merged feature map by performing a weighted average of the feature maps for each level of resolution wherein the weights are learned in a training phase and differ for each pixel; and estimating for each block in the merged feature map horizontal and vertical components of the coarse IMVF, and an occlusion map, wherein the occlusion map comprises per-pixel weights for performing a weighted average between the previous image and the next image.
10. The method of claim 9, wherein estimating horizontal and vertical components of the coarse IMVF comprises: generating a horizontal probability map and vertical probability map with S probability values per pixel in each probability map, wherei
based on interpolation, e.g. bilinear interpolation (image demosaicing G06T3/4015; edge-driven or edge-based scaling G06T3/403) · CPC title
relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title
Classification techniques · CPC title
using neural networks · CPC title
Coarse or fine approaches, e.g. resolution of ambiguities or multiscale approaches · CPC title