Frame interpolation via adaptive convolution and adaptive separable convolution
US-2020012940-A1 · Jan 9, 2020 · US
US12288346B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12288346-B2 |
| Application number | US-202017422464-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 14, 2020 |
| Priority date | Jan 15, 2019 |
| Publication date | Apr 29, 2025 |
| Grant date | Apr 29, 2025 |
Official abstract text for this publication.
Methods, systems, and storage media are described for motion estimation in video frame interpolation. Disclosed embodiments use feature pyramids as image representations for motion estimation and seamlessly integrate them into a deep neural network for frame interpolation. A feature pyramid is extracted for each of two input frames. These feature pyramids are warped together with the input frames to the target temporal position according to the inter-frame motion estimated via optical flow. A frame synthesis network is used to predict interpolation results from the pre-warped feature pyramids and input frames. The feature pyramid extractor and the frame synthesis network are jointly trained for the task of frame interpolation. An extensive quantitative and qualitative evaluation demonstrates that the described embodiments utilizing feature pyramids enable robust, high-quality video frame interpolation. Other embodiments may be described and/or claimed.
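The abstract's first two steps (one feature pyramid per frame, then motion toward a target temporal position) can be sketched in a few lines of NumPy. This is a minimal illustration, not the disclosed implementation: the function names `downsample2x`, `feature_pyramid`, and `flows_to_target` are hypothetical, the pyramid uses plain 2x2 average pooling where the disclosure describes a trained extractor, and the flow scaling assumes linear motion between the two frames.

```python
import numpy as np

def downsample2x(x):
    """Halve height and width by 2x2 average pooling (x has shape H x W x C)."""
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    x = x[:h2 * 2, :w2 * 2]  # drop an odd trailing row/column, if any
    return x.reshape(h2, 2, w2, 2, c).mean(axis=(1, 3))

def feature_pyramid(frame, levels=3):
    """Return one feature map per resolution, finest first.

    Assumption: each level is just the pooled frame. The disclosed
    extractor is a trained network, which this sketch does not model.
    """
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def flows_to_target(flow_fwd, flow_bwd, t):
    """Scale full-interval flows so they reach temporal position t in (0, 1).

    Assumption: motion is linear over the interval, so the first frame
    travels a fraction t of the forward flow and the second frame
    travels a fraction (1 - t) of the backward flow.
    """
    return t * flow_fwd, (1.0 - t) * flow_bwd
```

For an 8x8 RGB frame and `levels=3`, the pyramid holds maps of 8x8, 4x4, and 2x2 pixels. The same function would be applied to both input frames, in line with the requirement that the extractor use the same configuration for each.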
Opening claim text (preview).
The invention claimed is:
1. An apparatus configured to operate a frame interpolation neural network (FINN), the apparatus comprising: optical flow estimation (OFE) circuitry configured to estimate a forward optical flow and a backward optical flow from a first input frame and a second input frame of a video; feature pyramid extraction (FPE) circuitry configured to extract a first feature pyramid from the first input frame and a second feature pyramid from the second input frame; warping circuitry configured to warp the first feature pyramid and the first input frame to a target temporal position between the first and second input frames using the forward optical flow, and warp the second feature pyramid and the second input frame to the target temporal position using the backward optical flow; and frame synthesis neural network (FSN) circuitry configured to generate an interpolated output frame at the target temporal position guided by the warped first and second feature pyramids and the warped first and second input frames.
2. The apparatus of claim 1, wherein the FPE circuitry is further configured to apply a same configuration to the first and second input frames to extract the first and second feature pyramids, respectively.
3. The apparatus of claim 1, wherein: the first feature pyramid includes a first set of features extracted from the first input frame at each resolution of a plurality of resolutions; the second feature pyramid includes a second set of features extracted from the second input frame at each resolution of the plurality of resolutions; and at least some features in the first set of features and at least some features in the second set of features are based on a color space of the first and second input frames.
4. The apparatus of claim 1, wherein the interpolated output frame includes pixels of the first and the second input frames shifted from the first and second input frames, respectively, to replicate motion to take place from the first input frame to the target temporal position and from the target temporal position to the second input frame.
5. The apparatus of claim 1, wherein the FPE circuitry is further configured to: generate the first and second input frames at each of a plurality of resolutions based on features extracted from the first and second input frames.
6. The apparatus of claim 1, wherein, to extract the first and second feature pyramids, the FPE circuitry is further configured to: read a number of input features from the first and second input frames at each resolution of a plurality of resolutions; and produce a number of output features from the number of input features for each of the first and second input frames.
7. The apparatus of claim 6, wherein the FPE circuitry comprises: convolutional circuitry interleaved with activation function circuitry and configured to convolve one or both of the first and second input frames at each resolution of a plurality of resolutions to extract a set of features from the first and second input frames at each resolution of the plurality of resolutions.
8. The apparatus of claim 3, wherein the FPE circuitry is further configured to: use the interpolated output frame to extract new feature pyramids from respective input frames, the new feature pyramids including a set of features different than the features of the first and second feature pyramids.
9. The apparatus of claim 1, wherein the FSN circuitry comprises a grid of processing blocks, wherein each row in the grid of processing blocks corresponds to a resolution of a set of resolutions of the first and second feature pyramids.
10. The apparatus of claim 9, wherein a first processing block in each row is configured to receive a warped set of features at the corresponding resolution in the first and second feature pyramids.
11. The apparatus of claim 1, wherein the OFE circuitry, the FPE circuitry, the FSN circuitry, and the warping circuitry FW circuitry are coupled to one another via an interconnect technology, and implemented as: respective dies of a System-in-Package (SiP) or Multi-Chip Package (MCP); respective execution units or processor cores of a general purpose processor; or respective digital signal processors (DSPs), field-programmable gate arrays (FPGAs), Application Specific Integrated Circuits (ASICs), programmable logic devices (PLDs), System-on-Chips (SoCs), Graphics Processing Units (GPUs), SiPs, MCPs, or any combination of DSPs, FPGAs, ASICs, PLDs, SoCs, GPUs, SiPs, and MCPs.
12. One or more non-transitory computer-readable media (NTCRM) comprising instructions of a frame interpolation neural network (FINN), wherein execution of the instructions by one or more processors is to cause the one or more processors to: obtain a first input frame and a second input frame of a video; estimate a forward optical flow and a backward optical flow from the first and second input frames, the forward optical flow indicating how pixels in the first input frame are to be changed to produce the second input frame during a time period starting from the first input frame and ending at the second input frame, and the backward optical flow indicating how pixels in the second input frame are to be changed to produce the first input frame during a time period starting from the first input frame and ending at the second input frame; extract a first feature pyramid from the first input frame and a second feature pyramid from the second input frame, the first feature pyramid including a first set of features extracted from the first input frame at each resolution of a plurality of resolutions, and the second feature pyramid including a second set of features extracted from the second input frame at each resolution of the plurality of resolutions; warp the first feature pyramid and the first input frame toward a target temporal position between the first and second input frames using the forward optical flow; warp the second feature pyramid and the second input frame toward the target temporal position using the backward optical flow; and generate an output frame at the target temporal position based on the warped first and second feature pyramids and the warped first and second input frames.
13. The one or more NTCRM of claim 12, wherein the first and second sets of features are based on a color space of the first and second input frames, respectively.
14. The one or more NTCRM of claim 12, wherein execution of the instructions is to further cause the one or more processors to: read a number of input features from the first and second input frames at each resolution; and generate a number of output features from the number of input features at each resolution, wherein the output features at each resolution represent different octaves of the input features and vary in number.
15. The one or more NTCRM of claim 14, wherein the FINN comprises a plurality of convolutional functions interleaved with a plurality of activation functions, and execution of the instructions is to cause the one or more processors to: operate the convolutional functions to convolve the first and second input frames at each resolution; and operate the activation functions to extract individual features from the convolved first and second input frames.
16. The one or more NTCRM of claim 12, wherein the FINN includes a frame synthesis neural network comprising a grid of processing blocks, wherein each row in the grid of processing blocks corresponds to a resolution of the plurality of resolutions, and execution of the instructions is to cause the one or more
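The warping step recited in claims 1 and 12 moves each frame, and each level of its feature pyramid, toward the target temporal position along the estimated flow. The sketch below shows one simple way this could look in NumPy, under stated assumptions: `forward_warp` and `flow_for_level` are hypothetical names, the warp is a nearest-pixel splat that averages collisions (the claims do not specify the warping operator), and the per-level flow is obtained by plain subsampling with frame dimensions assumed divisible by `2 ** level`.

```python
import numpy as np

def forward_warp(src, flow):
    """Splat every source pixel to its flow-displaced position.

    src:  H x W x C array (a frame or one pyramid level).
    flow: H x W x 2 array of (dx, dy) displacements in pixels.
    Targets are rounded to the nearest pixel; pixels that land on the
    same target are averaged, and targets that receive nothing stay zero.
    """
    h, w, c = src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    acc = np.zeros((h, w, c), dtype=float)  # summed contributions
    cnt = np.zeros((h, w, 1), dtype=float)  # number of contributions
    np.add.at(acc, (ty[inside], tx[inside]), src[inside])
    np.add.at(cnt, (ty[inside], tx[inside]), 1.0)
    return acc / np.maximum(cnt, 1.0)

def flow_for_level(flow, level):
    """Adapt a full-resolution flow field to a coarser pyramid level.

    Assumption: each level halves the resolution, so the field is
    subsampled by 2**level and its vectors shrink by the same factor.
    """
    s = 2 ** level
    return flow[::s, ::s] / s
```

With a target position of t = 0.5 and a forward flow of 2 pixels to the right, a pixel in the first frame lands 1 pixel to the right in the warped result. The second frame would be handled the same way using the backward flow scaled by (1 - t), and both warped sets would then feed the synthesis network.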
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Artificial neural networks [ANN] · CPC title
Training; Learning · CPC title
Processor architectures; Processor configuration, e.g. pipelining · CPC title