System and method for unsupervised scene decomposition using spatio-temporal iterative inference

US2021374416A1 · US · A1

Patent metadata
Publication number: US-2021374416-A1
Application number: US-202117336898-A
Country: US
Kind code: A1
Filing date: Jun 2, 2021
Priority date: Jun 2, 2020
Publication date: Dec 2, 2021
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for unsupervised multi-object scene decomposition that involve a spatio-temporal amortized inference model for multi-object video decomposition. Systems and methods involve a new spatio-temporal iterative inference framework to jointly model complex multi-object representations and the explicit temporal dependencies between the frames. Those dependencies improve overall quality of decomposition, encode information about object dynamics and can be used to predict future trajectories of each object separately. Additionally, the model can generate precise estimations and output data even without color information. The model has scene decomposition, segmentation and future prediction capabilities. The processor can use the model to simulate future frames of the scene data.
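As a rough illustration of what "spatio-temporal iterative inference" can look like, claim 2 of this publication describes a grid of cells indexed by refinement step r and time step t, where each cell receives a previous refinement hidden state, a temporal hidden state, and posterior parameters, and emits a new hidden state and new posterior parameters. The sketch below only mirrors that data flow. The names `toy_cell` and `run_grid`, the dimensions, the random weights, and the tanh update are all assumptions made for illustration; the claimed cell uses a spatial broadcast decoder, a multilayer perceptron, and a 2D long short term memory unit, none of which are reproduced here.

```python
import numpy as np

def toy_cell(h_refine, h_time, lam, weights):
    """Hypothetical stand-in for one grid cell (NOT the patent's 2D LSTM).

    Inputs: hidden state from the previous refinement, hidden state from
    the previous time step, and current posterior parameters.
    Outputs: a new hidden state and updated posterior parameters.
    """
    x = np.concatenate([h_refine, h_time, lam])
    h_new = np.tanh(weights["h"] @ x)
    lam_new = lam + weights["lam"] @ h_new  # additive update (assumption)
    return h_new, lam_new

def run_grid(num_refine, num_steps, hidden_dim=4, param_dim=6, seed=0):
    """Visit every cell (r, t) so that both of its neighbours are ready."""
    rng = np.random.default_rng(seed)
    in_dim = 2 * hidden_dim + param_dim
    weights = {
        "h": 0.1 * rng.standard_normal((hidden_dim, in_dim)),
        "lam": 0.1 * rng.standard_normal((param_dim, hidden_dim)),
    }
    zero_h = np.zeros(hidden_dim)
    lam_init = rng.standard_normal(param_dim)  # initial posterior guess (assumption)

    hidden, params = {}, {}
    for t in range(num_steps):
        for r in range(num_refine):
            h_refine = hidden.get((r - 1, t), zero_h)  # refinement axis
            h_time = hidden.get((r, t - 1), zero_h)    # time axis
            lam_prev = params.get((r - 1, t), lam_init)
            hidden[(r, t)], params[(r, t)] = toy_cell(
                h_refine, h_time, lam_prev, weights
            )
    return hidden, params

if __name__ == "__main__":
    hidden, params = run_grid(num_refine=3, num_steps=4)
    print(len(hidden), params[(2, 3)].shape)
```

The point of the sketch is the dependency pattern: cell (r, t) depends on (r - 1, t) and (r, t - 1), so information flows along both the refinement axis and the time axis. How the initial posterior parameters for a new frame are chosen is not specified in the text shown on this page; the fixed random initial guess above is a placeholder.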

First claim

Opening claim text (preview).

What is claimed is:

1. A system for unsupervised multi-object video decomposition comprising: a memory unit storing scene data and a spatio-temporal amortized inference model for unsupervised video decomposition; and a hardware processor that accesses the memory to process the scene data using the spatio-temporal amortized inference model to generate scene decomposition data.

2. The system of claim 1 wherein the spatio-temporal amortized inference model comprises instructions for refinement steps and time steps and a grid of cells, the cells having a first set of cells and a second set of cells, wherein each cell (r, t) of the first set of cells corresponds an r-th refinement at time t, wherein each cell of the second set of cells corresponds to a final construction with no refinement needed, wherein each cell of the first set of cells receives as input a previous refinement hidden state, a temporal hidden state, and posterior parameters, and generates as output a new hidden state and new posterior parameters.

3. The system of claim 2 wherein each cell of the first set of cells comprises a spatial broadcast decoder, a multilayer perceptron and a 2D long short term memory unit.

4. The system of claim 1 wherein the processor decomposes a video sequence into slot sequences and appearance sequences and introduces temporal dependencies into a sequence of posterior refinements for use during decoding with a generative model.

5. The system of claim 1 wherein the processor generates scene decomposition data comprising a graph or grid with a time dimension and a refinement dimension for the scene data using the spatio-temporal amortized inference model and a 2D long short term memory unit to capture a joint probability over a video sequence of the scene data.

6. The system of claim 1 wherein the spatio-temporal amortized inference model jointly models multi object representations and temporal dependencies between latent variables across frames of the scene data.

7. The system of claim 1 wherein the process uses scene decomposition data to encode information about objects' dynamics, and predict trajectories of each object separately.

8. The system of claim 1 wherein the scene decomposition data provides multi-object representations to decompose a scene into a collection of objects with individual representations, where in each object is represented by a latent vector capturing the object's unique appearance and encoding visual properties comprising color, shape, position, and size, wherein a broadcast decoder generates pixelwise pairs corresponding to an assignment probability and appearance of a pixel for the object, wherein the processor induces a generative image formation model to construct image pixels.

9. The system of claim 1 wherein the processor uses the spatio-temporal amortized inference model by starting with estimated parameters for an approximate posterior and update the estimated parameters by a series of refinement operations, wherein each refinement operation samples a latent representation and uses an approximate posterior gradient to compute a new parameter estimate using a sequence of convolutional layers and a long short term memory unit that receives as input a hidden state from a previous refinement operation.

10. The system of claim 1 wherein the processor generates variational estimates from previous refinement steps and temporal information from previous frames of the scene data.

11. The system of claim 1 wherein the processor trains the model using a variational objective having a first term for a reconstruction error of a single frame and a second term for a divergence between a variational posterior and a prior, wherein a relative weight between both terms is controlled with a hyperparameter.

12. The system of claim 1 wherein the processor decomposes a static scene into multiple objects and represents each object by a latent vector capturing the object's unique appearance to encode visual properties, wherein, for each latent vector, a broadcast decoder generates pixelwise pairs of assignment probability and appearance of a pixel for an object, wherein the pixelwise pairs induce a generative image formation model, wherein original image pixels can be reconstructed from a probabilistic representation of the image formation model.

13. The system of claim 1 wherein the processor generates a parameter estimate for an approximate posterior and updates the parameter estimate over a series of refinement steps, wherein each refinement step samples a latent representation from the approximate posterior to evaluate an ELBO and uses gradients for the approximate posterior to compute the updated parameter estimate.

14. The system of claim 1 wherein the processor generates a parameter estimate, using a function of a sequence of convolutional layers and an long short term memory unit, wherein the long short term memory unit takes as input a hidden state from a previous refinement step.

15. The system of claim 1 wherein the scene data comprises disentangled, spatially granular representations of objects and wherein the processor generates, for the objects, scene inference data, segmentation data, and prediction data by processing the scene data.

16. The system of claim 1 wherein the scene data comprises complex visual scenes consisting of multiple moving object instances, wherein the processor uses the spatio-temporal amortized inference model to decouple object appearance and shape.

17. The system of claim 1 wherein the scene data comprises complex video data depicting multiple objects, wherein the processor uses the spatio-temporal amortized inference model to generate, for each of the multiple objects, object inference data, object segmentation data, and object prediction data.

18. The system of claim 1, wherein the scene decomposition data comprises scene inference data, segmentation data, and prediction data for objects of the scene data.

19. The system of claim 1, wherein the spatio-temporal amortized inference model captures refinement of an object over time.

20. The system of claim 1, wherein the spatio-temporal amortized inference model captures temporal dependencies between latent variables of the scene data across time.

21. The system of claim 1, wherein the scene data comprises video data, wherein the spatio-temporal amortized inference model captures temporal dependencies among frames in the video data.

22. The system of claim 1, wherein the spatio-temporal amortized inference model comprises a conditional prior for variational inference.

23. The system of claim 1, wherein the scene decomposition data comprises segmentation data defining segmentation of objects within the scene data, and wherein the processor infers the segmentation data of objects using interpretable latent representations to decompose each frame of the scene data and simulate future dynamics using an unsupervised process.

24. The system of claim 1, wherein the spatio-temporal amortized inference model uses unsupervised learning for multi-object scene decomposition to learn probabilistic dynamics of each object from complex raw video data by introducing temporal dependencies between the random latent variables at each frame.

25. The system of claim 1, wherein the memory stores the additional entropy prior and the processor accesses the memory to process t
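Claims 8 and 12 describe per-object pixelwise pairs of assignment probability and appearance that together form an image, and claim 11 describes a training objective with a reconstruction term and a divergence term balanced by a hyperparameter. A minimal sketch of both ideas follows. Everything concrete in it is an assumption for illustration: the function names, the softmax normalisation across objects, the squared-error reconstruction term, the standard normal prior (claim 22 mentions a conditional prior instead), and the name `beta` for the weighting hyperparameter.

```python
import numpy as np

def compose_image(mask_logits, appearances):
    """Combine K per-object predictions into one image.

    mask_logits: (K, H, W) unnormalised assignment scores per object.
    appearances: (K, H, W, C) per-object pixel appearance.
    Returns the composed image (H, W, C) and the masks (K, H, W).
    """
    # Softmax over the object axis so each pixel's masks sum to 1.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(shifted)
    masks /= masks.sum(axis=0, keepdims=True)
    image = (masks[..., None] * appearances).sum(axis=0)
    return image, masks

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), standard closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def objective(frame, image, mu, log_var, beta=1.0):
    """Reconstruction error for one frame plus beta-weighted divergence."""
    recon = np.sum((frame - image) ** 2)  # squared error (assumption)
    return recon + beta * gaussian_kl(mu, log_var)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.standard_normal((3, 5, 5))   # 3 objects, 5x5 pixels
    appearance = rng.random((3, 5, 5, 3))     # RGB appearance per object
    image, masks = compose_image(logits, appearance)
    frame = rng.random((5, 5, 3))             # stand-in for a video frame
    loss = objective(frame, image, np.zeros(4), np.zeros(4), beta=0.5)
    print(image.shape, masks.sum(axis=0).round(6).max(), loss >= 0.0)
```

Taking the argmax of `masks` over the object axis gives a per-pixel object label, which is one simple way to read a segmentation out of such a decomposition.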

Assignees

Royal Bank Of Canada

Inventors

Classifications

  • based on distances to training or reference patterns

  • Combinations of networks

  • Probabilistic graphical models, e.g. probabilistic networks

  • Probabilistic or stochastic networks

  • Recurrent networks, e.g. Hopfield networks

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021374416A1 cover?
Systems and methods for unsupervised multi-object scene decomposition that involve a spatio-temporal amortized inference model for multi-object video decomposition. Systems and methods involve a new spatio-temporal iterative inference framework to jointly model complex multi-object representations and the explicit temporal dependencies between the frames. Those dependencies improve overall qual…
Who is the assignee on this patent?
Royal Bank Of Canada
What technology area does this patent fall under?
Primary CPC classification G06V20/41. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 2, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).