System and method for forecasting location of target in monocular first person view

US11893751B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11893751-B2
Application number  US-202117405060-A
Country             US
Kind code           B2
Filing date         Aug 18, 2021
Priority date       Sep 9, 2020
Publication date    Feb 6, 2024
Grant date          Feb 6, 2024

Abstract

Official abstract text for this publication.

This disclosure relates generally to system and method for forecasting location of target in monocular first person view. Conventional systems for location forecasting utilizes complex neural networks and hence are computationally intensive and requires high compute power. The disclosed system includes an efficient and light-weight RNN based network model for predicting motion of targets in first person monocular videos. The network model includes an auto-encoder in the encoding phase and a regularizing layer in the end helps us get better accuracy. The disclosed method relies entirely just on detection bounding boxes for prediction as well as training of the network model and is still capable of transferring zero-shot on a different dataset.
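As an illustration of what "relies entirely on detection bounding boxes" can look like in practice, the sketch below converts a sequence of past boxes into per-frame vectors holding the centroid, width, height, centroid velocity, and change in width and height, which are the eight components the first claim lists. The corner format `(x1, y1, x2, y2)`, the zero velocity assigned to the first frame, and the name `box_features` are assumptions made for this example and are not taken from the patent.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: the detector reports boxes as (x1, y1, x2, y2) corners in pixels.
Box = Tuple[float, float, float, float]


def box_features(boxes: Sequence[Box]) -> List[List[float]]:
    """Return one row [cx, cy, w, h, dcx, dcy, dw, dh] per past bounding box."""
    rows: List[List[float]] = []
    prev: Optional[Tuple[float, float, float, float]] = None
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # centroid
        w, h = x2 - x1, y2 - y1                    # width and height
        if prev is None:
            # Assumption: no earlier frame exists, so all changes start at zero.
            deltas = [0.0, 0.0, 0.0, 0.0]
        else:
            # Frame-to-frame change of centroid (velocity) and of box size.
            deltas = [cx - prev[0], cy - prev[1], w - prev[2], h - prev[3]]
        rows.append([cx, cy, w, h] + deltas)
        prev = (cx, cy, w, h)
    return rows


# Two hypothetical detections of the same tracked target.
past = [(10.0, 20.0, 30.0, 60.0), (12.0, 20.0, 34.0, 62.0)]
features = box_features(past)
```

Each row has eight entries, so k past frames give a k × 8 input, matching the k × 8 normalizer that appears in the auto-encoder loss of claim 2.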

First claim

Opening claim text (preview).

What is claimed is:

1. A processor implemented method for forecasting location of a target in monocular first person view, the method comprising:
  receiving, via one or more hardware processors, a sequence of past bounding boxes, wherein each bounding box of the sequence of past bounding boxes enclosing the target in a frame from amongst a sequence of frames, wherein the sequence of frames associated with a set of past locations of the target;
  predicting, via the one or more hardware processors, in real-time, a sequence of future bounding boxes corresponding to future locations of the target based on the sequence of past bounding boxes using a network model, wherein the network model comprising an encoder block having a first Long Short Term Memory (LSTM) and a second LSTM operating collectively as an auto-encoder, a decoder block comprising a third LSTM and a trajectory concatenation layer, wherein predicting the sequence of future bounding boxes comprises:
    determining, by the encoder block, a representation vector of a predetermined size based on a bounding box information associated with the sequence of past bounding boxes, the bounding box information indicative of a history associated with the set of past locations of the target, wherein the bounding box information comprises a set of vectors associated with the sequence of past bounding boxes, a vector from amongst the set of vectors associated with a bounding box from amongst the sequence of past bounding boxes comprising a centroid, width, height, velocity of the centroid, and a change in the width and the height of the bounding box, wherein determining the representation vector by the encoder block comprises:
      generating, by the first LSTM, a final hidden state vector which summarizes a complete sequence of bounding box information;
      mapping, by a fully connected layer associated with the encoder block, the final hidden state vector to the representation vector of predetermined length via a ReLU;
      generating, by the second LSTM, a set of hidden state vectors in a plurality of iterations, wherein for each iteration, the second LSTM takes the representation vector as input; and
      passing through a fully connected layer associated with the second LSTM, the set of hidden state vectors generated in each iteration of the plurality of iterations;
    predicting, by the decoder block, a future velocity and change in dimension of future bounding boxes of the target based on the representation vector, wherein predicting the future velocity and the change in dimension of future bounding boxes by the decoder block comprises:
      receiving the vector representation from the encoder block;
      generating, by the third LSTM, a set of hidden state vectors in a plurality of iterations, wherein generating a hidden state vector for a current iteration of the plurality of iterations by the third LSTM comprises:
        taking, as input, the representation vector and hidden state vectors associated with iterations previous to the current iteration in the plurality of iterations;
        generating a hidden state vector from amongst the set of hidden state vectors based on the input; and
        mapping the hidden state vector to a vector of four dimensions indicative of velocity and dimension change components via a ReLU followed by a fully connected layer; and
    converting, by a trajectory concatenation layer, the future velocities and change in dimensions of future bounding boxes into the sequence of future bounding box of the target, wherein converting future velocities and change in dimensions of the future bounding boxes comprises converting the predicted future velocities of the centroids and the change in the dimension into a sequence of locations and dimension of the sequence of the future bounding boxes using the past bounding box locations.

2. The method of claim 1, further comprises determining an objective function indicative of minimization of error in reconstructing an input sequence of the bounding box information in reverse order, the objective function represented as:

$$\mathcal{L}_{\text{auto-enc}} = \frac{\sum_{i=k-f}^{f} \left| \hat{I} \ominus I \right|}{k \times 8}$$

where, $\ominus$ represents element-wise vector subtraction operation, and $I$ is the input sequence.

3. The method of claim 1, further comprising applying supervision on each bounding box of the sequence of future bounding boxes in a predicted sequence of future frames based on a supervision objective function:

$$\mathcal{L}_{\text{traj}} = \frac{\sum_{i=f+1}^{p} \left| \hat{O} \ominus O \right|}{p \times 4}$$

where, $O \in \mathbb{R}^{p \times 4}$ is a ground truth centroid (cx, cy) and dimension (w, h) of the bounding box in the predicted sequence of p future frames.

4. The method of claim 3 further comprising training the network model by minimizing an objective function:

$$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{auto-enc}} + \beta \cdot \mathcal{L}_{\text{traj}}$$

where, $\alpha \in \mathbb{R}^{+}$ and $\beta \in \mathbb{R}^{+}$ are hyper-parameters to determine the importance of a corresponding loss term.

5. The method of claim 1, further comprising:
  receiving a video sequence of a scene comprising a target, the video sequence comprising a set of frames corresponding to the set of past locations of the target, the video sequence captured by a monocular camera in a first person view; and
  determining the sequence of past bounding boxes, each bounding box of the sequence of past bounding boxes associated with a tracking ID.

6. A system for forecasting location of
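The trajectory concatenation step of claim 1 and the supervision objective of claim 3 can be illustrated with a minimal sketch. It assumes that "converting" predicted velocities and dimension changes into boxes means accumulating them onto the last observed box, and that the vertical bars in the objective denote a sum of absolute element-wise differences. Both readings, along with the names `concat_trajectory` and `traj_loss` and the sample numbers, are assumptions for illustration rather than the patented implementation.

```python
from typing import List, Sequence


def concat_trajectory(last_box: Sequence[float],
                      deltas: Sequence[Sequence[float]]) -> List[List[float]]:
    """Accumulate predicted changes onto the last observed box.

    last_box: [cx, cy, w, h] of the final past frame.
    deltas:   p rows of [dcx, dcy, dw, dh] predicted by the decoder.
    Returns p future boxes as [cx, cy, w, h].
    """
    future: List[List[float]] = []
    current = list(last_box)
    for step in deltas:
        # Each future box is the previous box plus the predicted change.
        current = [value + change for value, change in zip(current, step)]
        future.append(current)
    return future


def traj_loss(predicted: Sequence[Sequence[float]],
              truth: Sequence[Sequence[float]]) -> float:
    """Sum of absolute element-wise errors, normalized by p * 4."""
    p = len(truth)
    total = sum(abs(a - b)
                for pred_row, true_row in zip(predicted, truth)
                for a, b in zip(pred_row, true_row))
    return total / (p * 4)


# Hypothetical values: last observed box and two predicted change vectors.
last_box = [23.0, 41.0, 22.0, 42.0]
deltas = [[3.0, 1.0, 2.0, 2.0], [3.0, 1.0, 0.0, 0.0]]
future = concat_trajectory(last_box, deltas)

# Hypothetical ground truth for the same two future frames.
truth = [[26.0, 42.0, 24.0, 44.0], [30.0, 43.0, 24.0, 46.0]]
loss = traj_loss(future, truth)
```

Because the supervision is applied to the reconstructed boxes instead of the raw velocities, an error in an early predicted step carries into every later box and is penalized at each of them.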

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

  • G06T7/215 (Primary)

    Motion-based segmentation · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • using feature-based methods, e.g. the tracking of corners or segments · CPC title

  • Video; Image sequence · CPC title

  • Artificial neural networks [ANN] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11893751B2 cover?
This disclosure relates generally to system and method for forecasting location of target in monocular first person view. Conventional systems for location forecasting utilizes complex neural networks and hence are computationally intensive and requires high compute power. The disclosed system includes an efficient and light-weight RNN based network model for predicting motion of targets in fir…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification G06T7/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 6, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).