Systems and methods for object tracking

US10558891B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10558891-B2
Application number: US-201815882770-A
Country: US
Kind code: B2
Filing date: Jan 29, 2018
Priority date: Jul 30, 2015
Publication date: Feb 11, 2020
Grant date: Feb 11, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are methods for object tracking. In an example, the method comprises: determining a region of interest (ROI) in a first frame of a video sequence; feeding the determined ROI forward through a first CNN (convolutional neural network) to obtain a plurality of first feature maps in a higher layer of the CNN and a plurality of second feature maps in a lower layer of the first CNN; selecting a plurality of feature maps from the first and second feature maps, respectively; predicting, based on the selected first and second feature maps, two target heat maps indicating a target location for said objects in the current frame, respectively; and estimating, based on the two predicted target heat maps, a final target location for the object in the current frame.
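
The selection step in the abstract can be pictured as a ranking problem: claim 3 below describes ranking the feature maps of each layer independently by a significance value and keeping the top K. The following is a minimal sketch of that ranking only, not the patented implementation. It assumes the significance values are already available as plain numbers; the function name select_top_k and the example scores are hypothetical, and this page does not say how significance is computed.

```python
from typing import List, Sequence


def select_top_k(significance: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k most significant feature maps, highest first."""
    if k < 1:
        raise ValueError("k must be an integer >= 1")
    # Rank channel indices in descending order of significance.
    ranked = sorted(range(len(significance)),
                    key=lambda i: significance[i],
                    reverse=True)
    return ranked[:k]


# Hypothetical significance scores, one per feature map (channel).
higher_layer_scores = [0.1, 0.9, 0.4, 0.7]
lower_layer_scores = [0.3, 0.2, 0.8]

# The two layers are ranked independently of each other.
higher_selected = select_top_k(higher_layer_scores, k=2)
lower_selected = select_top_k(lower_layer_scores, k=2)
print(higher_selected, lower_selected)
```

The returned indices would then be used to pick the corresponding channels from the higher-layer and lower-layer outputs of the first CNN in later frames.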

First claim

Opening claim text (preview).

What is claimed is:

1. A method for object tracking, comprising: determining a region of interest (ROI) in a first frame of a video sequence, wherein the ROI is centered at a ground truth target location for objects to be tracked; feeding the determined ROI forward through a first CNN (convolutional neural network) to obtain a plurality of first feature maps in a higher layer of the CNN and a plurality of second feature maps in a lower layer of the first CNN, wherein the first CNN is pre-trained on an image classification task such that the first feature maps include more semantic features to determine a category for objects to be tracked in the video sequence, while the second feature maps carry more discriminative information to separate the objects from distracters with similar appearance; selecting a plurality of feature maps from the first and second feature maps, respectively; predicting, based on the selected first and second feature maps, two target heat maps indicating a target location for said objects in the first frame, respectively; and estimating, based on the two predicated target heat maps, a final target location for the object in the first frame.

2. The method according to claim 1, wherein the plurality of feature maps from the first and second feature maps are selected by two sel-CNNs which are pre-trained with the first feature maps and the second maps respectively; and the training of the sel-CNNs comprises: initializing the two sel-CNNs with the first feature maps and the second feature maps, respectively, to output a heat map for the objects in each of the sel-CNNs; comparing the heat map with a ground truth heat map for the objects to obtain a prediction error for each of the sel-CNNs; and back-propagating the error through each of the sel-CNNs until the obtained error is less than a threshold.

3. The method according to claim 2, wherein the training further comprises: determining a significance for each of those in the first and second feature maps according to the two trained sel-CNNs; ranking those in the first and second feature maps independently in a descending order according to their significance values; and selecting top ranked K feature maps from both higher and lower layers, wherein K is an integer greater than or equal to 1; wherein at an online tracking stage for following frames, the first and second feature maps are extracted from the higher and lower layers of the first CNN respectively and their corresponding K features maps are selected and serve as said selected first and second feature maps.

4. The method according to claim 1, wherein the predicting comprises: initializing a GNet and a SNet and obtaining target heat maps for the first frame; estimating, by the initialized GNet and SNet, the target heat maps independently for following each frame, wherein the ROI contains both target and background context and is cropped and propagated through the first CNN to obtain the first and second feature maps, and the selected first and second feature maps are propagated through the GNet and the SNet, respectively; and wherein two foreground heat maps are generated by the GNet and the SNet, respectively, and a target localization prediction is performed independently based on the two foreground heat maps.

5. The method according to claim 4, wherein both GNet and SNet are initialized by: feeding the selected first and second feature maps of the first frame through the GNet and SNet respectively to predict two target heat maps; comparing the predicted heat maps with a ground truth heat map to obtain prediction errors; back-propagating the errors through the GNet and SNet until the obtained errors are less than a threshold; and wherein the ground truth heat map is distributed in accordance with a 2-dimensional Gaussian distribution centered at the ground truth target location with variance proportional to a target size of the objects.

6. The method according to claim 5, wherein the estimating further comprises: sampling a set of target candidate regions according to a Gaussian distribution centered at the predicted target location in a last frame of the video sequences; predicting a best target candidate in the first frame based on the target heat map estimated by the GNet, wherein the target confidence of each candidate is computed by a summation of heat map values within each of the candidate regions, and the candidate with the highest confidence is selected as the best target candidate, comparing the heat map values within a background region with those in the best candidate region to detect a distracter; if no distracter is detected, the best target location predicted using the heat map from the GNet is determined as a final target location in the current frames, otherwise, a target localization using the specific heat map from the SNet will be utilized to predict the final target location.

7. The method according to claim 4, wherein each of the GNet and SNet consists of a first convolutional layer and a second convolutional layer nonlinearly connected to the first convolutional layer, wherein the first convolutional layer has kernels of a relatively larger size and the second convolutional layer has kernels of a relatively smaller size.

8. The method according to claim 4, further comprising: updating the SNet with previous tracking location for the objects in an online fashion to adapt to target appearance changes.

9. A non-transitory computer readable storage medium for storing a computer readable instruction, wherein when the instruction is executed, an operation of each step in the method for object tracking according to claim 1 is implemented.

10. A system for object tracking, comprising: a memory that stores executable instructions; and a processor that executes the executable instructions to perform operations of the system, the operations comprising: determining a region of interest (ROI) in a first frame of a video sequence, wherein the ROI is centered at a ground truth target location for objects to be tracked; and feeding the determined ROI forward through a first CNN (convolutional neural network) to obtain a plurality of first feature maps in a higher layer of the CNN and a plurality of second feature maps in a lower layer of the first CNN; predicting, based on the first and the second feature maps, two target heat maps indicating a target location in the first frame, respectively; and estimating a final target location for the ROI in the first frame, based on the two predicated heat maps.

11. The system according to claim 10, wherein the first CNN is pre-trained on an image classification task such that the first feature maps include more semantic features to determine a category for objects to be tracked in the video sequence, while the second feature maps carry more discriminative information to separate the objects from distracters with similar appearance.

12. The system according to claim 11, wherein the predicting of the two target heat maps is based on a plurality of feature maps selected from the first and second feature maps, wherein the plurality of feature maps from the first and second feature maps are selected by two sel-CNNs which are pre-trained with the first feature maps and the second feature maps respectively; and the training of the sel-CNNs comprises: initializing the two sel-CNNs with the first feature maps and the second feature maps, respectively, by inputting the two features maps into the two sel-CNNs respectively to output a heat map for the objects in each of the sel-CNNs; comparing the heat map with a ground truth heat map for the objects to obtain a
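
Claims 5 and 6 describe three pieces of arithmetic that are easier to follow in code: a ground truth heat map shaped as a 2-dimensional Gaussian, a candidate confidence computed by summing heat map values inside each candidate region, and a switch from the GNet heat map to the SNet heat map when a distracter is detected. The sketch below illustrates these under stated assumptions and is not the patented implementation. All function names are hypothetical, the proportionality constant scale and the ratio_threshold distracter rule are assumptions (the claims only say that background values are compared with those in the best candidate region), and candidate regions are passed in directly, not sampled.

```python
import math


def gaussian_heat_map(height, width, center, target_size, scale=0.25):
    """2-D Gaussian peaked at center=(row, col).

    The variance is scale * target size per axis; `scale` is an assumed constant.
    """
    cy, cx = center
    var_y, var_x = scale * target_size[0], scale * target_size[1]
    return [[math.exp(-((y - cy) ** 2 / (2 * var_y) + (x - cx) ** 2 / (2 * var_x)))
             for x in range(width)]
            for y in range(height)]


def region_sum(heat_map, box):
    """Confidence of box=(top, left, height, width): sum of the heat values inside it."""
    top, left, h, w = box
    rows = range(max(top, 0), min(top + h, len(heat_map)))
    cols = range(max(left, 0), min(left + w, len(heat_map[0])))
    return sum(heat_map[y][x] for y in rows for x in cols)


def best_candidate(heat_map, candidates):
    """Pick the candidate region with the highest summed confidence."""
    return max(candidates, key=lambda box: region_sum(heat_map, box))


def has_distracter(heat_map, best_box, ratio_threshold=0.2):
    """Assumed rule: flag a distracter when the heat outside the best box is large
    relative to the heat inside it."""
    inside = region_sum(heat_map, best_box)
    background = sum(sum(row) for row in heat_map) - inside
    return background > ratio_threshold * inside


def locate(gnet_map, snet_map, candidates, ratio_threshold=0.2):
    """Use the GNet heat map unless a distracter is detected, then fall back to SNet."""
    best = best_candidate(gnet_map, candidates)
    if not has_distracter(gnet_map, best, ratio_threshold):
        return best
    return best_candidate(snet_map, candidates)


# Toy example: a 9x9 map, a 3x3 target at the center, and two candidate boxes.
target_map = gaussian_heat_map(9, 9, center=(4, 4), target_size=(3, 3))
candidates = [(3, 3, 3, 3), (0, 0, 3, 3)]
print(locate(target_map, target_map, candidates))
```

In this toy case the heat is concentrated in one place, so the GNet-based candidate is returned directly. If the first map also contained a second strong peak away from the best candidate, the background sum would grow and the function would defer to the second map.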

Assignees

Beijing Sensetime Tech Development Co Ltd

Inventors

Classifications

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • G06T7/246 (Primary): using feature-based methods, e.g. the tracking of corners or segments · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Human being; Person · CPC title

  • involving reference images or patches · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10558891B2 cover?
Disclosed are methods for object tracking. In an example, the method comprises: determining a region of interest (ROI) in a first frame of a video sequence; feeding the determined ROI forward through a first CNN (convolutional neural network) to obtain a plurality of first feature maps in a higher layer of the CNN and a plurality of second feature maps in a lower layer of the first CNN; selecting a p…
Who is the assignee on this patent?
Beijing Sensetime Tech Development Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T7/246. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 11, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).