Method and apparatus with adaptive object tracking
US-2022138493-A1 · May 5, 2022 · US
US12430776B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12430776-B2 |
| Application number | US-202318355725-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 20, 2023 |
| Priority date | Jul 29, 2022 |
| Publication date | Sep 30, 2025 |
| Grant date | Sep 30, 2025 |
Official abstract text for this publication.
A method and apparatus with object tracking is provided. The method includes generating a mixed filter by fusing a short-term filter with a long-term filter; and performing object tracking on a current frame image based on the mixed filter. The short-term filter is dependent on a prediction of the current frame image in a video sequence, and the long-term filter is a previously generated long-term filter or is generated by optimizing the previously generated long-term filter based on an object template feature pool.
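As a rough illustration of the fusing step described in the abstract, the sketch below combines two same-shaped filter kernels with a pair of mixture weights. The abstract does not say how the fusion is computed; the element-wise weighted sum, the `Filter` type alias, and the function name `fuse_filters` are all assumptions made for this example.

```python
from typing import List

# Hypothetical representation: a 2-D filter kernel stored as nested lists.
Filter = List[List[float]]


def fuse_filters(short_term: Filter, long_term: Filter,
                 w_short: float, w_long: float) -> Filter:
    """Return w_short * short_term + w_long * long_term, element by element.

    Assumption (not stated in the source): fusion is a weighted sum of
    two kernels of identical shape, with weights that sum to 1.
    """
    same_rows = len(short_term) == len(long_term)
    same_cols = all(len(a) == len(b) for a, b in zip(short_term, long_term))
    if not (same_rows and same_cols):
        raise ValueError("filters must have the same shape")
    if abs(w_short + w_long - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return [
        [w_short * s + w_long * l for s, l in zip(row_s, row_l)]
        for row_s, row_l in zip(short_term, long_term)
    ]


# Usage: a 1x2 short-term kernel mixed with a 1x2 long-term kernel.
mixed = fuse_filters([[1.0, 2.0]], [[3.0, 4.0]], 0.25, 0.75)
```

With weights of 1 and 0 the result is simply the short-term filter, which matches the case in which only one filter is trusted.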
Opening claim text (preview).
What is claimed is:

1. A processor-implemented method, comprising: generating a mixed filter by fusing a short-term filter with a long-term filter; and performing object tracking on a current frame image based on the mixed filter, wherein the short-term filter is dependent on a prediction of the current frame image in a video sequence, and the long-term filter is a previously generated long-term filter or is generated by optimizing the previously generated long-term filter based on an object template feature pool.

2. The method of claim 1, further comprising, prior to the generating of the mixed filter, predicting the short-term filter based on a first frame image of the video sequence, the current frame image and an auxiliary frame image of the video sequence, wherein the auxiliary frame image is an image frame that has a determined greater tracking success confidence than a first threshold value and is closest to the current frame image in time sequence.

3. The method of claim 2, wherein the predicting of the short-term filter comprises: extracting features, through a feature extraction network, for a first search region from the first frame image, an auxiliary search region from the auxiliary frame image, and a current search region from the current frame image, and extracting a first deep feature of the first search region, an auxiliary deep feature of the auxiliary search region, and a current deep feature of the current search region; generating an object state encoding vector by performing object state encoding on the first deep feature, a first bounding box of the first frame image with respect to the object, the auxiliary deep feature, and an auxiliary bounding box of the auxiliary frame image with respect to the object; obtaining a current frame encoding vector by performing encoding on the current deep feature; generating a hidden feature using a trained transformer model provided an input based on the object state encoding vector and the current frame encoding vector; and generating the short-term filter by linearly transforming the hidden feature, wherein the first search region is determined according to the first bounding box, the auxiliary search region is determined according to the auxiliary bounding box, and the current search region is determined according to a predicted bounding box of a predicted object based on N number of frame images prior to the current frame image, wherein N is an integer greater than or equal to 1.

4. The method of claim 1, further comprising, prior to the generating of the mixed filter, in response to the current frame image being determined to be an image frame at a predetermined position in the video sequence, generating the long-term filter by optimizing the previously obtained long-term filter based on the object template feature pool; or in response to the current frame image being determined to not be an image frame at the predetermined position in the video sequence, generating the previously obtained long-term filter as the long-term filter.

5. The method of claim 1, wherein the optimizing of the previously obtained long-term filter comprises: extracting a predetermined number of deep features and bounding boxes of the object corresponding to respective ones of accumulated deep features from the object template feature pool and determining the extracted deep features and bounding boxes to be a filter training set; and training and/or optimizing, based on the filter training set, the previously obtained long-term filter through a filter optimization algorithm.

6. The method of claim 1, wherein the generating of the mixed filter by fusing the short-term filter with the long-term filter comprises: generating a short-term object positioning response map and a long-term object positioning response map by respectively performing correlation processing on the current frame image using the short-term filter and the long-term filter; and generating the mixed filter by fusing the short-term filter with the long-term filter according to the short-term object positioning response map and the long-term object positioning response map.

7. The method of claim 6, wherein the generating of the mixed filter further comprises: evaluating short-term map quality of the short-term object positioning response map, and long-term map quality of the long-term object positioning response map; determining a mixture weight of the short-term filter and a mixture weight of the long-term filter according to a result of comparing a second predetermined threshold value to the short-term map quality and the long-term map quality; and generating the mixed filter by fusing the short-term filter with the long-term filter according to the mixture weight of the short-term filter and the mixture weight of the long-term filter.

8. The method of claim 7, wherein the determining of the mixture weight of the short-term filter and the mixture weight of the long-term filter comprises: in response to the short-term map quality being determined greater than or equal to the second predetermined threshold value and the long-term map quality is less than the second predetermined threshold value, setting the mixture weight of the short-term filter as 1 and the mixture weight of the long-term filter as 0; in response to the short-term map quality being determined less than the second predetermined threshold value and the long-term map quality is greater than or equal to the second predetermined threshold value, setting the mixture weight of the short-term filter as 0 and the mixture weight of the long-term filter as 1; in response to both the mixture weights of the short-term filter and the long-term map being determined to have respective qualities that are less than the second predetermined threshold value, setting each of the mixture weights as a weight value corresponding to a previously obtained mixed filter; or in response to both the mixture weights of the short-term filter and the long-term map being determined to have respective qualities that are greater than or equal to the second predetermined threshold value, setting each of the mixture weights as a mixture weight of a normalized output of a Softmax activation function of the short-term map quality and the long-term map quality.

9. The method of claim 6, wherein the generating of the mixed filter further comprises: generating a mixture weight of the short-term filter and a mixture weight of the long-term filter by using a convolutional neural network and a normalization function, according to the short-term object positioning response map and the long-term object positioning response map; and generating the mixed filter by fusing the short-term filter with the long-term filter according to the mixture weight of the short-term filter and the mixture weight of the long-term filter.

10. The method of claim 9, wherein the generating of the mixture weight of the short-term filter and the mixture weight of the long-term filter further comprises: generating a mixed response map by mixing and processing the short-term object positioning response map and the long-term object positioning response map; extracting a feature from the mixed response map using the convolutional neural network, and generating a mixture weight vector by linearly transforming the extracted feature using a linear transformation layer; and generating the mixture weight of the short-term filter and the mixture weight of the long-term filter by normalizing the mixture weight vector according to a Softmax activation function.

11. The method of claim 1, wherein the performing of the object tracking further comprises: g
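The four-way threshold rule recited in claim 8 can be sketched as a small function. This is an illustrative reading only: the claims preview does not define how map quality is measured, so the sketch assumes each quality is a single scalar score, and the names `mixture_weights`, `q_short`, `q_long`, `threshold`, and `previous` are hypothetical.

```python
import math
from typing import Tuple


def mixture_weights(q_short: float, q_long: float, threshold: float,
                    previous: Tuple[float, float]) -> Tuple[float, float]:
    """Return (w_short, w_long) following the four cases of claim 8.

    Assumptions: map quality is a scalar score (its definition is not
    given in the preview), and `previous` holds the weights used for the
    previously obtained mixed filter.
    """
    short_ok = q_short >= threshold
    long_ok = q_long >= threshold

    if short_ok and not long_ok:
        return (1.0, 0.0)   # trust only the short-term filter
    if long_ok and not short_ok:
        return (0.0, 1.0)   # trust only the long-term filter
    if not short_ok and not long_ok:
        return previous     # both maps are poor: reuse the earlier weights

    # Both maps pass the threshold: softmax over the two quality scores.
    # Subtracting the maximum keeps exp() numerically stable.
    m = max(q_short, q_long)
    e_short = math.exp(q_short - m)
    e_long = math.exp(q_long - m)
    total = e_short + e_long
    return (e_short / total, e_long / total)


# Usage: both maps pass a threshold of 0.5, so the weights come from softmax.
w_short, w_long = mixture_weights(0.9, 0.6, 0.5, previous=(0.5, 0.5))
```

In the softmax case the higher-quality map receives the larger weight, and the two weights always sum to 1.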
Video; Image sequence · CPC title
Training; Learning · CPC title
Artificial neural networks [ANN] · CPC title
Probabilistic image processing · CPC title
Motion-based segmentation · CPC title