US2021019627A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021019627-A1 |
| Application number | US-202017063997-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 6, 2020 |
| Priority date | Sep 14, 2018 |
| Publication date | Jan 21, 2021 |
| Grant date | — |
Abstract
Embodiments of this application disclose a target tracking method performed at an electronic device. The electronic device obtains a first video stream and detects candidate regions within a current video frame of the first video stream. It then extracts a deep feature for each candidate region and calculates, for each candidate region, a feature similarity between that region's deep feature and a deep feature of a target detected in a previous video frame. Finally, the electronic device determines, based on the feature similarities of the candidate regions, that the target is detected in the current video frame. Because target detection is performed within video frames by using a target detection model, and target tracking is based on deep features, cases such as tracking drift or target loss can be effectively prevented, which helps ensure the accuracy of target tracking.
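The abstract does not say how the feature similarity is computed. The sketch below assumes cosine similarity between feature vectors and picks the single most similar candidate, which is the selection rule of claim 2. The `Candidate` structure, the `match_target` and `cosine_similarity` helpers, and the toy 3-D feature vectors are all hypothetical; in the described method the features would come from the feature extraction model, which is not shown here.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Candidate:
    """A candidate region from the detector (hypothetical structure)."""
    box: Tuple[int, int, int, int]  # (x, y, w, h) in pixels
    feature: List[float]            # deep feature from the feature extraction model


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_target(candidates: List[Candidate],
                 target_feature: Sequence[float]) -> Optional[Candidate]:
    """Pick the candidate most similar to the target's previous-frame feature
    (the highest-similarity rule of claim 2); None if there are no candidates."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: cosine_similarity(c.feature, target_feature))


# Toy example: features are made-up 3-D vectors, not real model outputs.
prev_feature = [1.0, 0.0, 0.5]
cands = [Candidate((10, 20, 30, 60), [0.0, 1.0, 0.0]),
         Candidate((50, 22, 28, 58), [0.9, 0.1, 0.45])]
best = match_target(cands, prev_feature)
print(best.box)  # (50, 22, 28, 58)
```

Cosine similarity is a common choice for comparing embedding vectors because it ignores overall feature magnitude; a Euclidean distance on normalized features would rank candidates the same way.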
Claims (preview, truncated)
What is claimed is:

1. A target tracking method, performed by an electronic device having a processor and memory connected to the processor and storing processor-executable instructions, the method comprising: obtaining, by the electronic device, a first video stream; detecting, by the electronic device, according to a target detection model and within a current video frame in the first video stream, candidate regions in the current video frame; extracting, by the electronic device, according to a feature extraction model and from the candidate regions, deep features corresponding to the candidate regions, the feature extraction model being an end-to-end neural network model that uses an image as an input and uses a deep feature of a movable body in the image as an output; calculating, by the electronic device, feature similarities between the deep features corresponding to the candidate regions in the current video frame and deep features corresponding to a target detected in a previous video frame; and determining, by the electronic device, that the target is detected in the current video frame according to the feature similarities.

2. The method according to claim 1, wherein the determining, by the electronic device that the target is detected in the current video frame according to the feature similarities comprises: selecting a candidate region having a highest feature similarity among the feature similarities corresponding to the candidate regions respectively, as a target region of the current video frame; and determining that the target is detected within the target region in the current video frame.

3. The method according to claim 1, wherein the determining, by the electronic device that the target is detected in the current video frame according to the feature similarities comprises: selecting a plurality of candidate regions whose associated feature similarities exceed a threshold according to the feature similarities corresponding to the candidate regions respectively, each of the plurality of candidate regions having an associated move direction; selecting, from the plurality of candidate regions, a candidate region whose associated motion direction most matches a motion direction of the target detected in the previous video frame, as a target region in the current video frame; and determining that the target is detected within the target region in the current video frame.

4. The method according to claim 1, wherein the determining, by the electronic device that the target is detected in the current video frame according to the feature similarities comprises: selecting a plurality of candidate regions whose associated feature similarities exceed a threshold according to the feature similarities corresponding to the candidate regions respectively, each of the plurality of candidate regions having an associated distance from a physical location of the target detected in the previous video frame; selecting, from the plurality of candidate regions, a candidate region having a smallest distance from the physical location of the target detected in the previous video frame, as a target region in the current video frame; and determining that the target is detected within the target region in the current video frame.

5. The method according to claim 1, wherein the determining, by the electronic device that the target is detected in the current video frame according to the feature similarities comprises: selecting a plurality of candidate regions whose associated feature similarities exceed a threshold according to the feature similarities corresponding to the candidate regions respectively, each of the plurality of candidate regions having an associated physical location and a move direction; selecting, from the plurality of candidate regions, a candidate region whose associated physical location and move direction most match a physical location and a motion direction of the target detected in the previous video frame respectively, as a target region of the current video frame; and determining that the target is detected within the target region in the current video frame.

6. The method according to claim 1, wherein the first video stream is shot by a first camera, and the method further comprises: after the tracking target disappears from the first camera: selecting, by the electronic device, a second camera from cameras adjacent to the first camera according to a region in which the target is last detected in the first video stream, the second camera configured to perform cross-screen target tracking; and obtaining, by the electronic device, a second video stream captured by the second camera, and tracking the target in the second video stream.

7. The method according to claim 1, wherein a physical location of a candidate region within the current video frame is determined by a location coordinate mapping model.

8. The method according to claim 7, wherein the location coordinate mapping model is generated by: calculating, by using a perspective transformation formula, a coordinate mapping matrix according to location coordinates of at least four location points on a preset calibration image and physical location coordinates of the at least four location points on a physical world ground; and generating the location coordinate mapping model according to the coordinate mapping matrix.

9. The method according to claim 1, further comprising: drawing, by the electronic device, a motion trajectory of the target on a map according to a physical location of the target detected in the first video stream.

10. The method according to claim 1, wherein the target detection model comprises: a basic network and an auxiliary network, wherein the basic network uses a lightweight convolutional neural network mobilenet, and the auxiliary network uses a detection layer formed by a convolution kernel, an input of the auxiliary network being a feature map outputted by different convolutional layers of the basic network.

11. The method according to claim 1, further comprising: obtaining, by the electronic device, an image sample, the image sample comprising a human body image and an image tag; and constructing, by the electronic device, a deep convolutional neural network initial model, and training the deep convolutional neural network initial model by using the image sample, to obtain a deep convolutional neural network model meeting a training ending condition as the feature extraction model.

12. The method according to claim 1, wherein the current video frame is a video frame located after a first video frame in the first video stream, and the deep feature of the target detected in the previous video frame is a deep feature of a target that is tracked in the first video frame and that is extracted by using the feature extraction model.

13. An electronic device, comprising a processor, memory connected to the processor, and processor-executable instructions stored in the memory that, when executed by the processor, cause the electronic device to perform a plurality of operations including: obtaining a first video stream; detecting according to a target detection model and within a current video frame in the first video stream, candidate regions in the current video frame; extracting according to a feature extraction model and from the candidate regions, deep features corresponding to the candidate regions, the feature extraction model being an end-to-end neural network model that uses an image as an input and uses a deep feature of a movable body in the image as an output; calculating f
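Claims 4, 7 and 8 fit together as a small pipeline: compute a coordinate mapping matrix from four calibration points by perspective transformation, map each candidate region onto the ground plane, then, among candidates whose similarity exceeds a threshold, choose the one nearest the target's previous physical location. The minimal pure-Python sketch below rests on assumptions the claims do not state: the matrix is solved as a 3x3 homography with its last entry fixed to 1, each box is mapped through its bottom-centre point, and distance is Euclidean on the ground plane. All function names, the calibration coordinates and the 0.8 threshold are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate calibration points (e.g. three collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_4_points(img_pts: Sequence[Point],
                             world_pts: Sequence[Point]) -> List[List[float]]:
    """3x3 mapping matrix H with H[2][2] = 1 so that world ~ H @ (x, y, 1).

    Each point pair gives two linear equations in the 8 unknowns, derived from
    u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1) and the analogous v equation.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(img_pts, world_pts):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def to_world(hmat: List[List[float]], p: Point) -> Point:
    """Apply the perspective mapping to an image point (x, y)."""
    x, y = p
    w = hmat[2][0] * x + hmat[2][1] * y + hmat[2][2]
    return ((hmat[0][0] * x + hmat[0][1] * y + hmat[0][2]) / w,
            (hmat[1][0] * x + hmat[1][1] * y + hmat[1][2]) / w)


def foot_point(box: Tuple[float, float, float, float]) -> Point:
    """Bottom-centre of an (x, y, w, h) box, assumed to be where the body meets the ground."""
    x, y, w, h = box
    return (x + w / 2, y + h)


def select_nearest(candidates: List[Tuple[float, Point]], prev_location: Point,
                   threshold: float) -> Optional[Tuple[float, Point]]:
    """Keep (similarity, ground_point) pairs whose similarity exceeds the threshold,
    then return the one closest to the target's previous ground location."""
    kept = [c for c in candidates if c[0] > threshold]
    if not kept:
        return None
    return min(kept, key=lambda c: math.dist(c[1], prev_location))


# Hypothetical calibration: pixel corners of a marked square and its ground size in metres.
H = homography_from_4_points([(0, 0), (100, 0), (100, 100), (0, 100)],
                             [(0, 0), (5, 0), (5, 5), (0, 5)])
cands = [(0.90, to_world(H, foot_point((30, 40, 20, 60)))),  # ground ~ (2.0, 5.0)
         (0.95, to_world(H, foot_point((80, 0, 20, 20)))),   # ground ~ (4.5, 1.0)
         (0.40, to_world(H, foot_point((45, 50, 10, 50))))]  # ground ~ (2.5, 5.0), low similarity
best = select_nearest(cands, prev_location=(2.5, 5.0), threshold=0.8)
print(best)  # the 0.90 candidate: the 0.40 one is nearer but below the threshold
```

Fixing the last matrix entry to 1 leaves eight unknowns, so four point pairs (two equations each) determine the matrix exactly, which lines up with the "at least four location points" of claim 8. With more than four points, one would typically solve an over-determined least-squares version of the same system instead.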
CPC classifications:
- Matching criteria, e.g. proximity measures
- using feature-based methods, e.g. the tracking of corners or segments
- Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- Convolutional networks [CNN, ConvNet]
- Supervised learning