Deep learning for dense semantic segmentation in video with automated interactivity and improved temporal coherence
US-11676278-B2 · Jun 13, 2023 · US
US11810359B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11810359-B2 |
| Application number | US-202117557933-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 21, 2021 |
| Priority date | Jan 6, 2021 |
| Publication date | Nov 7, 2023 |
| Grant date | Nov 7, 2023 |
Official abstract text for this publication.
The present invention belongs to the technical field of computer vision, and provides a video semantic segmentation method based on active learning, comprising an image semantic segmentation module, a data selection module based on the active learning and a label propagation module. The image semantic segmentation module is responsible for segmenting image results and extracting high-level features required by the data selection module; the data selection module selects a data subset with rich information at an image level, and selects pixel blocks to be labeled at a pixel level; and the label propagation module realizes migration from image to video tasks and completes the segmentation result of a video quickly to obtain weakly-supervised data. The present invention can rapidly generate weakly-supervised data sets, reduce the cost of manufacture of the data and optimize the performance of a semantic segmentation network.
Opening claim text (preview).
The invention claimed is:

1. A video semantic segmentation method based on active learning, comprising an image semantic segmentation module, a data selection module based on the active learning and a label propagation module; wherein the image semantic segmentation module is responsible for segmenting image results and extracting high-level features required by the data selection module based on active learning; the data selection module based on active learning selects a data subset with rich information at an image level, and selects pixel blocks to be labeled at a pixel level; the label propagation module realizes migration from image to video tasks and completes the segmentation result of a video quickly to obtain weakly-supervised data;

(1) Image Semantic Segmentation Module

the image semantic segmentation module is composed of an improved fully convolutional network; a backbone network architecture adopts the MobileNetV2 structure to extract the features of RGB images; after obtaining high-level feature information, a decoder converts the number of feature channels into the number of categories to achieve the effect of pixel classification; and finally, a semantic label image with classification information of the same size as the RGB images is obtained by upsampling;

(1.1) Input of the Image Semantic Segmentation Module: a semantic segmentation network has no size limit on the input RGB images, and a selection strategy at the pixel level needs to fix the size of the images, so the input training data is resized; the input training data is divided into two parts: one part comprises the RGB images denoted as $x$, and the other part comprises corresponding semantic labels denoted as $y$; the input data is adjusted in the following way:

$$X = B(x) \tag{1}$$

$$Y = N(y) \tag{2}$$

wherein $B(x)$ represents that the RGB images are processed by bilinear interpolation, and $N(y)$ represents that the semantic labels are processed by nearest neighbor interpolation;

(1.2) Feature Extraction Encoder Module: the RGB images are fed into the semantic segmentation network; firstly, the number of channels is converted from 3 to 32 through an initial convolution layer, whose feature is denoted as $F_{init}$; then, a high-level feature with length and width of 16 and 32 is obtained by seven residual convolutions; bottleneck residual blocks of MobileNetV2 are used, and the final number of channels is 320; therefore, the dimension of the high-level feature (HLF) is 16×32×320; the sum of the input and the features that pass through the first 3 bottleneck residual blocks is used as a low-level feature (LLF); LLF is expressed as:

$$\mathrm{LLF} = [F_{init},\ \mathrm{BN\_1}(x),\ \mathrm{BN\_2}(x),\ \mathrm{BN\_3}(x)] \tag{3}$$

wherein $\mathrm{BN\_1}(x)$, $\mathrm{BN\_2}(x)$ and $\mathrm{BN\_3}(x)$ represent the features that pass through the first 3 residual blocks respectively; $[\ ]$ is the concatenation operation;

(1.3) Decoder Module: the above high-level feature HLF is sampled by atrous convolution with different sampling rates through an atrous spatial pyramid pooling (ASPP) module; the sampled feature is fused with the low-level feature LLF and input into the decoder module for decoding the number of channels, and finally the channel size of the corresponding object category number in the image is obtained; the whole process is described as follows:

$$F_{decode} = \mathrm{DEC}(F_{ASPP}, \mathrm{LLF}) \tag{4}$$

where $F_{ASPP}$ is the associative feature output by the ASPP; DEC represents the decoder module designed by the method; $F_{ASPP}$ passes through the convolution layer to match its dimension to that of the feature in the LLF; the two are concatenated along the channel dimension and pass through a deconvolution layer to obtain $F_{decode}$; $F_{decode}$ is then input into a bilinear upsampling layer, so that the feature is converted to the same size as the original RGB image; each pixel on the image corresponds to a predicted category result $F_{class}$;

(2) Data Selection Module Based on the Active Learning

(2.1) Image-Level Data Selection Module: after the RGB image passes through the image semantic segmentation module, a final predicted result $F_{class}$ is obtained, and a middle feature $F_{decode}$ extracted from an encoder module by the method is used as the input of the image-level data selection module; $F_{decode}$ is input into a designed matcher rating network; firstly, a convolution kernel is used as the input feature for the dimension reduction operation of a global pooling layer over the last two dimensions to obtain a vector $V_{class}$ with the same size as the number of categories; $V_{class}$ is fed into three fully connected layers, and the number of channels is decreased successively from the number of categories to 16, 8 and 1 to finally obtain a value $S$; the closer $S$ is to 0, the better the performance of the selected image in the image semantic segmentation module is; otherwise, the effect is worse; the loss of the image semantic segmentation network in a training process is calculated with a cross entropy function, expressed as formula (5):

$$L_{seg} = -\sum_{c=1}^{M} y_c \log(p_c) \tag{5}$$

wherein $M$ represents the number of categories; $y_c$ represents category judgment of variables, which is 1 for the same categories and 0 for different categories; $p_c$ represents a predicted probability that an observed sample belongs to category $c$; after $V_{class}$ is obtained by the data selection module based on the active learning, the MSE loss function of the following formula (6) is designed to improve the performance of the selection module:

$$L_{pre} = (L_{seg} - V_{class})^2 \tag{6}$$

wherein $L_{seg}$ is the loss obtained during the training of the image semantic segmentation module, and $V_{class}$ is a value obtained by the selection module; the gap between the two is reduced by constant iterative optimization of an optimizer to achieve the purpose of selection and optimization of the selection module; the overall loss function is expressed by formula (7):

$$L_{total} = L_{seg} + \lambda L_{pre} \tag{7}$$

wherein $\lambda$ is a hyperparameter used to control the proportion of $L_{pre}$ in the whole loss, and the value of $\lambda$ ranges from 0 to 1; after the training, the parameters are fixed and predictions are made on unlabeled data, and each image obtains a corresponding $L_{pre}$; the $L_{pre}$ values are sorted to select the first $N$ images with maximum values as data subsets to be labeled in the next round;

(2.2) Pixel-Level Data Selection Module: after passing the image-level data selection module, some data subsets to be labeled are selected; the selected data subsets are fed in to obtain the distribution of information entropy on each image; the information entropy is calculated by vote entropy, which is improved on the basis of formula (5) and expressed as follows:

$$S_{ve} = \frac{1}{D} \sum_{d=1}^{D} L_{seg} \tag{8}$$

wherein $D$ represents the number of votes and $D$ is set as 20; then, a pixel window of 16*16
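The resizing step (1.1) of the claim preview uses bilinear interpolation for images and nearest neighbor interpolation for labels. A minimal pure-Python sketch of why the two differ is shown below. The function names, the single-channel image, and the align-corners sampling convention are assumptions made for illustration; the claim does not specify them.

```python
def resize_nearest(label, out_h, out_w):
    """Nearest-neighbor resize N(y): each output value is copied from one input pixel."""
    in_h, in_w = len(label), len(label[0])
    return [
        [label[min(in_h - 1, i * in_h // out_h)][min(in_w - 1, j * in_w // out_w)]
         for j in range(out_w)]
        for i in range(out_h)
    ]


def resize_bilinear(image, out_h, out_w):
    """Bilinear resize B(x) for one channel (align-corners convention, an assumption)."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Map the output row back to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four surrounding input pixels.
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Hypothetical 2x2 label map containing only class IDs 0 and 5.
labels = [[0, 5], [5, 0]]
resized_labels = resize_nearest(labels, 4, 4)  # still only 0 and 5
blended = resize_bilinear(labels, 4, 4)        # contains in-between values
```

Nearest neighbor resizing keeps every label a valid class ID, whereas bilinear blending of the same map produces fractional values that correspond to no category, which is a plausible reason for treating images and labels differently.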
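Formulas (5) to (7) and the top-N selection rule in (2.1) can be sketched for a single sample as follows. This is an illustrative sketch only: the function names, the example probabilities, the predicted value 0.3, and the `hypothetical_scores` dictionary are assumptions, and the per-image score is simply taken as given rather than produced by a trained network.

```python
import math


def cross_entropy(one_hot, probs):
    """Formula (5): L_seg = -sum_c y_c * log(p_c) for one sample."""
    # Terms with y_c == 0 contribute nothing, so they are skipped.
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)


def loss_prediction(l_seg, predicted):
    """Formula (6): squared gap between the real loss and the selector's value."""
    return (l_seg - predicted) ** 2


def total_loss(l_seg, l_pre, lam=0.5):
    """Formula (7): L_total = L_seg + lambda * L_pre, with lambda in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda is expected to lie in [0, 1]")
    return l_seg + lam * l_pre


def select_top_n(scores, n):
    """Return the IDs of the n images with the largest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# One sample whose true class is index 1 (values are made up for illustration).
l_seg = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
l_pre = loss_prediction(l_seg, predicted=0.3)
l_total = total_loss(l_seg, l_pre, lam=0.5)

# Hypothetical per-image scores on unlabeled data; the two largest are selected.
hypothetical_scores = {"frame_a": 0.1, "frame_b": 0.9, "frame_c": 0.4}
to_label = select_top_n(hypothetical_scores, n=2)  # ["frame_b", "frame_c"]
```

Ranking by the largest score follows the claim's wording that the first N images with maximum values are labeled in the next round.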
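Formula (8) averages a per-vote quantity over D votes. The sketch below implements only that averaging, for a single value and for a per-pixel map. How each per-vote value is produced is not shown in this preview, so the inputs here are hypothetical numbers, and the example uses fewer votes than the D = 20 stated in the claim to stay short.

```python
def vote_entropy(per_vote_losses):
    """Formula (8): S_ve = (1/D) * sum over d of the d-th vote's L_seg."""
    d = len(per_vote_losses)
    if d == 0:
        raise ValueError("at least one vote is required")
    return sum(per_vote_losses) / d


def vote_entropy_map(per_vote_maps):
    """Apply formula (8) independently at every pixel of D equally sized maps."""
    height, width = len(per_vote_maps[0]), len(per_vote_maps[0][0])
    return [
        [vote_entropy([m[i][j] for m in per_vote_maps]) for j in range(width)]
        for i in range(height)
    ]


# Hypothetical per-vote values for one pixel (D = 4 here for brevity).
score = vote_entropy([1.0, 2.0, 3.0, 2.0])

# Two hypothetical 1x2 per-vote maps averaged pixel by pixel.
score_map = vote_entropy_map([[[0.0, 1.0]], [[1.0, 3.0]]])
```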
Supervised learning · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Active learning · CPC title