High fidelity interactive segmentation for video data with deep convolutional tessellations and context aware skip connections
US-2020160528-A1 · May 21, 2020 · US
US11354906B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11354906-B2 |
| Application number | US-202016846544-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 13, 2020 |
| Priority date | Apr 13, 2020 |
| Publication date | Jun 7, 2022 |
| Grant date | Jun 7, 2022 |
Official abstract text for this publication.
A Video Semantic Segmentation System (VSSS) is disclosed that performs accurate and fast semantic segmentation of videos using a set of temporally distributed neural networks. The VSSS receives as input a video signal comprising a contiguous sequence of temporally-related video frames. The VSSS extracts features from the video frames in the contiguous sequence and based upon the extracted features, selects, from a set of labels, a label to be associated with each pixel of each video frame in the video signal. In certain embodiments, a set of multiple neural networks are used to extract the features to be used for video segmentation and the extraction of features is distributed among the multiple neural networks in the set. A strong feature representation representing the entirety of the features is produced for each video frame in the sequence of video frames by aggregating the output features extracted by the multiple neural networks.
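The distribution of feature extraction described in the abstract can be illustrated with a small sketch. It is not the patented implementation: `make_sub_network` and `full_features` are hypothetical names, each "sub-network" is a toy scaling function rather than a trained model, and plain per-pixel concatenation stands in for the aggregation step. The sketch assumes frame t is handled by sub-network t mod m, so any window of m consecutive frames covers all m feature groups.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Frame = List[float]         # toy frame: a flat list of pixel intensities
FeatureGroup = List[float]  # toy feature group: one value per pixel


def make_sub_network(scale: float) -> Callable[[Frame], FeatureGroup]:
    """Hypothetical stand-in for one trained sub-network (assumption)."""
    def extract(frame: Frame) -> FeatureGroup:
        return [scale * pixel for pixel in frame]
    return extract


def full_features(
    frames: List[Frame],
    sub_networks: List[Callable[[Frame], FeatureGroup]],
) -> List[List[List[float]]]:
    """Build a per-pixel feature vector for every frame in the stream.

    Each frame is passed through only one sub-network; the groups from the
    most recent m frames are then combined. Concatenation is an assumed
    placeholder for the aggregation step.
    """
    m = len(sub_networks)
    window: Deque[Tuple[int, FeatureGroup]] = deque(maxlen=m)
    results = []
    for t, frame in enumerate(frames):
        k = t % m  # assumed round-robin assignment of sub-networks
        window.append((k, sub_networks[k](frame)))
        # Order groups by sub-network index so the layout is stable over time.
        ordered = [group for _, group in sorted(window, key=lambda kg: kg[0])]
        # Concatenate the available groups pixel by pixel.
        results.append([list(pixel) for pixel in zip(*ordered)])
    return results


subs = [make_sub_network(s) for s in (1.0, 2.0, 3.0, 4.0)]
out = full_features([[1.0, 2.0]] * 5, subs)
print(out[3])  # [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
```

In this sketch the first m - 1 frames receive shorter feature vectors because the window is not yet full. A real system would need a warm-up policy, which the abstract does not specify.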
Opening claim text (preview).
The invention claimed is:
1. A method comprising: extracting, from each video frame in a contiguous sequence of video frames, a group of features using one of a plurality of sub-neural networks, the contiguous sequence of video frames comprising a current video frame and one or more additional video frames occurring in the contiguous sequence prior to the current video frame, wherein the group of features extracted from the current video frame is different from another group of features extracted from the one or more additional video frames in the contiguous sequence of video frames; generating a full feature representation for the current video frame by combining the groups of features extracted from the contiguous sequence of video frames, wherein generating the full feature representation for the current video frame comprises: generating, for each video frame in the one or more additional video frames, an affinity value between pixels of the video frame in the one or more additional video frames and the current video frame; and generating the full feature representation for the current video frame based on the affinity value and the groups of features extracted from the contiguous sequence of video frames; segmenting the current video frame based upon the full feature representation to generate a segmentation result, the segmentation result comprising information identifying, for a pixel in the current video frame, a label selected for the pixel based upon the full feature representation, wherein the label is selected from a plurality of labels; and outputting the segmentation result.
2. The method of claim 1, wherein the groups of features, extracted from the video frames in the contiguous sequence of video frames, together represent a total set of features used for segmenting the current video frame.
3. The method of claim 1, wherein the plurality of sub-neural networks comprises a first sub-neural network and a second sub-neural network, the first sub-neural network trained to extract a first group of features from a first video frame in the contiguous sequence of video frames, the second sub-neural network trained to extract a second group of features from a second video frame in the contiguous sequence of video frames, wherein the first video frame is different from the second video frame and the first group of features is different from the second group of features.
4. The method of claim 1, wherein extracting, from each video frame in the contiguous sequence of video frames, a group of features using a different one of the plurality of sub-neural networks comprises: generating at least one of a value feature map, a query map, or a key map, wherein the value feature map comprises features extracted by a sub-neural network of the plurality of sub-neural networks from the video frame, and the query map and the key map comprise information related to correlations between pixels across the video frames or across adjacent video frames in the contiguous sequence.
5. The method of claim 1, wherein generating the full feature representation for the current video frame further comprises computing a correlation between pixels of a first video frame in the contiguous sequence and a second video frame in the contiguous sequence, where the first video frame is adjacent to the second video frame in the contiguous sequence and occurs before the second video frame in the contiguous sequence.
6. The method of claim 5, wherein generating the full feature representation for the current video frame further comprises: (a) comparing the first video frame in the contiguous sequence with the second video frame in the contiguous sequence by computing an attention value between the pixels of the first video frame and the pixels of the second video frame, wherein the attention value measures the correlation between the pixels of the first video frame and the pixels of the second video frame; (b) obtaining a value feature map of the first video frame and a value feature map of the second video frame; and (c) updating the value feature map of the second video frame based on the attention value, the value feature map of the first video frame and the value feature map of the second video frame.
7. The method of claim 1, further comprising: (a) comparing a first video frame in the contiguous sequence with a second video frame in the contiguous sequence by computing an attention value between pixels of the first video frame and pixels of the second video frame, wherein the attention value measures a correlation between the pixels of the first video frame and the pixels of the second video frame; (b) obtaining a value feature map of the first video frame and a value feature map of the second video frame; (c) updating the value feature map of the second video frame based on the attention value, the value feature map of the first video frame and the value feature map of the second video frame; (d) updating the contiguous sequence of video frames by removing the first video frame from the contiguous sequence of video frames; and repeating (a), (b), (c) and (d) until only the current video frame is left in the contiguous sequence of video frames.
8. The method of claim 7, further comprising: determining that only the current video frame is left in the contiguous sequence of video frames; and based on the determining, outputting the value feature map for the current video frame, wherein the value feature map represents the full feature representation for the current video frame.
9. The method of claim 1, wherein the segmentation result comprises an image of the current video frame, wherein each pixel in the image of the current video frame is colored using a color corresponding to the label associated with the pixel.
10. The method of claim 1, a feature space representing a plurality of features to be used for segmenting video frames in the contiguous sequence of video frames is divided into a number of groups of features, wherein a number of sub-neural networks in the plurality of sub-neural networks is equal to a number of the groups of features.
11. The method of claim 10, wherein the number of groups of features is four.
12. The method of claim 1, wherein a number of layers in each sub-neural network from the plurality of sub-neural networks is the same.
13. The method of claim 12, wherein: a number of layers in each sub-neural network from the plurality of sub-neural networks is the same; and a number of nodes in each sub-neural network from the plurality of sub-neural networks is the same.
14. A system comprising: a memory storing segmented video frames corresponding to a video signal; and one or more processors configured to perform processing comprising: extracting, from each video frame in a contiguous sequence of video frames, a group of features using one of a plurality of sub-neural networks, the contiguous sequence of video frames comprising a current video frame and one or more additional video frames occurring in the contiguous sequence prior to the current video frame, and wherein the group of features extracted from the current video frame is different from another group of features extracted from the one or more additional video frames in the contiguous sequence of video frames; generating a full feature representation for the current video frame by combining the groups of features extracted from the contiguous sequence of video frames, wherein generating the full feature representation for the current video frame comprises: generating, for each video frame in the one or more additional video frames, an affinity valu
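The frame-by-frame propagation recited in claims 5 through 8 can be read as a loop over the sequence, sketched below under stated assumptions. The claims say only that the second frame's value feature map is updated "based on" the attention value and both value feature maps. The softmax over query-key dot products and the additive update used here are assumptions chosen for illustration, as are the names `FrameMaps`, `attention`, and `propagate`.

```python
import math
from typing import List, NamedTuple

Matrix = List[List[float]]  # one row per pixel, one column per channel


class FrameMaps(NamedTuple):
    """Hypothetical container for the per-frame maps named in claim 4."""
    query: Matrix
    key: Matrix
    value: Matrix


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(later: FrameMaps, earlier: FrameMaps) -> Matrix:
    """Row i: weights that pixel i of the later frame assigns to each pixel
    of the earlier frame (assumed form: softmax of query-key dot products)."""
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row))
                 for k_row in earlier.key])
        for q_row in later.query
    ]


def propagate(sequence: List[FrameMaps]) -> Matrix:
    """Repeat steps (a) to (d) until only the current frame remains."""
    frames = list(sequence)
    while len(frames) > 1:
        first, second = frames[0], frames[1]
        weights = attention(second, first)             # step (a)
        updated = [                                    # steps (b) and (c)
            [v + sum(w * f_row[c] for w, f_row in zip(w_row, first.value))
             for c, v in enumerate(v_row)]             # assumed additive update
            for w_row, v_row in zip(weights, second.value)
        ]
        frames[1] = second._replace(value=updated)
        frames.pop(0)                                  # step (d)
    return frames[0].value  # claim 8: value map of the current frame


zero_q = [[0.0], [0.0]]  # zero queries give uniform attention weights
keys = [[1.0], [5.0]]
seq = [
    FrameMaps(zero_q, keys, [[2.0], [4.0]]),
    FrameMaps(zero_q, keys, [[1.0], [1.0]]),
]
print(propagate(seq))  # [[4.0], [4.0]]
```

With zero queries every earlier pixel receives weight 0.5, so each pixel of the later frame gains 0.5 * 2.0 + 0.5 * 4.0 = 3.0 on top of its own value of 1.0.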
of extracted features · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title
Region-based segmentation · CPC title