Unified referring video object segmentation network
US-2021383171-A1 · Dec 9, 2021 · US
US11568543B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11568543-B2 |
| Application number | US-202117197908-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 10, 2021 |
| Priority date | Mar 10, 2021 |
| Publication date | Jan 31, 2023 |
| Grant date | Jan 31, 2023 |
Official abstract text for this publication.
A device configured for more efficiently processing video images within a set of video image data to detect objects is described herein. The device may include a processor configured to execute a neural network such as a convolutional neural network. The device can receive video image data from a plurality of cameras, such as stationary cameras. The device can acquire a set of sample images from a stationary camera and submit them to a specialized neural network for processing to generate an attention mask. The attention mask can be generated from a variety of methods and is applied to each of the subsequently acquired images from the camera to narrow down areas where the convolutional neural network should process data. The application of attention masks to images within video image data creates masked images that can be processed to detect objects with much greater accuracy and fewer computational resources required.
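The masking step described in the abstract can be illustrated with a minimal sketch. This is not the patent's implementation: the function name `apply_attention_mask`, the grayscale list-of-rows image format, and the `fill` value are all assumptions chosen for illustration. The sketch only shows the general idea of keeping pixels inside the mask and blanking everything else before detection runs.

```python
from typing import List

# Assumed representation: a grayscale frame as rows of pixel intensities.
Image = List[List[int]]


def apply_attention_mask(frame: Image, mask: Image, fill: int = 0) -> Image:
    """Keep pixels where the mask is 1 and replace all others with `fill`.

    Hypothetical helper: the patent does not specify this function.
    """
    same_height = len(frame) == len(mask)
    same_widths = all(len(f) == len(m) for f, m in zip(frame, mask))
    if not (same_height and same_widths):
        raise ValueError("frame and mask must have the same dimensions")
    return [
        [pixel if keep else fill for pixel, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


# A 2x3 frame and a mask that keeps only the upper-right region.
frame = [[10, 20, 30], [40, 50, 60]]
mask = [[0, 1, 1], [0, 0, 1]]
masked = apply_attention_mask(frame, mask)
```

In this sketch the same mask would be reused for every subsequent frame from one stationary camera, which is where any computational saving would come from.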
Opening claim text (preview).
What is claimed is:
1. A device comprising: a processor configured to process video images for object detection by executing a convolutional neural network, the processor being further configured to: receive video image data comprising a series of images for processing; and use a pre-generated attention mask to indicate where processing should occur within the series of images, wherein: the pre-generated attention mask is generated based on a training set of video image data that is processed by a convolutional neural network specialized to output detected object data; and the video image data is pre-processed with the pre-generated attention mask to generate a series of pre-processed masked images by applying the pre-generated attention mask to the video image data; and wherein the neural network is configured to: process the series of pre-processed masked images within areas indicated by the pre-generated attention mask; and generate an output for the series of pre-processed masked images, the output corresponding to the detection of one or more pre-determined objects within the masked images.
2. The device of claim 1, wherein the detected object data output is a bounding box of the detected object.
3. The device of claim 1, wherein the detected object data output is a pixel-level segmentation of the detected object.
4. The device of claim 1 wherein the specialized convolutional neural network further outputs semantic region segmentation data.
5. The device of claim 1, wherein the detected object data output is utilized to update histogram data relating to the location of the detected objects.
6. The device of claim 5, wherein the histogram data is utilized to generate an attention mask for applying to subsequent video image data.
7. The device of claim 6, wherein the histogram data is a two-dimensional histogram corresponding to the dimensions of the images within the video image data.
8. The device of claim 7, wherein the histogram data is utilized to generate a binary output for each pixel within the images within the video image data.
9. The device of claim 8, wherein the binary output values are generated in relation to a pre-determined threshold value.
10. The device of claim 9, wherein the pre-determined threshold value is dynamically changed based on a semantic segmentation region generated from the specialized convolutional neural network output.
11. The device of claim 1, wherein the generation of the attention mask is performed within an external training server communicatively coupled to the device.
12. The device of claim 1, wherein the received video image data is acquired from a stationary camera, and in response to the movement of the stationary camera, a request for a new attention mask is generated.
13. The device of claim 1, wherein in response to a pre-determined time threshold being exceeded, the device requests the generation of a new attention mask.
14. The device of claim 1, wherein the detection of one or more pre-determined objects within the masked images within the video image data generates a notification that further analysis is required.
15. A method of detecting pre-determined objects within video images, comprising: configuring a neural network to receive a series of images for object detection; receiving a sample set of images as video image data; transferring the received sample set of images to a server configured to generate attention masks by processing the received sample set of images through a convolutional neural network specialized to output detected object data; receiving a generated attention mask configured for use with the series of images received from a stationary camera; applying the attention mask to the series of images within the video image data received from the stationary camera to generate a series of masked images; and processing the masked images within the neural network to generate an output indicating the presence of one or more pre-determined objects.
16. The method of claim 15, wherein the server is a training server, and wherein the transferring to the training server also includes the transmission of configuration data.
17. The method of claim 16, wherein the configuration data includes threshold value derivation parameters.
18. The method of claim 16, wherein the training server is selected based on the type of object selected for detection.
19. A device comprising: a processor configured to process and detect objects within video images, by executing a neural network and further comprising: a series of video image data for processing; an attention mask generated based on a training set of video image data processed by a convolutional neural network specialized to output detected object data, wherein an image tensor of the video image data is pre-processed with the attention mask to generate a series of pre-processed masked images; and wherein the neural network is configured to process the series of pre-processed masked images and generate an output for the series of masked images, the output corresponding to the detection of one or more objects within the image data.
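Claims 2 and 5 through 9 describe accumulating detected-object locations into a two-dimensional histogram and thresholding it into a per-pixel binary output. The following is a minimal sketch of one way that could work, not the claimed implementation. The `(x0, y0, x1, y1)` box format with exclusive upper bounds, the function names, and the "count greater than or equal to threshold" rule are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def accumulate_histogram(boxes: Sequence[Box], width: int, height: int) -> List[List[int]]:
    """Count, for each pixel, how many detection boxes covered it."""
    hist = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image so out-of-range detections are ignored.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                hist[y][x] += 1
    return hist


def histogram_to_mask(hist: List[List[int]], threshold: int) -> List[List[int]]:
    """Binary output per pixel: 1 where the count reaches the threshold."""
    return [[1 if count >= threshold else 0 for count in row] for row in hist]


# Three hypothetical detections from sample frames of a 4x3 image.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (1, 0, 3, 2)]
hist = accumulate_histogram(boxes, width=4, height=3)
mask = histogram_to_mask(hist, threshold=2)
```

Claim 10 describes varying the threshold by semantic region. In a sketch like this, that would presumably mean passing a per-pixel or per-region threshold instead of a single integer, though the claims do not say how the value is derived.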
structured as a network, e.g. client-server architectures · CPC title
in video content (extracting overlay text G06V20/62; video retrieval G06F16/70; processing of video elementary streams in video servers H04N21/234; processing of video elementary streams in video clients H04N21/44) · CPC title
Physics · mapped topic
Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns · CPC title
Segmentation; Edge detection (motion-based segmentation G06T7/215) · CPC title