Multi-attention machine learning for object detection and classification
US-12299997-B1 · May 13, 2025 · US
US12499555B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499555-B2 |
| Application number | US-202318227453-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 28, 2023 |
| Priority date | Jul 28, 2023 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
A method for video instance segmentation is provided. The method includes inputting a plurality of video frames collected by a sensor of a vehicle to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets, and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, controlling the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.
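The early-exit test in the abstract can be illustrated with a short sketch. It is not the patented implementation: the `early_exit` function, the `mean_abs_difference` measure, and the flat lists of floats standing in for per-layer segmentation outputs are all assumptions made for illustration. The sketch only shows the comparison of consecutive layer outputs against a threshold; the vehicle-control step that would consume the returned output is left out.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# Hypothetical stand-in for one layer's segmentation output.
Output = Sequence[float]


def mean_abs_difference(a: Output, b: Output) -> float:
    """Assumed difference measure: mean absolute element-wise gap."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def early_exit(
    layer_outputs: Iterable[Output],
    threshold: float,
    difference: Callable[[Output, Output], float] = mean_abs_difference,
) -> Optional[Tuple[int, Output]]:
    """Return (0-based index, output) of the first layer whose output differs
    from the previous layer's output by less than `threshold`.
    Returns None if no consecutive pair is that close."""
    previous = None
    for index, current in enumerate(layer_outputs):
        if previous is not None and difference(previous, current) < threshold:
            return index, current
        previous = current
    return None


# Toy outputs from four successive layers (made-up numbers).
outputs = [[0.1, 0.9], [0.4, 0.6], [0.45, 0.55], [0.46, 0.54]]
result = early_exit(outputs, threshold=0.1)
# result == (2, [0.45, 0.55]): layers 1 and 2 differ by about 0.05
```

Because `layer_outputs` is consumed lazily, passing a generator that runs one layer per step means later layers are never evaluated once the exit condition is met, which is the point of an early exit.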
Opening claim text (preview).
What is claimed is:

1. A method for video instance segmentation, the method comprising: inputting a plurality of video frames collected by a sensor of a vehicle to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, controlling the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

2. The method of claim 1, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based model, and the n+1-st layer of the trained machine learning model is an n+1-st layer of the transformer-based model.

3. The method of claim 1, further comprising: preprocessing video data collected by the sensor of the vehicle; and determining whether the plurality of video frames is the same as or greater than a threshold number; and in response to determining that the plurality of video frames is the same as or greater than a threshold number, inputting the plurality of video frames to the trained machine learning model.

4. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation masks of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing boundaries of instance segmentation masks in the n-th output and boundaries of instance segmentation masks in the n+1-st output.

5. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing classified objects in the n-th output and classified objects in the n+1-st output.

6. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing pixels of instance segmentation masks in the n-th output and pixels of instance segmentation masks in the n+1-st output.

7. The method of claim 1, further comprising: training an initial machine learning model to obtain the trained machine learning model by: training the deep learning model of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output; and training the early-exit subnets of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output.

8. The method of claim 7, further comprising: optimizing the trained initial machine learning model by removing redundant or unnecessary layers or parameters of the trained initial machine learning model.

9. A vehicle comprising: a sensor configured to collect a plurality of video frames; and a controller programmed to: input the plurality of video frames collected by the sensor to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, control the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

10. The vehicle of claim 9, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based model, and the n+1-st layer of the trained machine learning model is an n+1-st layer of the transformer-based model.

11. The vehicle of claim 9, wherein the controller is further programmed to: preprocess video data collected by the sensor; and determine whether the plurality of video frames is the same as or greater than a threshold number; and in response to determining that the plurality of video frames is the same as or greater than a threshold number, input the plurality of video frames to the trained machine learning model.

12. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation masks of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing boundaries of instance segmentation masks in the n-th output and boundaries of instance segmentation masks in the n+1-st output.

13. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing classified objects in the n-th output and classified objects in the n+1-st output.

14. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing pixels of instance segmentation masks in the n-th output and pixels of instance segmentation masks in the n+1-st output.

15. The vehicle of claim 9, wherein the vehicle autonomously drives based on the n+1-st output.

16. A system comprising: a server; and a vehicle comprising: a sensor configured to collect a plurality of video frames; and a processor programmed to: input the plurality of video frames collected by the sensor to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, control the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

17. The system of claim 16, wherein the server is further programmed to: train an initial machine learning model to obtain the trained machine learning model by: training the deep learning model of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output; and training the early-exit subnets of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output.

18. The system of claim 16, wherein the server is programmed to: optimize the trained initial machine learning model by removing redundant or unnecessary layers or parameters of the trained initial machine learning model.

19. The system of claim 16, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based mode
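Claims 4 to 6 (and their vehicle counterparts, claims 12 to 14) name three ways to measure the difference between consecutive outputs: comparing mask boundaries, comparing classified objects, and comparing mask pixels. The claims do not give formulas, so the sketch below is only one possible reading. The function names, the binary nested-list mask format, the 4-neighbour boundary rule, and the Jaccard-style scores are all assumptions for illustration.

```python
from typing import Sequence, Set, Tuple

# Assumed format: a binary mask as rows of 0/1, non-empty, same shape per pair.
Mask = Sequence[Sequence[int]]


def pixel_difference(a: Mask, b: Mask) -> float:
    """Fraction of pixels whose mask value differs between the two outputs."""
    total = sum(len(row) for row in a)
    changed = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return changed / total


def boundary(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    height, width = len(mask), len(mask[0])
    edge = set()
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < height and 0 <= cc < width)
                if outside or not mask[rr][cc]:
                    edge.add((r, c))
                    break
    return edge


def boundary_difference(a: Mask, b: Mask) -> float:
    """One minus the Jaccard overlap of the two boundary pixel sets."""
    edge_a, edge_b = boundary(a), boundary(b)
    union = edge_a | edge_b
    if not union:
        return 0.0
    return 1.0 - len(edge_a & edge_b) / len(union)


def class_difference(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of class labels that appear in only one of the two outputs."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a ^ set_b) / len(union)


# Toy masks for layer n and layer n+1 (made-up data).
mask_n = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
mask_n1 = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]
pixel_gap = pixel_difference(mask_n, mask_n1)  # 2 of 16 pixels differ: 0.125
```

Each function returns a value in the range 0 to 1, so any of them could be compared against a single threshold value in the sense of claim 1. Which measure and threshold suit a given deployment is not stated in the claims.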
Proximity, similarity or dissimilarity measures · CPC title
using classification, e.g. of video objects · CPC title
Training; Learning · CPC title
exterior to a vehicle by using sensors mounted on the vehicle · CPC title
Artificial neural networks [ANN] · CPC title