Systems and methods for efficient video instance segmentation for vehicles using edge computing

US12499555B2 · US · B2

Patent metadata
Field                 Value
Publication number    US-12499555-B2
Application number    US-202318227453-A
Country               US
Kind code             B2
Filing date           Jul 28, 2023
Priority date         Jul 28, 2023
Publication date      Dec 16, 2025
Grant date            Dec 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for video instance segmentation is provided. The method includes inputting a plurality of video frames collected by a sensor of a vehicle to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets, and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, controlling the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.
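
The early-exit decision described in the abstract can be sketched as a loop over layers. This is a minimal illustration, not the patented implementation: the names `early_exit_inference` and `mean_abs_difference`, the list-of-floats output type, and the mean-absolute-difference measure are all assumptions made for the example. Layers and early-exit subnets are passed in as plain callables.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: each exit produces a flat list of scores. A real model would
# produce instance segmentation masks and class labels instead.
Output = List[float]


def mean_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical difference measure: mean absolute difference of two outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def early_exit_inference(
    frames: Output,
    layers: Sequence[Callable[[Output], Output]],
    exit_heads: Sequence[Callable[[Output], Output]],
    threshold: float,
) -> Tuple[Output, int]:
    """Run layers in order and stop once two consecutive exit outputs agree.

    Returns the (n+1)-st output and the index of the layer that produced it.
    If no pair of consecutive outputs differs by less than `threshold`,
    the output of the final layer is returned.
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")

    hidden = frames
    previous = None
    for index, (layer, head) in enumerate(zip(layers, exit_heads)):
        hidden = layer(hidden)      # features after this layer
        current = head(hidden)      # early-exit subnet turns features into an output
        if previous is not None and mean_abs_difference(previous, current) < threshold:
            return current, index   # n-th and (n+1)-st outputs agree: exit early
        previous = current
    return previous, len(layers) - 1


if __name__ == "__main__":
    # Toy layer that moves every value halfway toward 1.0, so successive
    # outputs converge and the difference shrinks at each step.
    toy_layer = lambda h: [(x + 1.0) / 2 for x in h]
    toy_head = lambda h: list(h)
    output, used = early_exit_inference([0.0, 0.0], [toy_layer] * 4, [toy_head] * 4, 0.2)
    print(output, used)
```

In this toy run the consecutive differences are 0.25 and then 0.125, so with a threshold of 0.2 the loop stops at the third layer and skips the fourth. The vehicle would then be controlled from the returned output.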

First claim

Opening claim text (preview).

What is claimed is:

1. A method for video instance segmentation, the method comprising: inputting a plurality of video frames collected by a sensor of a vehicle to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, controlling the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

2. The method of claim 1, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based model, and the n+1-st layer of the trained machine learning model is an n+1-st layer of the transformer-based model.

3. The method of claim 1, further comprising: preprocessing video data collected by the sensor of the vehicle; and determining whether the plurality of video frames is the same as or greater than a threshold number; and in response to determining that the plurality of video frames is the same as or greater than a threshold number, inputting the plurality of video frames to the trained machine learning model.

4. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation masks of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing boundaries of instance segmentation masks in the n-th output and boundaries of instance segmentation masks in the n+1-st output.

5. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing classified objects in the n-th output and classified objects in the n+1-st output.

6. The method of claim 1, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing pixels of instance segmentation masks in the n-th output and pixels of instance segmentation masks in the n+1-st output.

7. The method of claim 1, further comprising: training an initial machine learning model to obtain the trained machine learning model by: training the deep learning model of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output; and training the early-exit subnets of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output.

8. The method of claim 7, further comprising: optimizing the trained initial machine learning model by removing redundant or unnecessary layers or parameters of the trained initial machine learning model.

9. A vehicle comprising: a sensor configured to collect a plurality of video frames; and a controller programmed to: input the plurality of video frames collected by the sensor to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, control the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

10. The vehicle of claim 9, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based model, and the n+1-st layer of the trained machine learning model is an n+1-st layer of the transformer-based model.

11. The vehicle of claim 9, wherein the controller is further programmed to: preprocess video data collected by the sensor; and determine whether the plurality of video frames is the same as or greater than a threshold number; and in response to determining that the plurality of video frames is the same as or greater than a threshold number, input the plurality of video frames to the trained machine learning model.

12. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation masks of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing boundaries of instance segmentation masks in the n-th output and boundaries of instance segmentation masks in the n+1-st output.

13. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing classified objects in the n-th output and classified objects in the n+1-st output.

14. The vehicle of claim 9, wherein each of the n-th output and the n+1-st output includes instance segmentation of the video frames, and the difference between the n-th output and the n+1-st output is determined by comparing pixels of instance segmentation masks in the n-th output and pixels of instance segmentation masks in the n+1-st output.

15. The vehicle of claim 9, wherein the vehicle autonomously drives based on the n+1-st output.

16. A system comprising: a server; and a vehicle comprising: a sensor configured to collect a plurality of video frames; and a processor programmed to: input the plurality of video frames collected by the sensor to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a deep learning model and early-exit subnets; and in response to determining that a difference between the n-th output and the n+1-st output is less than a threshold value, control the vehicle based on the n+1-st output, the n+1-st output includes information about instances in the plurality of video frames.

17. The system of claim 16, wherein the server is further programmed to: train an initial machine learning model to obtain the trained machine learning model by: training the deep learning model of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output; and training the early-exit subnets of the initial machine learning model using a training data set including a plurality of video frames as input and instance segmentation masks as output.

18. The system of claim 16, wherein the server is programmed to: optimize the trained initial machine learning model by removing redundant or unnecessary layers or parameters of the trained initial machine learning model.

19. The system of claim 16, wherein the deep learning model is a transformer-based model, the n-th layer of the trained machine learning model is an n-th layer of the transformer-based mode
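
The dependent claims name three ways to measure the difference between the n-th and n+1-st outputs: comparing mask boundaries, comparing classified objects, and comparing mask pixels. The claims do not give formulas, so the sketch below is only one possible reading. The function names, the binary list-of-lists mask format, the 4-neighbour boundary rule, and the normalizations are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

# Assumption: a mask is a rectangular grid of 0/1 values for a single instance.
Mask = List[List[int]]


def pixel_difference(a: Mask, b: Mask) -> float:
    """Pixel comparison: fraction of pixels whose value differs between two masks."""
    total = sum(len(row) for row in a)
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def boundary_pixels(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a 4-neighbour that is background or off the image."""
    height, width = len(mask), len(mask[0])
    edge: Set[Tuple[int, int]] = set()
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < height and 0 <= nc < width)
                if outside or not mask[nr][nc]:
                    edge.add((r, c))
                    break
    return edge


def boundary_difference(a: Mask, b: Mask) -> float:
    """Boundary comparison: 1 minus the Jaccard overlap of the two boundary sets."""
    edge_a, edge_b = boundary_pixels(a), boundary_pixels(b)
    union = edge_a | edge_b
    if not union:
        return 0.0
    return 1.0 - len(edge_a & edge_b) / len(union)


def class_difference(a: Sequence[str], b: Sequence[str]) -> float:
    """Classified-object comparison: share of labels not matched in the other output."""
    count_a, count_b = Counter(a), Counter(b)
    unmatched = sum(((count_a - count_b) + (count_b - count_a)).values())
    total = sum(count_a.values()) + sum(count_b.values())
    return unmatched / total if total else 0.0


if __name__ == "__main__":
    mask_n = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    mask_n1 = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]]
    print(pixel_difference(mask_n, mask_n1))
    print(boundary_difference(mask_n, mask_n1))
    print(class_difference(["car", "person"], ["car", "person"]))
```

Each function returns 0.0 for identical outputs and a larger value as the outputs diverge, so any of them could be compared against the threshold value recited in the independent claims.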

Assignees

Inventors

Classifications

  • Proximity, similarity or dissimilarity measures · CPC title

  • G06V10/764 · Primary

    using classification, e.g. of video objects · CPC title

  • Training; Learning · CPC title

  • exterior to a vehicle by using sensors mounted on the vehicle · CPC title

  • Artificial neural networks [ANN] · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12499555B2 cover?
A method for video instance segmentation is provided. The method includes inputting a plurality of video frames collected by a sensor of a vehicle to a trained machine learning model to obtain an n-th output from an n-th layer of the trained machine learning model and an n+1-st output from an n+1-st layer of the trained machine learning model, the trained machine learning model comprising a dee…
Who is the assignee on this patent?
Toyota Eng & Mfg North America, Toyota Motor Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V10/764. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 16, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).