What technology area does this patent fall under?

Primary CPC classification G06V10/771. Mapped technology areas include Physics.

When was this patent published?

Publication date May 13, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Efficiency of vision transformers with adaptive token pruning

US12299960B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12299960-B2
Application number	US-202217978959-A
Country	US
Kind code	B2
Filing date	Nov 1, 2022
Priority date	May 10, 2022
Publication date	May 13, 2025
Grant date	May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and a method are disclosed for training a vision transformer. A token distillation loss of an input image based on a teacher network classification token and a token importance score of a student network (the vision transformer during training) are determined at a pruning layer of the vision transformer. When a current epoch number is odd, sparsification of tokens of the input image is skipped and the dense input image is processed by layers that are subsequent to the pruning layer. When the current epoch number is even, tokens of the input image are pruned at the pruning layer and processed by layers that are subsequent to the pruning layer. A label loss and a total loss for the input image are determined by the subsequent layers and the student network is updated.
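The alternating schedule in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the list-based token representation, and the assumption that epochs are counted as plain integers are all hypothetical. The threshold rule follows the wording of the dependent claims (tokens whose importance score is below a predetermined threshold are pruned).

```python
def should_prune(epoch: int) -> bool:
    """Even epochs prune at the pruning layer; odd epochs stay dense."""
    return epoch % 2 == 0


def threshold_prune(tokens, scores, threshold):
    """Keep only tokens whose importance score is not below the threshold."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]


def tokens_after_pruning_layer(tokens, scores, epoch, threshold):
    """Return the tokens handed to the layers after the pruning layer.

    Hypothetical helper: on odd epochs sparsification is skipped and the
    dense token list is passed on; on even epochs low-score tokens are dropped.
    """
    if should_prune(epoch):
        return threshold_prune(tokens, scores, threshold)
    return list(tokens)


tokens = ["t0", "t1", "t2"]
scores = [0.5, 0.1, 0.4]
dense = tokens_after_pruning_layer(tokens, scores, epoch=1, threshold=0.2)
sparse = tokens_after_pruning_layer(tokens, scores, epoch=2, threshold=0.2)
```

In this sketch the distillation loss would be computed at the pruning layer in both kinds of epoch, while the label loss and total loss come from whatever tokens reach the later layers.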

First claim

Opening claim text (preview).

What is claimed is:
1. A method to train a vision transformer, the method comprising: determining, at a pruning layer P of the vision transformer, a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network at the pruning layer P, the input image being part of an image database used to train the vision transformer for a predetermined number of epochs, and the student network comprising the vision transformer during training; processing the input image by layers of the vision transformer that are subsequent to the pruning layer P by skipping sparsification of tokens of the input image at the pruning layer P based on a current epoch being an odd number; processing the input image by layers of the vision transformer that are subsequent to the pruning layer P by pruning tokens of the input image at the pruning layer P based on the current epoch being an even number; determining a label loss L_loss and a total loss L for the input image after processing the input image by layers of the vision transformer that are subsequent to the pruning layer P; and updating the student network of the vision transformer based on the label loss L_loss and the total loss L for the input image.
2. The method of claim 1, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
3. The method of claim 1, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
4. The method of claim 3, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to greater than the predetermined threshold value.
5. The method of claim 1, wherein pruning tokens of the input image at the pruning layer P prunes tokens of the input image using a token mask M.
6. The method of claim 1, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leiber divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.
7. The method of claim 1, wherein the pruning layer P comprises a third layer of the vision transformer.
8. A vision transformer, comprising a first group of layers outputs a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network, the input image being part of an image database used to train the vision transformer for a first predetermined number of epochs, and the student network comprising the vision transformer during training; and a second group of layers that are subsequent to the first group of layers that are trained by: processing the input image by the second group of layers by skipping sparsification of tokens of the input image within the first group of layers based on a current epoch being an odd number, processing the input image by the second group of layers by pruning tokens of the input image within the first group of layers based on the current epoch being an even number, determining a label loss L_loss and a total loss L for the input image after processing the input image by the second group of layers, and updating the student network of the vision transformer based on the label loss L_loss and the total loss L for the input image.
9. The vision transformer of claim 8, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
10. The vision transformer of claim 8, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
11. The vision transformer of claim 10, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to greater than the predetermined threshold value.
12. The vision transformer of claim 8, wherein pruning tokens of the input image comprises pruning tokens of the input image using a token mask M.
13. The vision transformer of claim 8, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leiber divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.
14. The vision transformer of claim 8, wherein the first group of layers comprises a first three layers of the vision transformer.
15. A method to train a vision transformer, the method comprising: determining, at an output of a first group of layers of the vision transformer, a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network, the input image being part of an image database used to train the vision transformer for a predetermined number of epochs, and the student network comprising the vision transformer during training; processing the input image by a second group of layers of the vision transformer that are subsequent to the first group of layers by skipping sparsification of tokens of the input image within the first group of layers based on a current epoch being an odd number; processing the input image by the second group of layers by pruning tokens of the input image within the first group of layers using a token mask M based on the current epoch being an even number; determining a label loss L_loss and a total loss L for the input image after processing the input image through the second group of layers; and updating the student network of the vision transformer based on the label loss L_loss and a total loss L for the input image.
16. The method of claim 15, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
17. The method of claim 15, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.
18. The method of claim 17, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to greater than the predetermined threshold value.
19. The method of claim 15, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leiber divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.
20. The method of claim 15, wherein the first group of layers comprises a first three layers of the vision transformer.
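Two ideas from the dependent claims lend themselves to a short illustration: the adaptive selection of a minimum group of highest-scoring tokens whose scores reach the threshold, expressed as a token mask, and a divergence-based distillation term. The Python sketch below is one possible reading, not the patent's implementation. The function names, the 1 = keep / 0 = prune mask convention, the treatment of both inputs as probability distributions over tokens, and the direction of the divergence (teacher relative to student) are all assumptions made for illustration.

```python
import math


def adaptive_token_mask(scores, threshold):
    """Build a keep/prune mask from token importance scores.

    Keeps the smallest set of highest-scoring tokens whose scores sum to at
    least `threshold`; every other token is marked for pruning.
    Mask convention (assumed): 1 = keep, 0 = prune.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    running = 0.0
    for i in order:
        if running >= threshold:
            break  # the kept group already reaches the threshold
        mask[i] = 1
        running += scores[i]
    return mask


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    Here p stands in for a teacher-derived distribution over tokens and q for
    the student's token importance scores (both assumed to sum to 1).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


scores = [0.5, 0.1, 0.3, 0.1]
mask = adaptive_token_mask(scores, threshold=0.75)
loss_same = kl_divergence(scores, scores)
loss_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

With these scores, the two highest-scoring tokens already sum past 0.75, so only they are kept. A fixed-threshold variant, as in claims 2, 9 and 16, would instead compare each score to the threshold individually.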

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G06V10/764
using classification, e.g. of video objects · CPC title
G06V10/776
Validation; Performance evaluation · CPC title
G06V10/273
removing elements interfering with the pattern to be recognised · CPC title
G06V10/82
using neural networks · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title

Patent family

Related publications grouped by family.

View patent family 86328945

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12299960B2 cover?: A system and a method are disclosed for training a vision transformer. A token distillation loss of an input image based on a teacher network classification token and a token importance score of a student network (the vision transformer during training) are determined at a pruning layer of the vision transformer. When a current epoch number is odd, sparsification of tokens of the input image is…
Who is the assignee on this patent?: Samsung Electronics Co Ltd
What technology area does this patent fall under?: Primary CPC classification G06V10/771. Mapped technology areas include Physics.
When was this patent published?: Publication date May 13, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Trust-region aware neural network architecture search for knowledge distillation

Self-supervised representation learning paradigm for medical images

Medical visual question answering

Contrastive Pre-Training for Language Tasks
