Efficiency of vision transformers with adaptive token pruning

US12299960B2 · US · B2

Patent metadata
Publication number: US-12299960-B2
Application number: US-202217978959-A
Country: US
Kind code: B2
Filing date: Nov 1, 2022
Priority date: May 10, 2022
Publication date: May 13, 2025
Grant date: May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and a method are disclosed for training a vision transformer. A token distillation loss of an input image based on a teacher network classification token and a token importance score of a student network (the vision transformer during training) are determined at a pruning layer of the vision transformer. When a current epoch number is odd, sparsification of tokens of the input image is skipped and the dense input image is processed by layers that are subsequent to the pruning layer. When the current epoch number is even, tokens of the input image are pruned at the pruning layer and processed by layers that are subsequent to the pruning layer. A label loss and a total loss for the input image are determined by the subsequent layers and the student network is updated.
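The alternating schedule described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: epochs are assumed to be counted from 1, and the function names, the example scores, and the fixed threshold of 0.1 are hypothetical rather than taken from the patent. Only the odd/even rule and the "prune scores below a threshold" rule come from the source text.

```python
from typing import List, Sequence


def should_prune(epoch: int) -> bool:
    """Even-numbered epochs prune tokens; odd-numbered epochs stay dense."""
    return epoch % 2 == 0


def prune_tokens(tokens: Sequence[str], scores: Sequence[float],
                 threshold: float) -> List[str]:
    """Drop tokens whose importance score is below the threshold."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]


def tokens_for_later_layers(tokens: Sequence[str], scores: Sequence[float],
                            epoch: int, threshold: float) -> List[str]:
    """Return the tokens passed on to the layers after the pruning layer."""
    if should_prune(epoch):
        return prune_tokens(tokens, scores, threshold)
    # Odd epoch: sparsification is skipped and all tokens pass through.
    return list(tokens)


# Hypothetical tokens and importance scores, for illustration only.
tokens = ["t0", "t1", "t2", "t3"]
scores = [0.5, 0.05, 0.4, 0.05]

for epoch in range(1, 5):
    kept = tokens_for_later_layers(tokens, scores, epoch, threshold=0.1)
    print(epoch, kept)
```

In this sketch the later layers see all four tokens on epochs 1 and 3 and only the two high-scoring tokens on epochs 2 and 4. The loss computation and the weight update are left out, since the abstract does not specify how the losses are combined.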

First claim

Opening claim text (preview).

What is claimed is:

1. A method to train a vision transformer, the method comprising: determining, at a pruning layer P of the vision transformer, a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network at the pruning layer P, the input image being part of an image database used to train the vision transformer for a predetermined number of epochs, and the student network comprising the vision transformer during training; processing the input image by layers of the vision transformer that are subsequent to the pruning layer P by skipping sparsification of tokens of the input image at the pruning layer P based on a current epoch being an odd number; processing the input image by layers of the vision transformer that are subsequent to the pruning layer P by pruning tokens of the input image at the pruning layer P based on the current epoch being an even number; determining a label loss L_loss and a total loss L for the input image after processing the input image by layers of the vision transformer that are subsequent to the pruning layer P; and updating the student network of the vision transformer based on the label loss L_loss and the total loss L for the input image.

2. The method of claim 1, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

3. The method of claim 1, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

4. The method of claim 3, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to or greater than the predetermined threshold value.

5. The method of claim 1, wherein pruning tokens of the input image at the pruning layer P prunes tokens of the input image using a token mask M.

6. The method of claim 1, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leibler divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.

7. The method of claim 1, wherein the pruning layer P comprises a third layer of the vision transformer.

8. A vision transformer, comprising: a first group of layers that outputs a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network, the input image being part of an image database used to train the vision transformer for a first predetermined number of epochs, and the student network comprising the vision transformer during training; and a second group of layers that are subsequent to the first group of layers that are trained by: processing the input image by the second group of layers by skipping sparsification of tokens of the input image within the first group of layers based on a current epoch being an odd number, processing the input image by the second group of layers by pruning tokens of the input image within the first group of layers based on the current epoch being an even number, determining a label loss L_loss and a total loss L for the input image after processing the input image by the second group of layers, and updating the student network of the vision transformer based on the label loss L_loss and the total loss L for the input image.

9. The vision transformer of claim 8, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

10. The vision transformer of claim 8, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

11. The vision transformer of claim 10, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to or greater than the predetermined threshold value.

12. The vision transformer of claim 8, wherein pruning tokens of the input image comprises pruning tokens of the input image using a token mask M.

13. The vision transformer of claim 8, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leibler divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.

14. The vision transformer of claim 8, wherein the first group of layers comprises a first three layers of the vision transformer.

15. A method to train a vision transformer, the method comprising: determining, at an output of a first group of layers of the vision transformer, a token distillation loss L_distill of an input image based on a teacher network classification token CLS and a token importance score TIS_P of a student network, the input image being part of an image database used to train the vision transformer for a predetermined number of epochs, and the student network comprising the vision transformer during training; processing the input image by a second group of layers of the vision transformer that are subsequent to the first group of layers by skipping sparsification of tokens of the input image within the first group of layers based on a current epoch being an odd number; processing the input image by the second group of layers by pruning tokens of the input image within the first group of layers using a token mask M based on the current epoch being an even number; determining a label loss L_loss and a total loss L for the input image after processing the input image through the second group of layers; and updating the student network of the vision transformer based on the label loss L_loss and a total loss L for the input image.

16. The method of claim 15, wherein pruning tokens of the input image comprises pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

17. The method of claim 15, wherein pruning tokens of the input image comprises adaptively pruning tokens of the input image having a token importance score that is less than a predetermined threshold value.

18. The method of claim 17, wherein pruning tokens of the input image comprises pruning tokens that are not in a group of a minimum number of highest-weighted tokens having token importance scores that sum to be equal to or greater than the predetermined threshold value.

19. The method of claim 15, wherein the token distillation loss L_distill of the input image is further based on Kullback-Leibler divergence of the teacher network classification token CLS and the token importance score TIS_P of the student network.

20. The method of claim 15, wherein the first group of layers comprises a first three layers of the vision transformer.
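Two of the dependent claims lend themselves to a small worked example: the adaptive rule that keeps the smallest group of highest-scoring tokens whose scores reach the threshold (expressed here as a token mask M), and the Kullback-Leibler divergence used in the distillation loss. The sketch below is a hedged illustration in plain Python, not the patented implementation. The function names and example numbers are hypothetical, and it assumes that both the teacher CLS signal and the student token importance scores have already been normalized into probability distributions over tokens, and that the divergence is taken as KL(teacher || student); the claims do not state either detail.

```python
import math
from typing import List, Sequence


def adaptive_token_mask(scores: Sequence[float], threshold: float) -> List[int]:
    """Build a mask M (1 = keep, 0 = prune) over the tokens.

    Keeps the fewest highest-scoring tokens whose scores sum to at
    least `threshold`; every other token is pruned.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    running = 0.0
    for i in order:
        if running >= threshold:
            break  # The kept group already reaches the threshold.
        mask[i] = 1
        running += scores[i]
    return mask


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length.

    `eps` guards against division by zero (an assumption of this sketch).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical normalized values, for illustration only.
teacher_cls = [0.5, 0.05, 0.4, 0.05]    # assumed teacher distribution over tokens
student_tis = [0.25, 0.25, 0.25, 0.25]  # assumed student importance scores

mask = adaptive_token_mask(teacher_cls, threshold=0.8)
distill = kl_divergence(teacher_cls, student_tis)
print(mask, distill)
```

With these numbers the mask keeps the first and third tokens, because their scores alone reach the 0.8 threshold. The number of kept tokens therefore varies per image, which is what distinguishes this rule from keeping a fixed top-k.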

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • using classification, e.g. of video objects

  • Validation; Performance evaluation

  • removing elements interfering with the pattern to be recognised

  • using neural networks

  • Convolutional networks [CNN, ConvNet]

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V10/771. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 13, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).