Combining compression, partitioning and quantization of DL models for fitment in hardware processors

US12430558B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12430558-B2
Application number  US-202117447625-A
Country             US
Kind code           B2
Filing date         Sep 14, 2021
Priority date       Jan 29, 2021
Publication date    Sep 30, 2025
Grant date          Sep 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Small and compact Deep Learning models are required for embedded AI in several domains. In many industrial use-cases, there are requirements to transform already trained models to ensemble embedded systems or re-train those for a given deployment scenario, with limited data for transfer learning. Moreover, the hardware platforms used in embedded applications include FPGAs, AI hardware accelerators, System-on-Chips and on-premises computing elements (Fog/Network Edge). These are interconnected through heterogeneous buses/networks with different capacities. The method of the present disclosure finds how to automatically partition a given DNN into ensemble devices, considering the effect of the accuracy-latency-power tradeoff due to intermediate compression and the effect of quantization due to conversion to AI accelerator SDKs. The method of the present disclosure is an iterative approach to obtain a set of partitions by repeatedly refining the partitions and generating a cascaded model for inference and training on ensemble hardware.
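
To make the partitioning idea in the abstract concrete, the following is a minimal sketch of splitting a layer sequence into k contiguous subsequences, one per processing element, and scoring each split by compute time plus the cost of moving intermediate outputs across the links. It is an illustration only, not the patented method: the function names, the brute-force enumeration, the latency-only cost model, and every number below are assumptions, and accuracy, power, compression, and quantization are left out.

```python
from itertools import combinations


def contiguous_partitions(n_layers, k):
    """Yield every way to cut n_layers into k non-empty contiguous runs."""
    for cuts in combinations(range(1, n_layers), k - 1):
        bounds = (0, *cuts, n_layers)
        yield [range(bounds[i], bounds[i + 1]) for i in range(k)]


def pipeline_latency(parts, layer_ms, out_kb, link_kb_per_ms):
    """Toy cost model: per-PE compute time plus transfer time on each link."""
    total = 0.0
    for pe, part in enumerate(parts):
        total += sum(layer_ms[pe][i] for i in part)
        if pe < len(parts) - 1:
            # The output of the last layer on this PE crosses the link
            # to the next PE.
            total += out_kb[part[-1]] / link_kb_per_ms[pe]
    return total


# Hypothetical measurements for a 4-layer model on 2 processing elements.
layer_ms = [
    [4.0, 6.0, 8.0, 5.0],  # per-layer latency on PE 0 (say, a CPU)
    [1.0, 2.0, 2.0, 1.0],  # per-layer latency on PE 1 (say, an accelerator)
]
out_kb = [512.0, 64.0, 32.0, 4.0]  # size of each layer's output
link_kb_per_ms = [16.0]            # capacity of the PE 0 -> PE 1 link

best = min(
    contiguous_partitions(4, 2),
    key=lambda p: pipeline_latency(p, layer_ms, out_kb, link_kb_per_ms),
)
print([list(p) for p in best])  # [[0, 1], [2, 3]]
print(pipeline_latency(best, layer_ms, out_kb, link_kb_per_ms))  # 17.0
```

With these made-up numbers, cutting after the first layer is expensive because its large output has to cross a slow link, so the cheapest split keeps the first two layers on PE 0. Exhaustive enumeration is only practical for small models, which is one reason an iterative search such as the one the abstract describes would be preferred in practice.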

First claim

Opening claim text (preview).

What is claimed is:

1. A processor implemented method, comprising:

  obtaining, via one or more hardware processors, a deep learning (DL) model and partitioning the DL model into a plurality of layers;

  partitioning, via the one or more hardware processors, the plurality of layers into k subsequences of layers based on k processing elements, wherein the k processing elements (PE) comprise a first processing element type;

  generating, for each layer in the plurality of layers of the DL model, via the one or more hardware processors, a transformed layer based on a selection of one or more compression configurations and one or more quantization configurations,
    wherein each of the one or more compression configurations and the one or more quantization configurations comprise a corresponding inference accuracy, a corresponding inference latency, and an inference power consumption, and wherein the corresponding inference accuracy, the corresponding inference latency, and the inference power consumption are obtained by executing a cascaded DL model inference program,
    wherein the one or more compression configurations and the one or more quantization configurations are either pre-defined or dynamically generated by respective compression and quantization units,
    wherein the one or more compression configurations and the one or more quantization configurations are generated based on a first scenario that includes a first processing element (PE) type and a second scenario that includes a second processing element (PE) type, wherein the first PE type and the second PE type are distinct from each other,
    wherein when (i) a p-th processing element from k processing elements is the second PE type and (ii) the one or more compression configurations and the one or more quantization configurations are unavailable, the one or more compression configurations and the one or more quantization configurations are generated by:
      freezing and converting a p-th subsequence of transformed layers from k subsequences of transformed layers for executing in a p-th processing element serving as the second PE type;
      determining an inference accuracy, an inference latency, and an inference power consumption for the p-th subsequence transformed layers on p-th PE, at run-time based on (i) an intermediate output from a preceding subsequence of transformed layers and (ii) an intermediate output to a next subsequence of transformed layers; and
      generating the one or more quantization configurations and the one or more compression configurations based on the determined inference accuracy, the inference latency, and the inference power consumption for the p-th subsequence of transformed layers;

  iteratively selecting, via the one or more hardware processors, at least one compression configuration and at least one quantization configuration for each of the transformed layers of the k subsequences of transformed layers, until each of the corresponding inference accuracy, the corresponding inference latency, and the inference power consumption reach a corresponding predefined threshold by:
    iteratively performing:
      assigning, via the one or more hardware processors, a first subsequence of transformed layers from the k subsequence of transformed layers to the first PE such that the first subsequences of transformed layers has a maximum inference accuracy, a minimum inference latency and a minimum power consumption on the first PE;
      partitioning, via the one or more hardware processors, a second subsequence of transformed layers from the k subsequences of transformed layers into a first set of transformed layers and a second set of transformed layers, and assigning the first set of transformed layers to the first PE and the second set of layers to the second PE, such that each of the first set of transformed layers and the second set of transformed layers has the maximum inference accuracy, the minimum inference latency and the minimum power consumption on the first PE and the second PE respectively, wherein the step of iteratively assigning and partitioning are performed to obtain a mapping of the first subsequence of transformed layers and the second subsequence of layers on the first PE and the second PE;
    partitioning, via the one or more hardware processors, the second subsequence of transformed layers of the DL model into two different partitions comprising of a set of transformed layers, and assigning the set of transformed layers from the two different partitions to the first PE and the second PE respectively, such that the second subsequences of layers has the maximum inference accuracy and the minimum inference latency, and the minimum power consumption on the first PE and the second PE; and
    continually partitioning, via the one or more hardware processors, subsequent subsequence of transformed layers into two partitions, each of the two partitions comprises a set of transformed layers, and assigning the set of transformed layers to an earlier processing element and a current processing element respectively, to obtain a mapping of k subsequences of transformed layers on the k processing elements, until the k subsequences of transformed layers have a maximum inference accuracy, a minimum inference latency and a minimum power consumption on the k processing elements (PE), wherein modification in index of the set of the transformed layers aids in obtaining the compression configurations and the quantization configurations;

  generating, via the one or more hardware processors, a DL model based on the mapped k subsequences of transformed layers on the k processing elements;

  executing, via the cascaded DL model inference program executed by the one or more hardware processors, the DL model and determining an overall inference accuracy and an overall inference latency, and an overall inference power consumption;

  rewarding or penalizing, via the one or more hardware processors, by implementing a re-inforcement learning (RL) technique, a sequence learning network based on a comparison of the overall inference accuracy and the overall inference latency, and the overall inference power consumption of (i) a current iteration of the cascaded DL model inference program generated by a selection of sequence of transformed layers and (ii) a previous iteration of the cascaded DL model inference program generated by a selection of sequence of transformed layers, wherein a reward is provided to the sequence learning network if there is an optimal assignment of k subsequences of the set of transformed layers on the k processing elements and wherein a penalty is assigned to the sequence learning network if there is a lack in the optimal assignment of k subsequences of the set of transformed layers on the k processing elements; and

  identifying, the generated DL model as a final DL model based on the mapped k subsequences of transformed layers on the k processing elements for the scenario.

2. The processor implemented method of claim 1, wherein when (i) a p-th processing element from the k processing elements is a second processing element type, (ii) the one or more compression configurations and the one or more quantization configurations are unavailable and (iii) training data is available for re-training of the obtained DL model, the one or more compression configurations and the one or more quantization configurations are generated by:

  freezing and converting a p-th subsequence of transformed layers from k subsequences of transformed layers for executing in a p-th processing element serving as the second PE type;

  re-training remaining transformed layers of the DL model without training the p-th subsequence of transformed layers deployed on the p-th processing element serving as the second PE type, wherein the remaining transformed layers of the DL model are retrained using an intermediate output of (i) a preceding subsequence
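
The claim's outer loop (measure a candidate cascaded model, compare it with the previous iteration, reward or penalize, stop once the thresholds are reached) can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the `Metrics` record, the dominance-based `reward` rule, the `search` helper, and the fixed candidate list are all hypothetical, and the list stands in for the proposals that a sequence learning network would generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float    # fraction in [0, 1]; higher is better
    latency_ms: float  # lower is better
    power_mw: float    # lower is better


def meets_thresholds(m, min_acc, max_latency_ms, max_power_mw):
    """True when all three metrics satisfy their predefined thresholds."""
    return (m.accuracy >= min_acc
            and m.latency_ms <= max_latency_ms
            and m.power_mw <= max_power_mw)


def reward(current, previous):
    """Assumed rule: +1 if current is no worse on every metric and differs
    from previous (so it is better on at least one), otherwise -1."""
    no_worse = (current.accuracy >= previous.accuracy
                and current.latency_ms <= previous.latency_ms
                and current.power_mw <= previous.power_mw)
    return 1 if no_worse and current != previous else -1


def search(candidates, min_acc, max_latency_ms, max_power_mw):
    """Walk candidates in order, scoring each against the previous one.
    Returns (index of first candidate meeting the thresholds, rewards)."""
    rewards = []
    previous = None
    for i, m in enumerate(candidates):
        if previous is not None:
            rewards.append(reward(m, previous))
        if meets_thresholds(m, min_acc, max_latency_ms, max_power_mw):
            return i, rewards
        previous = m
    return None, rewards


# Hypothetical measurements from three successive iterations.
candidates = [
    Metrics(0.91, 30.0, 700.0),
    Metrics(0.91, 22.0, 520.0),  # faster and cheaper, same accuracy
    Metrics(0.90, 17.0, 450.0),  # loses accuracy but meets all thresholds
]
print(search(candidates, min_acc=0.90, max_latency_ms=18.0, max_power_mw=480.0))
# (2, [1, -1])
```

The third candidate is penalized under the assumed rule because it gives up accuracy, yet it is the first to satisfy all three thresholds. A weighted scalar score would be an equally plausible reward; the claim text shown here does not specify the comparison beyond rewarding an optimal assignment and penalizing its absence.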

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

  • characterised by the process organisation or structure, e.g. boosting cascade · CPC title

  • G06N3/063 · Primary

    using electronic means · CPC title

  • Monitoring of events, devices or parameters that trigger a change in power modality · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

  • G06N3/082 · Primary

    modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12430558B2 cover?
Small and compact Deep Learning models are required for embedded AI in several domains. In many industrial use-cases, there are requirements to transform already trained models to ensemble embedded systems or re-train those for a given deployment scenario, with limited data for transfer learning. Moreover, the hardware platforms used in embedded application include FPGAs, AI hardware accelerato…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 30, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).