What technology area does this patent fall under?

Primary CPC classification G06N3/0454. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 1, 2023 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method for distributed type training adaptation and apparatus in deep learning framework and AI accelerator card

US11714995B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11714995-B2
Application number	US-202217739205-A
Country	US
Kind code	B2
Filing date	May 9, 2022
Priority date	Dec 8, 2021
Publication date	Aug 1, 2023
Grant date	Aug 1, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a method for distributed type training adaptation and apparatus in a deep learning framework and an AI accelerator card. The method includes the following steps: S1: the deep learning framework supports single-card configuration in a newly added AI accelerator card, and sub-steps thereof are as follows: S11: the deep learning framework supports new hardware; S12: the deep learning framework supports a device thread of the new hardware; S13: the deep learning framework supports a memory operation of the new hardware; and S14: the deep learning framework supports an operator kernel function of the new hardware; S2: the deep learning framework supports multi-card configuration in the newly added AI accelerator card; S3: the deep learning framework supports tensor segmentation and multi-card distribution; and S4: the deep learning framework supports multi-card collective communication in the newly added AI accelerator card.
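Sub-steps S11 and S14 of the abstract (adding a device type for the new hardware and registering operator kernel functions for it) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the framework code the patent describes: `DeviceType`, `KERNEL_REGISTRY`, `register_kernel`, and `dispatch` are hypothetical names, and the "kernel" simply runs on the host instead of calling a vendor library.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class DeviceType(Enum):
    """Device types known to this toy framework."""
    CPU = "cpu"
    NEW_CARD = "new_card"  # hypothetical field for the newly added accelerator


# Hypothetical registry: (operator name, device type) -> kernel function.
KERNEL_REGISTRY: Dict[Tuple[str, DeviceType], Callable[..., List[float]]] = {}


def register_kernel(op_name: str, device: DeviceType):
    """Return a decorator that registers a kernel for one operator on one device."""
    def decorator(fn):
        KERNEL_REGISTRY[(op_name, device)] = fn
        return fn
    return decorator


@register_kernel("add", DeviceType.NEW_CARD)
def add_on_new_card(a: List[float], b: List[float]) -> List[float]:
    # Stand-in for a vendor kernel launch; here it just runs on the host.
    return [x + y for x, y in zip(a, b)]


def dispatch(op_name: str, device: DeviceType, *args):
    """Look up and run the kernel registered for this operator and device."""
    try:
        kernel = KERNEL_REGISTRY[(op_name, device)]
    except KeyError:
        raise NotImplementedError(
            f"no kernel for {op_name!r} on {device.name}"
        ) from None
    return kernel(*args)


result = dispatch("add", DeviceType.NEW_CARD, [1.0, 2.0], [3.0, 4.0])
print(result)
```

Keying the registry on the pair (operator, device type) means that supporting another card only adds entries; existing kernels and the dispatch path stay untouched.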

First claim

Opening claim text (preview).

What is claimed is:

1. A method for distributed type training adaptation in a deep learning framework and an Artificial Intelligence (AI) accelerator card, comprising the following steps:
S1: supporting single-card configuration in a newly added AI accelerator card by the deep learning framework, wherein sub-steps thereof are as follows:
S11: supporting new hardware by the deep learning framework, wherein sub-steps are as follows:
S111: adding a new hardware device field by the deep learning framework: adding a new device type to the deep learning framework, creating an enumeration class of the device type, and adding a device field corresponding to the new hardware to the device type;
S112: registering the new hardware into the deep learning framework: registering the new hardware device field added in step S111 into the framework;
S113: adding context information of a new hardware basic software library to the deep learning framework: newly adding a class of the context information of the new hardware basic software library, adding a handle member variable configured to store a context information structure of the new hardware basic software library and a member function for obtaining a handle of the context information of the new hardware basic software library, but not specifying a device ID of a specific accelerator card managed by the context information; and
S114: adding a program flow of the new hardware basic software library to the deep learning framework, comprising newly adding a handle for obtaining and executing a new hardware program flow and a new hardware program flow index generator;
S12: supporting a device thread of the new hardware by the deep learning framework, wherein sub-steps are as follows:
S121: adding a thread structure of a new hardware device: creating a hardware device type thread structure, comprising following member variables: a task executor polling thread, a current task queue, a recall event, an event recall queue, and an event recall queue polling thread; and
S122: registering the thread structure of the new hardware device: registering the hardware device thread structure added in step S121 into the deep learning framework;
S13: supporting a memory operation of the new hardware by the deep learning framework, wherein sub-steps are as follows:
S131: adding a memory type field of the new hardware device;
S132: applying for a memory of a new hardware device type;
S133: adding a memory copy interface of the new hardware device type; and
S134: adding a memory allocation interface of the new hardware device; and
S14: supporting an operator kernel function of the new hardware by the deep learning framework, wherein sub-steps are as follows:
S141: adding a constructor of the operator kernel function that supports the new hardware device type: newly adding the constructor of the operator kernel function that supports the new hardware device type, the above constructor being configured to create the function of different operators that support the new hardware device type;
S142: registering the operator kernel function that supports the new hardware device type;
S143: registering kernel functions of system-level operators: using the constructor of the operator kernel function that supports the new hardware device type added in steps S141 and S142 and registering the operator kernel function that supports the new hardware device type to sequentially create and register following system-level operators: an input/output operator, a kernel function loading operator, and a weight/bias variable operator; and
S144: newly adding kernel functions of user-level operators: adding forward/reverse kernel functions of different operators supporting the new hardware device type required for building a deep neural network model and registering into the corresponding operators;
S2: supporting multi-card configuration in the newly added AI accelerator card by the deep learning framework, and requiring the deep learning framework to support context information of new hardware multi-card management, wherein specific sub-steps are as follows:
S21: newly adding the class of the context information of the new hardware basic software library;
S22: adding a member variable that stores and manages a container type of a plurality of context information structure handles of a plurality of accelerator cards; and
S23: adding a member function for obtaining the above context information handles, wherein a function of the function is to initialize the new hardware basic software library, and the function obtains the corresponding container member of the above context information according to a device ID of a specified on-chip accelerator card;
S3: supporting tensor segmentation and multi-card distribution by the deep learning framework, wherein specific sub-steps are as follows:
S31: supporting tensor segmentation by the deep learning framework: in a distributed type training process of the deep learning framework, deriving to generate a physical calculation graph from a logical calculation graph in a compilation process of a deep learning compiler, and allocating a tensor between upstream and downstream nodes by using a tensor segmentation and broadcasting mode;
S32: creating and registering an asynchronous memory copier of the new hardware device; and
S33: supporting tensor multi-card distribution by the deep learning framework: using the asynchronous memory copier of the new hardware device registered in step S32 to distribute tensor components obtained by segmentation in step S31 to a plurality of cards, specific steps are: first, determining a memory copy situation according to a source memory type and a target memory type; and secondly, using the asynchronous memory copier of the new hardware device in step S32 to distribute the tensor components segmented in step S31 to the plurality of cards for the determined memory copy situation; and
S4: supporting multi-card collective communication in the newly added AI accelerator card by the deep learning framework, wherein a goal is to aggregate forward calculation results of all the cards for each card, according to different aggregation modes, sub-steps of supporting multi-card collective communication in the newly added AI accelerator card by the deep learning framework comprise two solutions of collective communication based on Ring AllReduce operation and collective communication based on AllReduce operation: collective communication based on Ring AllReduce operation means that each card aggregates the forward calculation results of all the cards in a mode of tensor addition; and the collective communication mode based on AllReduce operation is to use a host as a central node, first globally reduce and receive data of all other nodes, and then broadcast back to all other nodes after local calculation.

2. The method for distributed type training adaptation in the deep learning framework and the AI accelerator card according to claim 1, wherein in step S113, the member function for obtaining the handle of the context information of the new hardware basic software library is added, a function of the function is to initialize the new hardware basic software library, hardware resources are allocated on the host, and the function must be called firstly before any other new hardware basic software library function is called.

3. The method for distributed type training adaptation in the deep learning framework and the AI accelerator card according to claim 1, wherein in step S114, a specific method for newly adding a handle for obtaining and executing a new hardware program flow is: according to a handle of the context information structure of the new hardware basic software library obtained in step S113, creating the handle for executing the program flow; and a function of the newly ad
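The two aggregation modes named in step S4 of claim 1 can be compared with a host-side simulation. The sketch below is an assumption-laden illustration, not the claimed implementation: each "card" is a plain Python list, `host_allreduce` and `ring_allreduce` are hypothetical names, and the ring variant uses the standard textbook reduce-scatter plus all-gather schedule, which the claim text itself does not spell out.

```python
from typing import List


def host_allreduce(card_data: List[List[float]]) -> List[List[float]]:
    """Host-centred variant: the host sums all inputs, then broadcasts the sum."""
    total = [sum(values) for values in zip(*card_data)]
    return [list(total) for _ in card_data]


def ring_allreduce(card_data: List[List[float]]) -> List[List[float]]:
    """Ring variant: each card only talks to its right-hand neighbour."""
    n = len(card_data)
    length = len(card_data[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the card count")
    chunk = length // n
    bufs = [list(v) for v in card_data]  # work on copies of the inputs

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    def exchange(first_chunk_offset: int, accumulate: bool) -> None:
        for step in range(n - 1):
            # Snapshot every payload first so all sends happen "simultaneously".
            sends = []
            for i in range(n):
                c = (i + first_chunk_offset - step) % n
                sends.append((c, [bufs[i][k] for k in span(c)]))
            for i, (c, payload) in enumerate(sends):
                dst = (i + 1) % n
                for k, v in zip(span(c), payload):
                    bufs[dst][k] = bufs[dst][k] + v if accumulate else v

    # Reduce-scatter: afterwards card i holds the full sum of chunk (i + 1) % n.
    exchange(first_chunk_offset=0, accumulate=True)
    # All-gather: pass the finished chunks around the ring until every card has all.
    exchange(first_chunk_offset=1, accumulate=False)
    return bufs


cards = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
ring_result = ring_allreduce(cards)
host_result = host_allreduce(cards)
print(ring_result[0], host_result[0])
```

Both functions leave every card with the same elementwise sum. The difference is the communication pattern: the host variant routes all traffic through one central node, while the ring variant spreads it across neighbour-to-neighbour links.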

Assignees

Zhejiang Lab

Inventors

Classifications

G06N3/0454 · Primary
Physics · mapped topic
G06F8/36
Software reuse · CPC title
G06F9/4881
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
G06F9/545
where tasks reside in different layers, e.g. user- and kernel-space · CPC title
G06F9/5016 · Primary
the resource being the memory · CPC title

Patent family

Related publications grouped by family.

View patent family 79248767

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11714995B2 cover?: Disclosed is a method for distributed type training adaptation and apparatus in a deep learning framework and an AI accelerator card. The method includes the following steps: S1: the deep learning framework supports single-card configuration in a newly added AI accelerator card, and sub-steps thereof are as follows: S11: the deep learning framework supports new hardware; S12: the deep learning …
Who is the assignee on this patent?: Zhejiang Lab

Related patents

Dataflow all-reduce for reconfigurable processor systems

Distributed weight update for backpropagation of a neural network

Accelerating multi-node performance of machine learning workloads

Distributed computing system, and data transmission method and apparatus in distributed computing system

Communication optimizations for distributed machine learning

Multi-gpu deep learning using cpus
