Parallel processing of a Softmax operation by dividing an input vector into a plurality of fragments

US12561114B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12561114-B2
Application number  US-202117335858-A
Country             US
Kind code           B2
Filing date         Jun 1, 2021
Priority date       Jul 10, 2020
Publication date    Feb 24, 2026
Grant date          Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer comprising a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, the code, when executed, configured to access the computer memory to obtain multiple pairs of input values of the input vector, determine a maximum or corrected maximum input value of each pair as a maximum result element, determine and store in a computer memory a maximum or corrected maximum result of each pair of maximum result elements as an approximation to the natural log of the sum of the exponents of the input values and access the computer memory to obtain each input value and apply it to the maximum or corrected maximum result to generate each output value of a Softmax output vector.
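
The pairwise reduction described in the abstract can be illustrated with a small sketch. The abstract does not define "corrected maximum"; the code below assumes it refers to the standard identity log(exp(a) + exp(b)) = max(a, b) + log(1 + exp(-|a - b|)), which may differ from the patent's own correction. The function names `max_star`, `pairwise_reduce`, and `log_softmax` are hypothetical and chosen only for this illustration.

```python
import math
from typing import List, Sequence


def max_star(a: float, b: float, corrected: bool = True) -> float:
    """Maximum of a pair, optionally with a correction term.

    ASSUMPTION: "corrected maximum" is taken to be
    max(a, b) + log(1 + exp(-|a - b|)), which equals log(exp(a) + exp(b)).
    With corrected=False this is the plain maximum.
    """
    m = max(a, b)
    if corrected:
        m += math.log1p(math.exp(-abs(a - b)))
    return m


def pairwise_reduce(values: Sequence[float], corrected: bool = True) -> float:
    """Combine values pair by pair until a single result remains.

    Assumes a non-empty sequence of finite values.
    """
    level = list(values)
    while len(level) > 1:
        nxt = [max_star(level[i], level[i + 1], corrected)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired trailing element forward
            nxt.append(level[-1])
        level = nxt
    return level[0]


def log_softmax(values: Sequence[float], corrected: bool = True) -> List[float]:
    """Subtract the (approximate) log-sum-exp from every input value."""
    lse = pairwise_reduce(values, corrected)
    return [x - lse for x in values]
```

Under this assumption, the corrected variant reproduces the natural log of the sum of exponentials up to rounding, while the uncorrected variant returns the largest input, which is only a lower bound on that quantity and is closest when one input dominates the others.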

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method of executing a neural network by processing an input vector comprising a plurality of input values, the method implemented in a hardware processor comprising a plurality of processing units on a single chip, each processing unit comprising an execution unit and having access to computer memory, wherein each processing unit is associated with and has access to its own computer memory which is not shared by others of the processing units, wherein the computer memory of each processing unit stores code executable by the execution unit of that processing unit, wherein the input vector is divided into a plurality of fragments, wherein there are no dependencies among any of the plurality of fragments, and wherein each processing unit further comprises a respective plurality of registers, the method comprising: storing in each computer memory a respective fragment of the input vector; the execution units operating in parallel wherein each execution unit accesses its own computer memory in a selecting step to obtain input values from its fragment of the input vector; and computing, using the hardware processor, an activation function of the input vector to generate an output vector, the output vector comprising a plurality of output values, wherein the computing the activation function of the input vector to generate the output vector comprises: each execution unit selecting, in the selecting step, from the input values of its fragment a largest value, wherein the largest value is determined by reading successive pairs of input values of its fragment into the respective plurality of registers and recursively determining a maximum value of pairs of values stored in the registers, wherein the largest value is determined to be an approximation of a natural logarithm of a sum of an exponential of each of the input values; sharing the largest value from the input values of each fragment with the other processing units and obtaining a maximum result value based on the largest value of all fragments and storing the maximum result value in the computer memory of each processing unit; the execution units operating in parallel wherein each execution unit accesses its computer memory in a computing step to obtain each input value of its fragment and the maximum result value; and computing each output value for the output vector, wherein computing each output value for the output vector comprises subtracting the maximum result value from each respective input value of the respective fragment to generate a respective output value of the respective fragment based on a natural logarithm of the activation function and combining the respective output values of all the fragments to generate the output vector, wherein the neural network comprises a layer, and wherein the input vector is an input to the layer and the output vector is an output of the layer.

2. The computer-implemented method as claimed in claim 1, wherein computing each output value comprises: exponentiating the each respective input value.

3. The computer-implemented method as claimed in claim 1, which comprises a compute phase and an exchange phase wherein the execution units operate in parallel in the compute phase and wherein the step of sharing is performed in the exchange phase.

4. A computer program embodied on non-transitory computer-readable storage, the program comprising code configured so as when run on a hardware processor comprising a plurality of processing units on a single chip, each processing unit comprising an execution unit and having access to computer memory, wherein each processing unit is associated with and has access to its own computer memory which is not shared by others of the processing units, wherein the computer memory of each processing unit stores the code executable by the execution unit of that processing unit, wherein an input vector is divided into a plurality of fragments, wherein there are no dependencies among any of the plurality of fragments, and wherein each processing unit further comprises a respective plurality of registers, the hardware processor performs a method of executing a neural network by processing the input vector comprising a plurality of input values by performing operations including: storing in each computer memory a respective fragment of the input vector; the execution units operating in parallel wherein each execution unit accesses its own computer memory in a selecting step to obtain input values from its fragment of the input vector; computing an activation function of the input vector to generate an output vector, the output vector comprising one or more output values, wherein the computing the activation function of the input vector to generate the output vector comprises: each execution unit selecting, in the selecting step, from the input values of its fragment a largest value, wherein the largest value is determined by reading successive pairs of input values of its fragment into the respective plurality of registers and recursively determining a maximum value of pairs of values stored in the registers, wherein the largest value is determined to be an approximation of a natural logarithm of a sum of an exponential of each of the input values; sharing the largest value from the input values of each fragment with the other processing units and obtaining a maximum result value based on the largest value of all fragments and storing the maximum result value in the computer memory of each processing unit; the execution units operating in parallel wherein each execution unit accesses its computer memory in a computing step to obtain each input value of its fragment and the maximum result value; and computing each output value for the output vector, the computing comprising subtracting the maximum result value from each respective input value of the respective fragment to generate a respective output value of the respective fragment based on a natural logarithm of the activation function and combining the respective output values of all the fragments to generate the output vector, wherein the neural network comprises a layer, and wherein the input vector is an input to the layer and the output vector is an output of the layer.

5. A computer configured to execute a neural network comprising a layer, the computer comprising: a plurality of processing units, each processing unit comprising an execution unit and having access to computer memory, wherein each processing unit is associated with and has access to its own computer memory which is not shared by others of the processing units, wherein the computer memory stores code executable by the execution unit and input values of an input vector to be processed by the code, wherein the input vector is divided into a plurality of fragments, wherein there are no dependencies among any of the plurality of fragments, and wherein each processing unit further comprises a respective plurality of registers, wherein the computer memory of each processing unit stores a set of input values of the input vector constituting a respective fragment of the plurality of fragments, and wherein the code, when executed, is configured to cause the processing units to operate in parallel wherein each processing unit is configured to: process its fragment by accessing its associated computer memory to obtain multiple pairs of the input values of its fragment of the input vector and reading the multiple pairs into the respective plurality of registers; recursively determine a maximum or a corrected maximum input value of each pair stored in the registers as a maximum result element of its fragment; store in the computer memory a maximum or a corrected maximum result of the maximum result element of the multip
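
The fragment-wise flow recited in claim 1 (select a largest value per fragment, share it, obtain one maximum result value, then subtract it from every input) can be sketched as follows. This is a sequential simulation under stated assumptions, not the patented implementation: the processing units are modeled as list entries rather than hardware with private memories, the exchange phase is modeled as a plain `max` over the per-fragment results, and every function name here is hypothetical.

```python
from typing import List, Sequence


def split_into_fragments(x: Sequence[float], num_units: int) -> List[List[float]]:
    """Divide x into num_units contiguous fragments whose sizes differ by at most 1.

    ASSUMPTION: len(x) >= num_units, so that no fragment is empty.
    """
    base, extra = divmod(len(x), num_units)
    fragments, start = [], 0
    for unit in range(num_units):
        size = base + (1 if unit < extra else 0)
        fragments.append(list(x[start:start + size]))
        start += size
    return fragments


def fragment_max(fragment: Sequence[float]) -> float:
    """Selecting step: take maxima of successive pairs until one value remains."""
    level = list(fragment)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired trailing element forward
            nxt.append(level[-1])
        level = nxt
    return level[0]


def fragmented_log_softmax(x: Sequence[float], num_units: int) -> List[float]:
    """Approximate log-Softmax using the per-fragment maximum scheme."""
    fragments = split_into_fragments(x, num_units)

    # Compute phase: each simulated unit reads only its own fragment.
    local_maxima = [fragment_max(f) for f in fragments]

    # Exchange phase (simulated): the local maxima are shared and every
    # unit ends up with the same maximum result value.
    max_result = max(local_maxima)

    # Compute phase: each unit subtracts the shared value from its inputs.
    out_fragments = [[v - max_result for v in f] for f in fragments]

    # Combine the fragment outputs into one output vector.
    return [v for f in out_fragments for v in f]
```

For example, `fragmented_log_softmax([0.5, -1.0, 3.0, 2.0, 7.0, 1.5, 0.0], 3)` splits the input into fragments of sizes 3, 2, and 2, finds local maxima 3.0, 7.0, and 1.5, and subtracts the shared maximum 7.0 from every element. Because the fragments have no dependencies on one another, the two compute phases could run concurrently on separate units, with the exchange as the only synchronization point.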

Assignees

Graphcore Ltd

Inventors

Classifications

  • G06N3/048 · Primary

    Activation functions · CPC title

  • Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc · CPC title

  • using electronic means · CPC title

  • G06F7/556 · Primary

    Logarithmic or exponential functions · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12561114B2 cover?
A computer comprising a plurality of processing units, each processing unit having an execution unit and access to computer memory which stores code executable by the execution unit and input values of an input vector to be processed by the code, the code, when executed, configured to access the computer memory to obtain multiple pairs of input values of the input vector, determine a maximum or…
Who is the assignee on this patent?
Graphcore Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/048. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).