Apparatus and method for multiply, add/subtract, and accumulate of packed data elements

US11074073B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11074073-B2
Application number  US-201715721225-A
Country             US
Kind code           B2
Filing date         Sep 29, 2017
Priority date       Sep 29, 2017
Publication date    Jul 27, 2021
Grant date          Jul 27, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus and method for performing dual concurrent multiplications, subtraction/addition, and accumulation of packed data elements. For example one embodiment of a processor comprises: a decoder to decode an instruction to generate a decoded instruction; a first source register to store first and second packed data elements; a second source register to store third and fourth packed data elements; execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplier circuitry to multiply the first and third packed data elements to generate a first temporary product and to concurrently multiply the second and fourth packed data elements to generate a second temporary product, the first through fourth packed data elements all being a first width; circuitry to negate the first temporary product to generate a negated first product; adder circuitry to add the first negated product to a first accumulated packed data element from a third source register to generate a first result, the first result being a second width which is at least twice as large as the first width; the adder circuitry to concurrently add the second temporary product to a second accumulated packed data element to generate a second result of the second width; the first and second results to be stored in specified first and second data element positions within a destination register.
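The data flow described in the abstract can be sketched in a few lines of Python. This is an illustrative model only, not the patented circuitry: it assumes 32-bit source elements and 64-bit accumulators (the widths named in the dependent claims), places element 0 in the low bits, and lets overflow wrap, which the abstract does not specify. The names `to_signed`, `unpack`, `pack`, and `mul_negadd_accumulate` are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def unpack(reg, bits, count):
    """Split a register (modeled as a Python int) into signed elements, low element first."""
    return [to_signed(reg >> (i * bits), bits) for i in range(count)]


def pack(elems, bits):
    """Pack signed elements into a register value, low element first."""
    out = 0
    for i, elem in enumerate(elems):
        out |= (elem & ((1 << bits) - 1)) << (i * bits)
    return out


def mul_negadd_accumulate(src1, src2, acc):
    """Model of the operation in the abstract (assumed widths: 32-bit in, 64-bit out).

    src1, src2: 64-bit values, each holding two signed 32-bit elements.
    acc:        128-bit value holding two signed 64-bit accumulated elements.
    Returns the new 128-bit accumulator value.
    """
    a0, a1 = unpack(src1, 32, 2)   # first and second packed data elements
    b0, b1 = unpack(src2, 32, 2)   # third and fourth packed data elements
    c0, c1 = unpack(acc, 64, 2)    # accumulated packed data elements

    p0 = a0 * b0                   # first temporary product
    p1 = a1 * b1                   # second temporary product (concurrent in hardware)

    r0 = to_signed(c0 + (-p0), 64)  # negated first product added to accumulator 0
    r1 = to_signed(c1 + p1, 64)     # second product added to accumulator 1
    return pack([r0, r1], 64)


src1 = pack([3, -4], 32)
src2 = pack([5, 6], 32)
acc = pack([100, 10], 64)
result = unpack(mul_negadd_accumulate(src1, src2, acc), 64, 2)
print(result)  # [85, -14]
```

The low lane computes 100 - (3 × 5) = 85 and the high lane computes 10 + (-4 × 6) = -14, which matches the subtract-in-one-lane, add-in-the-other pattern the abstract describes.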

First claim

Opening claim text (preview).

What is claimed is:

1. A processor comprising: a decoder to decode an instruction that specifies a first register, a second register, and a third register to generate a decoded instruction; the first register to store first and second packed data elements, both of which being signed; the second register to store third and fourth packed data elements, both of which being signed; and execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplying the first and third packed data elements to generate a first temporary product that is signed; concurrently multiplying the second and fourth packed data elements to generate a second temporary product that is signed, the first through fourth packed data elements all being a first width; negating the first temporary product to generate a negated first product based on bit positions of the first and third packed data elements in the first and second registers respectively; adding the first negated product to a first accumulated packed data element from the third register to generate a first result, the first result being a second width which is at least twice as large as the first width; concurrently adding the second temporary product to a second accumulated packed data element to generate a second result of the second width; and storing the first and second results in specified first and second data element positions within the third register.

2. The processor of claim 1 wherein negating the first temporary product comprises inverting all bits of the first temporary product to generate an inverted temporary result and adding a bit to the inverted temporary result to generate the negated first product.

3. A method comprising: decoding an instruction that specifies a first register, a second register, and a third register to generate a decoded instruction; and executing the decoded instruction, wherein the execution comprises: storing first and second packed data elements that are signed in the first register; storing third and fourth packed data elements that are signed in the second register; multiplying the first and third packed data elements to generate a first temporary product that is signed; concurrently multiplying the second and fourth packed data elements to generate a second temporary product that is signed, the first through fourth packed data elements all being a first width; negating the first temporary product to generate a negated first product based on bit positions of the first and third packed data elements in the first and second registers respectively; adding the first negated product to a first accumulated packed data element from the third register to generate a first result, the first result being a second width which is at least twice as large as the first width; concurrently adding the second temporary product to a second accumulated packed data element to generate a second result of the second width; and storing the first and second results in specified first and second data element positions within the third register.

4. The method of claim 3 wherein negating the first temporary product comprises inverting all bits of the first temporary product to generate an inverted temporary result and adding a bit to the inverted temporary result to generate the first negated first product.

5. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of: decoding an instruction that specifies a first register, a second register, and a third register to generate a decoded instruction; and executing the decoded instruction, wherein the execution comprises: storing first and second packed data elements that are signed in the first register; storing third and fourth packed data elements that are signed in the second register; multiplying the first and third packed data elements to generate a first temporary product that is signed; concurrently multiplying the second and fourth packed data elements to generate a second temporary product that is signed, the first through fourth packed data elements all being a first width; negating the first temporary product to generate a negated first product based on bit positions of the first and third packed data elements in the first and second registers respectively; adding the first negated product to a first accumulated packed data element from the third register to generate a first result, the first result being a second width which is at least twice as large as the first width; concurrently adding the second temporary product to a second accumulated packed data element to generate a second result of the second width; and storing the first and second results in specified first and second data element positions within the third register.

6. The non-transitory machine-readable medium of claim 5 wherein negating the first temporary product comprises inverting all bits of the first temporary product to generate an inverted temporary result and adding a bit to the inverted temporary result to generate the negated first product.

7. The processor of claim 1, wherein the first width is 32-bit long and the second width is 64-bit long.

8. The processor of claim 1, wherein the first packed data element is stored in a lower bit position than the second packed data element in the first register, the third packed data element is stored in a lower bit position than the fourth packed data element in the second register, and the first result is stored in a lower bit position than the second result in the third register.

9. The processor of claim 1, wherein storing the first and second results causes a saturation flag to be set.

10. The non-transitory machine-readable medium of claim 5, wherein the first width is 32-bit long and the second width is 64-bit long.

11. The non-transitory machine-readable medium of claim 5, wherein the first packed data element is stored in a lower bit position than the second packed data element in the first register, the third packed data element is stored in a lower bit position than the fourth packed data element in the second register, and the first result is stored in a lower bit position than the second result in the third register.

12. The non-transitory machine-readable medium of claim 5, wherein storing the first and second results causes a saturation flag to be set.
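Two details in the dependent claims lend themselves to a short illustration. The negation step in claims 2, 4, and 6 (invert all bits, then add a bit) reads like ordinary two's-complement negation, and claims 9 and 12 mention a saturation flag. The Python sketch below models both under stated assumptions: a 64-bit product width, and clamping to the signed 64-bit range whenever the flag is raised. The claims only say that a flag is set, so the clamping behavior and the function names `negate_twos_complement` and `saturating_add` are assumptions made for illustration.

```python
def negate_twos_complement(value, bits=64):
    """Negate by inverting every bit and adding one, as claims 2, 4 and 6 describe."""
    mask = (1 << bits) - 1
    inverted = ~value & mask           # invert all bits of the temporary product
    negated = (inverted + 1) & mask    # add one to the inverted result
    # Reinterpret the bit pattern as a signed integer.
    return negated - (1 << bits) if negated >> (bits - 1) else negated


def saturating_add(a, b, bits=64):
    """Add two signed values; clamp on overflow and report a saturation flag.

    Clamping is an assumption: the claims state only that a saturation
    flag is set, not how the stored result is formed.
    """
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    total = a + b
    if total > highest:
        return highest, True
    if total < lowest:
        return lowest, True
    return total, False


print(negate_twos_complement(15))          # -15
print(saturating_add(100, negate_twos_complement(15)))  # (85, False)
print(saturating_add((1 << 63) - 1, 1))    # (9223372036854775807, True)
```

Inverting and incrementing avoids a separate subtractor: the same adder that accumulates the second product can absorb the negated first product.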

Assignees

Intel Corp

Inventors

Classifications

  • Bit or string instructions

  • Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

  • using a mask

  • to perform operations on data operands

  • using a secondary processor, e.g. coprocessor (peripheral processor G06F13/12)

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11074073B2 cover?
An apparatus and method for performing dual concurrent multiplications, subtraction/addition, and accumulation of packed data elements. For example one embodiment of a processor comprises: a decoder to decode an instruction to generate a decoded instruction; a first source register to store first and second packed data elements; a second source register to store third and fourth packed data ele…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/30036. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 27, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).