System and method for coordinated link up handling following switch reset in a high performance computing network

US11262824B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11262824-B2
Application number: US-202016735450-A
Country: US
Kind code: B2
Filing date: Jan 6, 2020
Priority date: Dec 23, 2016
Publication date: Mar 1, 2022
Grant date: Mar 1, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for supporting coordinated link up handling following a switch reset in a high performance computing environment. Systems and methods can ensure that when a switch of a fabric is rebooted, HCA ports connected to that switch will be set in Active state at the same time even though link training times for different ports may vary with up to several seconds.
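
The coordination idea in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the `Port` class, the `training_done_at` field, the `activate_together` helper, and the timing values are all hypothetical names and numbers chosen for illustration. The sketch only shows that ports whose link training finishes at different times are held back and then moved to Active in a single pass.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Port:
    """Hypothetical model of an HCA port attached to the rebooted switch."""
    name: str
    training_done_at: float  # seconds after switch reset (illustrative value)
    state: str = "Down"


def activate_together(ports: List[Port], now: float) -> bool:
    """Set every port Active in one pass, but only once all have trained.

    Returns True if the ports were activated, False if at least one port
    is still in link training at time `now`.
    """
    if all(p.training_done_at <= now for p in ports):
        for p in ports:
            p.state = "Active"
        return True
    return False


# Link training times differ by several seconds across ports.
ports = [Port("hca1", 0.5), Port("hca2", 3.2), Port("hca3", 1.1)]

early = activate_together(ports, now=1.5)   # hca2 has not finished training
states_early = [p.state for p in ports]

late = activate_together(ports, now=3.5)    # all ports have finished training
states_late = [p.state for p in ports]
```

In this sketch no port becomes Active at `now=1.5`, even though two of the three have finished training; all three change state together at `now=3.5`.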

First claim

Opening claim text (preview).

What is claimed is:

1. A system for supporting coordinated link up handling following a switch reset in a high performance computing environment, comprising: one or more microprocessors; a first subnet, the first subnet comprising a plurality of switches, the plurality of switches comprising at least a leaf switch, wherein each of the plurality of switches comprise a plurality of switch ports, a processor, and a memory, a plurality of end nodes, and a subnet manager, the subnet manager running on one of the plurality of switches; wherein a switch of the plurality of switches is reset; wherein, upon the reset of the switch, a subnet management agent of the switch associates the switch with a boot attribute, the boot attribute being accessible by the subnet manager; wherein the subnet manager queries the boot attribute, via a subnet management packet, to determine a status of each of the plurality of ports of the switch; wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that are initialized following the switch reset; and wherein the determined number of the plurality of switch ports of the switch that are initialized following the switch reset is less than the total number of switch ports of the switch.

2. The system of claim 1, wherein the boot attribute comprises a Boolean value indicative of whether all of the plurality of switch ports on the switch are initialized following the switch reset.

3. The system of claim 1, wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that have failed to initialize following the switch reset.

4. The system of claim 1, wherein boot attribute is associated with a configurable timeout period, the timeout period being configurable by a system administrator.

5. The system of claim 4, wherein upon querying the boot attribute, the subnet manager determines an amount of time that has passed since the switch underwent reset; wherein the subnet manager compares the amount of time that has passed since the switch underwent reset to the timeout period; and upon the amount of time that has passed since the switch underwent reset being longer than the timeout period, the subnet manger calls an error and clears the boot attribute.

6. A method for supporting coordinated link up handling following a switch reset in a high performance computing environment, comprising: providing, at one or more computers, including one or more microprocessors, a first subnet, the first subnet comprising: a plurality of switches, the plurality of switches comprising at least a leaf switch, wherein each of the plurality of switches comprise a plurality of switch ports, a processor, and a memory, a plurality of end nodes, and a subnet manager, the subnet manager running on one of the switches, resetting a switch of the plurality of switches; upon the reset of the switch, associating, by a subnet management agent of the switch, the switch with a boot attribute, the boot attribute being accessible by the subnet manager; and querying, by the subnet manager, the boot attribute, via a subnet management packet, to determine a status of each of the plurality of ports of the switch; wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that are initialized following the switch reset; and wherein the determined number of the plurality of switch ports of the switch that are initialized following the switch reset is less than the total number of switch ports of the switch.

7. The method of claim 6, wherein the boot attribute comprises a Boolean value indicative of whether all of the plurality of switch ports on the switch are initialized following the switch reset.

8. The method of claim 6, wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that have failed to initialize following the switch reset.

9. The method of claim 6, wherein boot attribute is associated with a configurable timeout period, the timeout period being configurable by a system administrator.

10. The method of claim 9, wherein upon querying the boot attribute, the subnet manager determines an amount of time that has passed since the switch underwent reset; wherein the subnet manager compares the amount of time that has passed since the switch underwent reset to the timeout period; and upon the amount of time that has passed since the switch underwent reset being longer than the timeout period, the subnet manger calls an error and clears the boot attribute.

11. A non-transitory computer readable storage medium having instructions thereon for supporting coordinated link up handling following a switch reset in a high performance computing environment, which when read and executed by one or more computers cause the one or more computers to perform steps comprising: providing, at one or more computers, including one or more microprocessors, a first subnet, the first subnet comprising: a plurality of switches, the plurality of switches comprising at least a leaf switch, wherein each of the plurality of switches comprise a plurality of switch ports, a processor, and a memory, a plurality of end nodes, and a subnet manager, the subnet manager running on one of the switches, resetting a switch of the plurality of switches; upon the reset of the switch, associating, by a subnet management agent of the switch, the switch with a boot attribute, the boot attribute being accessible by the subnet manager; and querying, by the subnet manager, the boot attribute, via a subnet management packet, to determine a status of each of the plurality of ports of the switch; wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that are initialized following the switch reset; and wherein the determined number of the plurality of switch ports of the switch that are initialized following the switch reset is less than the total number of switch ports of the switch.

12. The non-transitory computer readable storage medium of claim 11, wherein the boot attribute comprises a Boolean value indicative of whether all of the plurality of switch ports on the switch are initialized following the switch reset.

13. The non-transitory computer readable storage medium of claim 11, wherein upon querying the boot attribute, the subnet manager determines a number of the plurality of switch ports of the switch that have failed to initialize following the switch reset.

14. The non-transitory computer readable storage medium of claim 11, wherein boot attribute is associated with a configurable timeout period, the timeout period being configurable by a system administrator; wherein upon querying the boot attribute, the subnet manager determines an amount of time that has passed since the switch underwent reset; wherein the subnet manager compares the amount of time that has passed since the switch underwent reset to the timeout period; and upon the amount of time that has passed since the switch underwent reset being longer than the timeout period, the subnet manger calls an error and clears the boot attribute.
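
The flow recited in claims 1 to 5 (a boot attribute set on reset, a query by the subnet manager, a count of initialized ports, and a timeout check) can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the patented implementation or any real subnet manager API: `BootAttribute`, `Switch`, `PortStatus`, `BootTimeoutError`, and `query_boot_attribute` are hypothetical names, the subnet management packet exchange is replaced by a direct function call, and ports that are not yet initialized are counted as the "failed to initialize" number of claim 3.

```python
from dataclasses import dataclass
from typing import List, NamedTuple, Optional


@dataclass
class BootAttribute:
    """Hypothetical boot attribute kept by the switch's subnet management agent."""
    reset_time: float              # time of the switch reset, in seconds
    port_initialized: List[bool]   # one flag per switch port

    @property
    def all_ports_initialized(self) -> bool:
        # Boolean value in the sense of claim 2.
        return all(self.port_initialized)


@dataclass
class Switch:
    num_ports: int
    boot_attribute: Optional[BootAttribute] = None

    def reset(self, now: float) -> None:
        # On reset, the agent associates the switch with a boot attribute.
        self.boot_attribute = BootAttribute(now, [False] * self.num_ports)


class PortStatus(NamedTuple):
    initialized: int
    not_initialized: int
    all_initialized: bool


class BootTimeoutError(Exception):
    """Raised when the configured timeout has elapsed since the reset."""


def query_boot_attribute(switch: Switch, now: float,
                         timeout_s: float) -> Optional[PortStatus]:
    """Stand-in for the subnet manager's query of the boot attribute."""
    attr = switch.boot_attribute
    if attr is None:
        return None  # no boot attribute present
    if now - attr.reset_time > timeout_s:
        # Claim 5: longer than the timeout, so report an error and clear.
        switch.boot_attribute = None
        raise BootTimeoutError("switch ports did not initialize in time")
    initialized = sum(attr.port_initialized)
    return PortStatus(initialized, switch.num_ports - initialized,
                      attr.all_ports_initialized)


sw = Switch(num_ports=4)
sw.reset(now=0.0)
sw.boot_attribute.port_initialized[0] = True
sw.boot_attribute.port_initialized[1] = True

# Two of four ports are up: fewer than the total, as in claim 1.
status = query_boot_attribute(sw, now=2.0, timeout_s=10.0)

try:
    query_boot_attribute(sw, now=11.0, timeout_s=10.0)
    timed_out = False
except BootTimeoutError:
    timed_out = True
```

The timeout is passed as a parameter here to mirror the administrator-configurable period of claim 4; where that value would actually be stored is not specified on this page.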

Assignees

Inventors

Classifications

  • Hypervisor-specific management and integration aspects · CPC title

  • Distribution of virtual machine instances; Migration and load balancing · CPC title

  • Network integration; Enabling network access in virtual machine instances · CPC title

  • G06F1/24 · Primary

    Resetting means · CPC title

  • I/O management, e.g. providing access to device drivers or storage · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11262824B2 cover?
Systems and methods for supporting coordinated link up handling following a switch reset in a high performance computing environment. Systems and methods can ensure that when a switch of a fabric is rebooted, HCA ports connected to that switch will be set in Active state at the same time even though link training times for different ports may vary with up to several seconds.
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F9/45558. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 1, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).