Minimizing impact of first failure data capture on computing system using recovery process boost
US-2022405163-A1 · Dec 22, 2022 · US
US12572405B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12572405-B2 |
| Application number | US-202418645383-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 25, 2024 |
| Priority date | Apr 25, 2024 |
| Publication date | Mar 10, 2026 |
| Grant date | Mar 10, 2026 |
Official abstract text for this publication.
Providing issue resolution in a cluster system by monitoring system operation to detect occurrence of a system error, and automatically generating, upon detection of the error condition, a core file for a user node. The core file captures a current memory state of a respective node, where the current memory state comprises system statistics, system information, and logs. An intelligent core debugger extracts information from a core file to generate a core file report that is sent to a vendor for a quick determination of whether sufficient information is in the report to allow the vendor to recommend a fix, or whether further information is required, including the core file itself, if necessary. This prevents the need to send an entire core file to a vendor in every instance of a system fault.
Opening claim text (preview).
What is claimed is:

1. A method of facilitating efficient debugging of a fault in a node operated by a user and executing containerized applications in a cluster system provided by a vendor, comprising: detecting generation of a core file upon occurrence of the fault; first verifying that the core file is analyzed based on relevancy of the core file to the fault and a set of defined rules; second verifying that the node is a suitable debugging environment for analyzing the core file by determining if the node has sufficient processor bandwidth to process the core file, and has sufficient memory to decompress the core file without preventing another core file from being saved during the analyzing step; extracting relevant information from the core file to form a core file report; transmitting the core file report to the vendor, the core file report facilitating analysis by the vendor to recommend to the user a solution to the fault, or request from the user further information or the core file itself; compressing the core file after generation; decompressing the core file at the node; running one or more debugging tools on the decompressed core file; and recompressing the decompressed core file.

2. The method of claim 1 wherein the suitable debugging environment comprises an original binary program that generated the core file, and any relevant libraries used by the program.

3. The method of claim 1 wherein the step of extracting further comprises: monitoring activity in a folder holding the core file; and processing open streams of the monitored activity to identify one or more processing threads causing the fault.

4. The method of claim 3 wherein the activity monitoring step is performed by a cron job executed on a periodic basis, and wherein the cron job wakes up periodically to scan a directory of the folder for uncompressed core files to identify the selected core file for compression prior to analysis at the node.

5. The method of claim 4 wherein the cron job is invoked by an autotriage process executed at the node, the method further comprising: listing by the autotriage process, all thread back traces in the core file, wherein a back trace consists of a function call history showing a chain of function calls that existed at the time the core file was generated; and transmitting the core file report from the node to the vendor as telemetry data.

6. The method of claim 1 wherein the defined rules comprise: analyzing the core file if a most recently generated core file report is older than a defined minimum age; analyzing only a most recent core file of a supported core file type; analyzing the core file only if the core file does not have an existing core file report; and analyzing the core file only if the core file is newer than the most recently generated core file report.

7. The method of claim 1 wherein the cluster system comprises a Santorini filesystem network processing containerized data utilizing a Kubernetes-based framework, and further comprises part of a deduplication backup system performing backup and restore operations for the plurality of nodes, and further wherein each node of the cluster executes the applications from respective pods in a corresponding cluster, and further wherein the containerized applications comprise at least one of a Data Domain container running deduplication and compression processes, a cloud-native data protection manager, and a scalable object storage manager.

8. The method of claim 7 wherein the core file is automatically generated upon a crash of an application or node in the cluster system, and each core file comprises a current memory state of a respective node, the current memory state comprising system statistics, system information, and logs for each node, and further wherein: the system statistics comprise performance and activity data for applications executed by the nodes, including read/write latencies, read/write throughputs, replication throughput, and garbage collection performance; the system information comprises total storage capacity, currently utilized storage capacity, and remaining storage capacity; and the logs comprise information related to at least one of: component availability state changes, component failures and errors, configuration changes, changes to source code in production, or configuration changes in a production system.

9. A computer program product, comprising a non-transitory computer-readable medium having a computer-readable program code embodied therein, the computer-readable program code, when executed by one or more processors, performs a method of facilitating efficient debugging of a fault in a node operated by a user and executing containerized applications in a cluster system provided by a vendor, comprising: detecting generation of a core file upon occurrence of the fault; first verifying that the core file is analyzed based on relevancy of the core file to the fault and a set of defined rules; second verifying that the node is a suitable debugging environment for analyzing the core file by determining if the node has sufficient processor bandwidth to process the core file, and has sufficient memory to decompress the core file without preventing another core file from being saved during the analyzing step; extracting relevant information from the core file to form a core file report; transmitting the core file report to the vendor, the core file report facilitating analysis by the vendor to recommend to the user a solution to the fault, or request from the user further information or the core file itself; compressing the core file after generation; decompressing the core file at the node; running one or more debugging tools on the decompressed core file; and recompressing the decompressed core file.

10. The computer program product of claim 9 wherein the cluster network comprises a Santorini filesystem network processing containerized data utilizing a Kubernetes-based framework, and further comprises part of a deduplication backup system performing backup and restore operations for the plurality of nodes, and further wherein each node of the cluster executes the applications from respective pods in a corresponding cluster, and further wherein the containerized applications comprise at least one of a Data Domain container running deduplication and compression processes, a cloud-native data protection manager, and a scalable object storage manager.
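The four "defined rules" recited in claim 6 can be read as a filter that picks at most one core file to analyze. The following is a minimal illustrative sketch of one such reading, not the patented implementation. The `CoreFile` record, the `select_core_for_analysis` function, its parameters, and the type label `"fs-daemon"` are all hypothetical names introduced here. The sketch also assumes the four rules are applied together, and that having no prior report satisfies the two rules that refer to the most recent report.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class CoreFile:
    """Hypothetical record describing one core file found on the node."""
    name: str
    core_type: str     # illustrative label for the kind of core file
    created_at: float  # creation time, seconds since the epoch
    has_report: bool   # True if a core file report already exists for it


def select_core_for_analysis(
    cores: Sequence[CoreFile],
    supported_types: Set[str],
    last_report_at: Optional[float],
    min_report_age_s: float,
    now: float,
) -> Optional[CoreFile]:
    """Return the core file to analyze, or None if the rules reject all of them."""
    # Rule 1: the most recent report must be older than the minimum age.
    # Assumption: no earlier report at all also passes this rule.
    if last_report_at is not None and now - last_report_at < min_report_age_s:
        return None

    # Rule 2: consider only the most recent core file of a supported type.
    supported = [c for c in cores if c.core_type in supported_types]
    if not supported:
        return None
    newest = max(supported, key=lambda c: c.created_at)

    # Rule 3: skip a core file that already has a report.
    if newest.has_report:
        return None

    # Rule 4: the core file must be newer than the most recent report.
    if last_report_at is not None and newest.created_at <= last_report_at:
        return None

    return newest


cores = [
    CoreFile("a.core", "fs-daemon", 100.0, False),
    CoreFile("b.core", "fs-daemon", 200.0, False),
    CoreFile("c.core", "unsupported", 300.0, False),
]
chosen = select_core_for_analysis(cores, {"fs-daemon"}, 150.0, 60.0, 1000.0)
print(chosen.name if chosen else None)
```

Checking the report age first keeps the common "nothing to do" case cheap, since a periodic scan such as the cron job of claim 4 would often run while the last report is still fresh.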
in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title
Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title
Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title
Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title
Dumping, i.e. gathering error/state information after a fault for later diagnosis · CPC title