Distributed pattern discovery
US-9830451-B2 · Nov 28, 2017 · US
US10733149B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10733149-B2 |
| Application number | US-201815979512-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 15, 2018 |
| Priority date | May 18, 2017 |
| Publication date | Aug 4, 2020 |
| Grant date | Aug 4, 2020 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Systems and methods for data reduction including organizing data of an event stream into a file access table concurrently with receiving the event stream, the data including independent features and dependent features. A frequent pattern tree (FP-Tree) is built including nodes corresponding to the dependent features according to a frequency of occurrence of the dependent features relative to the independent features. Each single path in the FP-Tree is merged into a special node corresponding to segments of dependent features to produce a reduced FP-Tree. All path combinations in the reduced FP-Tree are identified. A compressible file access template (CFAT) is generated corresponding to each of the path combinations. The data of the event stream is compressed with the CFATs to reduce the dependent features to special events representing the dependent features.
Opening claim text (preview).
What is claimed is: 1. A method for reducing data for storage in a data storage system, the method comprising: organizing data of an event stream into a file access table concurrently with receiving the event stream, the data including independent features and dependent features; building a frequent pattern tree (FP-Tree) including nodes corresponding to the dependent features according to a frequency of occurrence of the dependent features relative to the independent features; merging each single path in the FP-Tree into a special node corresponding to segments of dependent features to produce a reduced FP-Tree, the merging comprising weakly dominated path (WDP) merging, wherein the merging is determined to be WDP merging by (1−σ)<(p1·counter)/(p2·counter)<(1+σ), where p1 represents a first node, p2 represents a second node, σ is a deviation, and counter represents a counter of a node; identifying all path combinations in the reduced FP-Tree; generating a compressible file access template (CFAT) corresponding to each of the path combinations; and compressing the data of the event stream with the CFATs to reduce the dependent features to special events representing the dependent features; and selecting segments of the dependent features according to a data reduction score of each of the segments based on a size of the segment and a frequency of occurrence of the segment, the data reduction score being determined by score=t·size×t·freq−t·size−t·freq, where score represents the data reduction score, t represents the segment, size represents the size of the segment, and freq represents the frequency of occurrence of the segment. 2. The method of claim 1 , wherein the single paths include sets of nodes of the FP-Tree that have only one parent node and only one child node. 3. The method of claim 1 , wherein the frequency of occurrence of the segment is determined according to a counter of a node corresponding to a least frequently occurring dependent feature of the corresponding single path. 4. The method of claim 1 , further including: comparing each segment to every other segment of a corresponding single path to determine an intersection with another segment; and removing an intersecting segment having a lower score than a segment with which the intersecting segment intersects. 5. The method of claim 1 , further including selecting path combinations of the reduced FP-Tree according to a data reduction score of each of the path combinations based on a size of the path combination and a frequency of occurrence of the path combination. 6. The method of claim 5 , wherein the frequency of occurrence of the path combination is determined according to a counter of a node or special node corresponding to a least frequently occurring node or special node in the path combination. 7. The method of claim 5 , further including: comparing each path combination to every other path combination to determine an overlap with another path combination; and removing an overlapping path combination having a lower score than a path combination with which the overlapping path combination overlaps. 8. The method of claim 1 , further including: generating a finite state automaton by ordering the dependent features in each CFAT; converting the ordered dependent features of each CFAT into a string; and using the finite state automaton strings to match CFATs to dependent feature sequences corresponding to an independent feature in the event stream. 9. The method of claim 1 , wherein the independent features include processes performed by computers in a network, and the dependent features include files accessed in initial stages of the processes. 10. A method for reducing data for storage in a data storage system, the method comprising: collecting data in an event stream from a network of computers, the data including each process run by each computer and files accessed by each process in initial stages of each of processes; organizing the data into a file access table concurrently with receiving the event stream; building a frequent pattern tree (FP-Tree) including nodes corresponding to the files according to a frequency of occurrence of the files relative to the processes; merging each single path in the FP-Tree into a special node corresponding to segments of files to produce a reduced FP-Tree, the merging comprising weakly dominated path (WDP) merging, wherein the merging is determined to be WDP merging by (1−σ)<(p1·counter)/(p2·counter)<(1+σ), where p1 represents a first node, p2 represents a second node, σ is a deviation, and counter represents a counter of a node; identifying all path combinations in the reduced FP-Tree; generating a compressible file access template (CFAT) corresponding to each of the path combinations; compressing the data of the event stream with the CFATs to reduce the files to special events representing the files; selecting segments of the dependent features according to a data reduction score of each of the segments based on a size of the segment and a frequency of occurrence of the segment, the data reduction score being determined by score=t·size×t·freq−t·size−t·freq, where score represents the data reduction score, t represents the segment, size represents the size of the segment, and freq represents the frequency of occurrence of the segment; and analyzing the compressed data with a pattern analysis system. 11. The method of claim 10 , wherein the single paths include sets of nodes of the FP-Tree that have only one parent node and only one child node. 12. The method of claim 10 , further including selecting segments of the files according to a data reduction score of each of the segments based on a size of the segment and a frequency of occurrence of the segment. 13. The method of claim 12 , wherein the frequency of occurrence of the segment is determined according to a counter of a node corresponding to a least frequently occurring file of the corresponding single path. 14. The method of claim 12 , further including: comparing each segment to every other segment of a corresponding single path to determine an intersection with another segment; and removing an intersecting segment having a lower score than a segment with which the intersecting segment intersects. 15. The method of claim 10 , further including selecting path combinations of the reduced FP-Tree according to a data reduction score of each of the path combinations based on a size of the path combination and a frequency of occurrence of the path combination. 16. The method of claim 15 , wherein the frequency of occurrence of the path combination is determined according to a counter of a node or special node corresponding to a least frequently occurring node or special node in the path combination. 17. The method of claim 15 , further including: comparing each path combination to every other path combination to determine an overlap with another path combination; and removing an overlapping path combination having a lower score than a path combination with which the overlapping path combination overlaps. 18. The method of claim 10 , further including: generating a finite state automaton by ordering the files in each CFAT; converting the ordered files of each CFAT into a string; and using the finite state automaton strings to match CFATs to file sequences correspond to a process in the event stream. 19. The method of claim 10 , wherein building the FP-Tree further includes pruning infrequently occurring paths of nodes from the FP-Tree to
to a system of files or objects, e.g. local or distributed file system or database · CPC title
using compression, e.g. sparse files · CPC title
Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram · CPC title
Trees · CPC title
Clearing memory, e.g. to prevent the data from being stolen · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.