Hierarchical index based compression

US9355111B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9355111-B2
Application numberUS-201414265806-A
CountryUS
Kind codeB2
Filing dateApr 30, 2014
Priority dateApr 30, 2014
Publication dateMay 31, 2016
Grant dateMay 31, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Computer-readable media, systems, and methods for hierarchical index based compression are described. In embodiments, a hierarchical data log or key-value pair based data log, such as a JSON log, is received and a tree-structured index (index tree) is recursively constructed. In one embodiment, the log comprises search-engine user interaction information. Structural information of the log is preserved by the index tree structure; for example, each node of the log has a corresponding index-tree node. Frequently repeating keys, values, and correlated key-value pairs may be stored in the index-tree node, which may be indexed using multiple levels of detail including a raw-string level for raw string representations of the node, a first level for indexing keys and common values, and a second level for indexing correlated key-value pairs. The index tree may be used to compress rows of the data log and also used to decompress and restore the log.

First claim

Opening claim text (preview).

The invention claimed is: 1. One or more computer-readable storage media having instructions embodied thereon that, when executed, perform a method for building an index tree for compressing a logfile, the method comprising: receiving a logfile comprising one or more nodes, each node having one or more keys, values, and key-value pairs; for a first node, determining whether to index a raw string representation of the node; upon determining to index the raw string, storing the raw string representation of the node in a raw-string level index; upon determining not to index the raw string: (1) for each key in the node, storing the key in a first level index; (2) for each key-value pair in the node, determining that the key-value pair is correlated; and (3) upon determining that the key-value pair is correlated, storing the key-value pair in a second level index; and creating an index tree corresponding to the logfile from the raw string, first, and second level indexes of the first node. 2. The media of claim 1 , wherein upon determining not to index the raw string further comprises: determining common values of the node by identifying one or more values of the node that repeat in the logfile; and storing the common values in the first level index. 3. The media of claim 1 , wherein determining whether to index a raw string representation of the node includes using a threshold that is set based on a maximum size of the raw-string level index. 4. The media of claim 1 , wherein determining whether the key-value pair is correlated comprises determining a plurality of occurrences of the key-value pair in the logfile. 5. The media of claim 1 , wherein determining whether the key-value pair is correlated comprises determining, in the logfile, that the number of different values that are paired with the key of the key-value pair satisfies a threshold. 6. The media of claim 1 further comprising compressing the first node of the logfile with the index tree. 7. The media of claim 6 , wherein compressing the first node of the logfile comprises: determining whether the first node is indexed as the raw string; upon determining that the first node is indexed as the raw string, replacing the node with a raw-string index entry, in the raw-string level index, that corresponds to the raw string of the node; and upon determining that the first node is not indexed as the raw string: (1) for each key-value pair of the first node, determining whether a candidate key-value pair is indexed in the second level index; (2) upon determining that the candidate key-value pair is indexed in the second level index, replacing the candidate key-value pair with a second-level index entry, in the second level index, that corresponds to the candidate key-value pair; and (3) upon determining that the candidate key-value pair is not indexed in the second level index, replacing the candidate key of the candidate key-value pair with a first-level key index entry, in the first level index, that corresponds to the candidate key. 8. The media of claim 7 , wherein upon determining that the candidate key-value pair is not indexed in the second level index further comprises: determining that the candidate value of the candidate key-value pair is indexed in the first level index; and replacing the candidate value with a first-level value index entry, in the first level index, that corresponds to the candidate value. 9. The media of claim 1 , wherein the logfile comprises a JSON log. 10. The media of claim 1 , wherein the logfile includes data representing search-engine user interaction information from a plurality of users. 11. The media of claim 1 , wherein the method is recursively applied for each node of the one or more nodes of the logfile, and wherein the index tree is further created from the raw string, first, and second level indexes of each node. 12. One or more computer-readable storage media having instructions embodied thereon that, when executed, perform a method for compressing a logfile, the method comprising: receiving a logfile including one or more logfile nodes, each logfile node having one or more keys, values, and key-value pairs; receiving an index tree corresponding to the logfile, the index tree including one or more index tree-nodes corresponding to the one or more logfile nodes, and further comprising a raw-string level index, a first level index, and a second level index corresponding to each index tree-node; for a first logfile node corresponding to a first index tree-node, determining whether the first logfile node is indexed as a raw string; upon determining that the first logfile node is indexed as a raw string, replacing the logfile node with a raw-string index entry, in the raw-string level index, that corresponds to a raw string representation of the logfile node; upon determining that the first logfile node is not indexed as the raw string: (1) for each key-value pair of the first logfile node, determining whether the key-value pair is indexed in the second level index; (2) upon determining that the key-value pair is indexed in the second level index, replacing the key-value pair with a second-level index entry, in the second level index, that corresponds to the key-value pair; and (3) upon determining that the key-value pair is not indexed in the second level index, replacing the key of the key-value pair with a first-level key index entry, in the first level index, that corresponds to the key. 13. The media of claim 12 , wherein upon determining that the key-value pair is not indexed in the second level index further comprises: determining that the value of the key-value pair is indexed in the first level index; and replacing the value with a first-level value index entry, in the first level index, that corresponds to the value. 14. The media of claim 13 , wherein the method is recursively applied for each node of the one or more logfile nodes. 15. The media of claim 12 , wherein the logfile comprises a JSON log. 16. The media of claim 11 , wherein the logfile includes data representing search-engine user interaction information from a plurality of users. 17. One or more computer-readable storage media having instructions embodied thereon that, when executed, perform a method for compressing a logfile using a tree-structured index corresponding to the logfile, the method comprising: receiving a logfile comprising one or more nodes, each node having a one or more keys, values, and key-value pairs; for each node, recursively: (1) determining whether to index a raw string representation of the node; (2) upon determining to index the raw string, storing the raw string representation of the node in a raw-string level dictionary; (3) upon determining not to index the raw string: a) for each key in the node, storing the key in a first-level key dictionary; b) determining common values of the node by identifying one or more values of the node that repeat in the logfile; c) storing the common values in the first-level value dictionary; d) for each key-value pair in the node, determining that the key-value pair is correlated; and e) upon determining that the key-value pair is correlated, storing the key-value pair in a second-level dictionary; creating an index tree corresponding to the logfile, from the raw-string level, first-level value, first-level key, and second-level dictionaries of each node; and for each node of the logfile, compressing the node with the index tree. 18. The media of claim 17 , wherein compressing

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9355111B2 cover?
Computer-readable media, systems, and methods for hierarchical index based compression are described. In embodiments, a hierarchical data log or key-value pair based data log, such as a JSON log, is received and a tree-structured index (index tree) is recursively constructed. In one embodiment, the log comprises search-engine user interaction information. Structural information of the log is pr…
Who is the assignee on this patent?
Microsoft Corp, Microsoft Technology Licensing Llc
What technology area does this patent fall under?
Primary CPC classification G06F17/30144. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue May 31 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).