Searchable, streaming text compression and decompression using a dictionary

US10498355B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10498355-B2
Application numberUS-201514979727-A
CountryUS
Kind codeB2
Filing dateDec 28, 2015
Priority dateJan 4, 2015
Publication dateDec 3, 2019
Grant dateDec 3, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides methods, computer program products and apparatuses for text compression and decompression wherein a desired compression ratio may be obtained, and the compressed content per se is still in a searchable text form, thereby providing a possibility for searching without decompression and significantly saving storage space and enhancing search efficiency, and in turn, reducing the total cost ownership TCO and providing a better user experience.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for compressing and searching text data, the method comprising: building a text dictionary including a plurality of key value pairs, each of which includes a text compression value and a corresponding compressible text item, wherein the text compression value has a length shorter than the compressible text item, wherein building the text dictionary includes: searching a set of sample text data previously generated by a particular application for text patterns that exceed a predetermined threshold length (MinNum) and that appear in the set of sample text data at least a predetermined threshold number of times, yielding a set of eligible text patterns; assigning a weight to each text pattern in the set of eligible text patterns by multiplying a number of times (f) that that text pattern appears in the set of sample text data by a difference between a length of that text pattern (len) and the predetermined threshold length (MinNum); selecting, as the compressible text items included within the text dictionary, the text patterns of the set of eligible text patterns that have the N highest assigned weights, wherein N represents a maximum number of key value pairs allowed in the text dictionary; and assigning version information to the text dictionary; receiving a text data stream at a compression engine from the particular application; searching for a compressible text item in the text data stream based on the text dictionary; creating a compression log by replacing the compressible text item searched in the text data stream with a corresponding text compression value, so as to compress the text data stream; labelling the compression log with compression version information identical to version information of the text dictionary; after creating the compression log, receiving an uncompressed query term from a user at the compression engine; compressing, by the compression engine, the uncompressed query term with reference to the text dictionary to yield a compressed query value smaller than the uncompressed query term; searching for the compressed query value within the compression log without performing decompression on the compression log, yielding a search result in compressed form; decompressing the search result with reference to the text dictionary to yield an uncompressed search query result; and returning the uncompressed search query result to the user, the uncompressed search query result including an instance of the uncompressed query term within the text data stream. 2. The method according to claim 1 , further comprising adding a new compressible text item to the text dictionary if the new compressible text item appears in the text data stream more than a threshold number of times. 3. The method according to claim 2 , wherein the new compressible text item further comprises one or more of a statement appearing more than a threshold number of times in the text data, a pattern inherent to a programming language used, and a fixed format for a text type. 4. The method according to claim 1 , wherein the text dictionary has a predetermined size, and wherein a compressible text item in the text dictionary is determined based on a weight of the text item which is determined at least based on a length of the text item and a threshold value. 5. The method according to claim 1 , wherein the text data stream is a log data in a form of text stream. 6. The method of claim 1 wherein the maximum number of key value pairs allowed in the text dictionary is 65535. 7. The method of claim 1 wherein the predetermined threshold length (MinNum) is three. 8. A non-transitory computer program product for compressing and searching text data, the computer program product comprising a non-transitory medium having computer executable instructions stored thereon configured to cause a computer to: build a text dictionary including a plurality of key value pairs, each of which includes a text compression value and a corresponding compressible text item, wherein the text compression value has a length shorter than the compressible text item, wherein building the text dictionary includes: searching a set of sample text data previously generated by a particular application for text patterns that exceed a predetermined threshold length (MinNum) and that appear in the set of sample text data at least a predetermined threshold number of times, yielding a set of eligible text patterns; assigning a weight to each text pattern in the set of eligible text patterns by multiplying a number of times (f) that that text pattern appears in the set of sample text data by a difference between a length of that text pattern (len) and the predetermined threshold length (MinNum); selecting, as the compressible text items included within the text dictionary, the text patterns of the set of eligible text patterns that have the N highest assigned weights, wherein N represents a maximum number of key value pairs allowed in the text dictionary; and assigning version information to the text dictionary; receive a text data stream at a compression engine from the particular application; search for a compressible text item in the text data stream based on the text dictionary; create a compression log by replacing the compressible text item searched in the text data stream with the corresponding text compression value, so as to compress the text data; labelling the compression log with compression version information identical to version information of the text dictionary; after creating the compression log, receive an uncompressed query term from a user at the compression engine; compress, by the compression engine, the uncompressed query term with reference to the text dictionary to yield a compressed query value smaller than the uncompressed query term; search for the compressed query value within the compression log without performing decompression on the compression log, yielding a search result in compressed form; decompress the search result with reference to the text dictionary to yield an uncompressed search query result; and return the uncompressed search query result to the user, the uncompressed search query result including an instance of the uncompressed query term within the text data stream. 9. The computer program product of claim 8 wherein the maximum number of key value pairs allowed in the text dictionary is 65535. 10. The computer program product of claim 8 wherein the predetermined threshold length (MinNum) is three.

Assignees

Inventors

Classifications

  • G06F40/126Primary

    Character encoding · CPC title

  • Decoder aspects · CPC title

  • using adaptive string matching, e.g. the Lempel-Ziv method · CPC title

  • Compression (speech analysis-synthesis for redundancy reduction G10L19/00; for image communication H04N); Expansion; Suppression of unnecessary data, e.g. redundancy reduction · CPC title

  • H03M7/3088Primary

    employing the use of a dictionary, e.g. LZ78 · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10498355B2 cover?
The present disclosure provides methods, computer program products and apparatuses for text compression and decompression wherein a desired compression ratio may be obtained, and the compressed content per se is still in a searchable text form, thereby providing a possibility for searching without decompression and significantly saving storage space and enhancing search efficiency, and in turn,…
Who is the assignee on this patent?
Emc Corp, Emc Ip Holding Co Llc
What technology area does this patent fall under?
Primary CPC classification G06F40/126. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 03 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).