Non-transitory computer-readable recording medium, encoded data searching method, and encoded data searching apparatus
US-2018101580-A1 · Apr 12, 2018 · US
US10498355B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10498355-B2 |
| Application number | US-201514979727-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 28, 2015 |
| Priority date | Jan 4, 2015 |
| Publication date | Dec 3, 2019 |
| Grant date | Dec 3, 2019 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
The present disclosure provides methods, computer program products and apparatuses for text compression and decompression wherein a desired compression ratio may be obtained, and the compressed content per se is still in a searchable text form, thereby providing a possibility for searching without decompression and significantly saving storage space and enhancing search efficiency, and in turn, reducing the total cost ownership TCO and providing a better user experience.
Opening claim text (preview).
What is claimed is: 1. A method for compressing and searching text data, the method comprising: building a text dictionary including a plurality of key value pairs, each of which includes a text compression value and a corresponding compressible text item, wherein the text compression value has a length shorter than the compressible text item, wherein building the text dictionary includes: searching a set of sample text data previously generated by a particular application for text patterns that exceed a predetermined threshold length (MinNum) and that appear in the set of sample text data at least a predetermined threshold number of times, yielding a set of eligible text patterns; assigning a weight to each text pattern in the set of eligible text patterns by multiplying a number of times (f) that that text pattern appears in the set of sample text data by a difference between a length of that text pattern (len) and the predetermined threshold length (MinNum); selecting, as the compressible text items included within the text dictionary, the text patterns of the set of eligible text patterns that have the N highest assigned weights, wherein N represents a maximum number of key value pairs allowed in the text dictionary; and assigning version information to the text dictionary; receiving a text data stream at a compression engine from the particular application; searching for a compressible text item in the text data stream based on the text dictionary; creating a compression log by replacing the compressible text item searched in the text data stream with a corresponding text compression value, so as to compress the text data stream; labelling the compression log with compression version information identical to version information of the text dictionary; after creating the compression log, receiving an uncompressed query term from a user at the compression engine; compressing, by the compression engine, the uncompressed query term with reference to the text dictionary to yield a compressed query value smaller than the uncompressed query term; searching for the compressed query value within the compression log without performing decompression on the compression log, yielding a search result in compressed form; decompressing the search result with reference to the text dictionary to yield an uncompressed search query result; and returning the uncompressed search query result to the user, the uncompressed search query result including an instance of the uncompressed query term within the text data stream. 2. The method according to claim 1 , further comprising adding a new compressible text item to the text dictionary if the new compressible text item appears in the text data stream more than a threshold number of times. 3. The method according to claim 2 , wherein the new compressible text item further comprises one or more of a statement appearing more than a threshold number of times in the text data, a pattern inherent to a programming language used, and a fixed format for a text type. 4. The method according to claim 1 , wherein the text dictionary has a predetermined size, and wherein a compressible text item in the text dictionary is determined based on a weight of the text item which is determined at least based on a length of the text item and a threshold value. 5. The method according to claim 1 , wherein the text data stream is a log data in a form of text stream. 6. The method of claim 1 wherein the maximum number of key value pairs allowed in the text dictionary is 65535. 7. The method of claim 1 wherein the predetermined threshold length (MinNum) is three. 8. A non-transitory computer program product for compressing and searching text data, the computer program product comprising a non-transitory medium having computer executable instructions stored thereon configured to cause a computer to: build a text dictionary including a plurality of key value pairs, each of which includes a text compression value and a corresponding compressible text item, wherein the text compression value has a length shorter than the compressible text item, wherein building the text dictionary includes: searching a set of sample text data previously generated by a particular application for text patterns that exceed a predetermined threshold length (MinNum) and that appear in the set of sample text data at least a predetermined threshold number of times, yielding a set of eligible text patterns; assigning a weight to each text pattern in the set of eligible text patterns by multiplying a number of times (f) that that text pattern appears in the set of sample text data by a difference between a length of that text pattern (len) and the predetermined threshold length (MinNum); selecting, as the compressible text items included within the text dictionary, the text patterns of the set of eligible text patterns that have the N highest assigned weights, wherein N represents a maximum number of key value pairs allowed in the text dictionary; and assigning version information to the text dictionary; receive a text data stream at a compression engine from the particular application; search for a compressible text item in the text data stream based on the text dictionary; create a compression log by replacing the compressible text item searched in the text data stream with the corresponding text compression value, so as to compress the text data; labelling the compression log with compression version information identical to version information of the text dictionary; after creating the compression log, receive an uncompressed query term from a user at the compression engine; compress, by the compression engine, the uncompressed query term with reference to the text dictionary to yield a compressed query value smaller than the uncompressed query term; search for the compressed query value within the compression log without performing decompression on the compression log, yielding a search result in compressed form; decompress the search result with reference to the text dictionary to yield an uncompressed search query result; and return the uncompressed search query result to the user, the uncompressed search query result including an instance of the uncompressed query term within the text data stream. 9. The computer program product of claim 8 wherein the maximum number of key value pairs allowed in the text dictionary is 65535. 10. The computer program product of claim 8 wherein the predetermined threshold length (MinNum) is three.
Character encoding · CPC title
Decoder aspects · CPC title
using adaptive string matching, e.g. the Lempel-Ziv method · CPC title
Compression (speech analysis-synthesis for redundancy reduction G10L19/00; for image communication H04N); Expansion; Suppression of unnecessary data, e.g. redundancy reduction · CPC title
employing the use of a dictionary, e.g. LZ78 · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.