Computerized methods of data compression and analysis
US-11269810-B2 · Mar 8, 2022 · US
US12050557B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12050557-B2 |
| Application number | US-202117532947-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 22, 2021 |
| Priority date | May 19, 2017 |
| Publication date | Jul 30, 2024 |
| Grant date | Jul 30, 2024 |
Official abstract text for this publication.
A computerized system and method of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, the system and method including: (i) automatically identifying a plurality of sequential (also referred to as adjacent) and/or non-sequential (also referred to as non-adjacent) symbol pairs in an input document; (ii) counting the number of appearances of each unique symbol pair; and (iii) producing a compressed document that includes a replacement symbol at each position associated with one of the plurality of symbol pairs, at least one of which corresponds to a non-sequential symbol pair. For each non-sequential pair, the compressed document includes corresponding indicia indicating a distance between locations of the non-sequential symbols of the pair in the input document.
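The three steps in the abstract (identify pairs, count them, replace frequent pairs and record a distance) can be illustrated with a minimal Python sketch. This is not the patented implementation. The function names, the `max_gap` search window, the greedy left-to-right matching, the integer replacement ids, and the `(id, gap)` token shape are all assumptions made for illustration; input symbols are assumed to be strings so that tuple tokens are unambiguous.

```python
from collections import Counter

def count_pairs(symbols, max_gap):
    """Count each unique (first, second) pair whose members lie 1..max_gap apart."""
    counts = Counter()
    for i, first in enumerate(symbols):
        for gap in range(1, max_gap + 1):
            if i + gap < len(symbols):
                counts[(first, symbols[i + gap])] += 1
    return counts

def build_dictionary(counts, threshold):
    """Assign an integer replacement id to every pair whose count exceeds the threshold."""
    frequent = [pair for pair, n in counts.most_common() if n > threshold]
    return {pair: idx for idx, pair in enumerate(frequent)}

def compress(symbols, pair_ids, max_gap):
    """Greedy pass: emit (id, gap) for a dictionary pair, else the plain symbol."""
    out, consumed = [], set()
    for i, first in enumerate(symbols):
        if i in consumed:
            continue  # already emitted as the second member of an earlier pair
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j < len(symbols) and j not in consumed and (first, symbols[j]) in pair_ids:
                # gap == 1 is an adjacent pair; gap > 1 is the distance indicia
                out.append((pair_ids[(first, symbols[j])], gap))
                consumed.add(j)
                break
        else:
            out.append(first)
    return out

def decompress(tokens, pair_ids):
    """Invert compress() by re-inserting each second symbol at its recorded distance."""
    pairs = {idx: pair for pair, idx in pair_ids.items()}
    out, pending = [], {}

    def flush():
        # Fill positions that were consumed as second members of earlier pairs.
        while len(out) in pending:
            out.append(pending.pop(len(out)))

    for token in tokens:
        flush()
        if isinstance(token, tuple):
            idx, gap = token
            first, second = pairs[idx]
            pending[len(out) + gap] = second
            out.append(first)
        else:
            out.append(token)
    flush()
    return out

doc = ["a", "x", "b", "a", "y", "b", "a", "z", "b"]
pair_ids = build_dictionary(count_pairs(doc, max_gap=2), threshold=1)
packed = compress(doc, pair_ids, max_gap=2)
print(packed)  # [(0, 2), 'x', (0, 2), 'y', (0, 2), 'z']
```

In this toy input the pair `("a", "b")` never occurs adjacently, yet it appears three times at distance 2, so an adjacent-only scheme would find nothing to replace. Storing the gap alongside the replacement id is what lets `decompress` restore the original order.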
Opening claim text (preview).
What is claimed is:

1. A computerized method of compressing symbolic information organized into a plurality of documents, each document having a plurality of symbols, the method comprising: (a) generating, by a computer based system, a symbol dictionary based on a first uncompressed document of the plurality of documents; performing, by the computer based system and with the symbol dictionary, a first data compression on the first uncompressed document by at least one of the adjacent pair dictionary method and the non-adjacent pair dictionary method to generate a compressed output document; (b) appending, by the computer based system, a new uncompressed document of the plurality of documents to the compressed output document to generate an appended compressed document; (c) updating, by the computer based system, the symbol dictionary based on the appended compressed document to generate an updated symbol dictionary; (d) identifying, by the computer based system, a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the first uncompressed document; (e) for each unique symbol pair of the plurality of symbol pairs, updating, by the computer based system, a count identifying the number of appearances of the unique symbol pair; and (f) producing, by the computer based system, the compressed output document by causing the compressed output document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold.

2. The method of claim 1, further comprising: performing, by the computer based system and with the updated symbol dictionary, a second data compression on the appended compressed document by at least one of the adjacent pair dictionary method and the non-adjacent pair dictionary method.

3. The method of claim 2, wherein performing the second compression comprises: (a) identifying, by the computer based system, a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the appended compressed document; (b) for each unique symbol pair of the plurality of symbol pairs, updating, by the computer based system, a count identifying the number of appearances of the unique symbol pair; and (c) producing, by the computer based system, a combined compressed document by causing the combined compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold.

4. The method of claim 2, wherein performing the second compression comprises: (a) identifying, by the computer based system, a plurality of symbol pairs, each symbol pair consisting of two sequential or non-sequential symbols in the appended compressed document, one or more symbol pairs consisting of two non-sequential symbols in the appended compressed document; (b) for each unique symbol pair of the plurality of symbol pairs, updating, by the computer based system, a count identifying the number of appearances of the unique symbol pair; and (c) producing, by the computer based system, a combined compressed document by causing the combined compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, including one or more symbol pairs consisting of two non-sequential symbols, (i) a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold, and (ii) for at least those symbol pairs consisting of two non-sequential symbols, indicia indicating a distance between locations of the non-sequential symbols of the pair in the input document.

5. The method of claim 2, wherein the second data compression is only performed on an appended portion of the appended compressed document.

6. The method of claim 1, further comprising: performing, by the computer based system, an analysis of the appended compressed document based on the symbol dictionary to determine whether any new words are present.

7. The method of claim 6, further comprising: adding, by the computer based system, a new word to the symbol dictionary based on determining the presence of new words in the appended compressed document; and updating, by the computer based system, a frequency count of the symbol dictionary in response to adding the new words.

8. The method of claim 7, further comprising: sorting, by the computer based system, the symbol dictionary by order of frequency in response to updating the frequency count.

9. The method of claim 1, wherein performing the first data compression comprises: (a) identifying, by the computer based system, a plurality of symbol pairs, each symbol pair consisting of two sequential or non-sequential symbols in the input document, one or more symbol pairs consisting of two non-sequential symbols in the first uncompressed document; (b) for each unique symbol pair of the plurality of symbol pairs, updating, by the computer based system, a count identifying the number of appearances of the unique symbol pair; and (c) producing, by the computer based system, a compressed document by causing the compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, including one or more symbol pairs consisting of two non-sequential symbols, (i) a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold, and (ii) for at least those symbol pairs consisting of two non-sequential symbols, indicia indicating a distance between locations of the non-sequential symbols of the pair in the input document.

10. A computer system comprising: (a) a processor; and (b) a tangible, non-transitory memory configured to communicate with the processor, the tangible, non-transitory memory having instructions stored thereon that, in response to execution by the processor, cause the processor to perform operations comprising the computerized method steps of claim 1.

11. The computer system of claim 10 that further comprises performing, by the computer system and with the updated symbol dictionary, a second data compression on the appended compressed document by at least one of the adjacent pair dictionary method and the non-adjacent pair dictionary method.

12. The computer system of claim 11 that further comprises: (a) identifying, by the computer system, a plurality of symbol pairs, each symbol pair consisting of two sequential symbols in the appended compressed document; (b) for each unique symbol pair of the plurality of symbol pairs, updating, by the computer based system, a count identifying the number of appearances of the unique symbol pair; and (c) producing, by the computer based system, a combined compressed document by causing the combined compressed document to include, at each position associated with one of the plurality of symbol pairs from the input document, a replacement symbol associated by a compression dictionary with the unique symbol pair matching the one of the plurality of symbol pairs, if the count for the unique symbol pair exceeds a threshold.
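The dictionary maintenance described in claim 1 steps (b) and (c) and in claims 6 through 8 (detect new words in appended material, update frequency counts, sort by frequency) might look roughly like the following Python sketch. It is a simplification under stated assumptions: the symbol dictionary is modeled as a `collections.Counter` of word frequencies, the appended document is scanned as a plain word list and not as part of a compressed stream, and the names `build_symbol_dictionary` and `update_symbol_dictionary` are hypothetical.

```python
from collections import Counter

def build_symbol_dictionary(document):
    """Initial symbol dictionary: each word mapped to its frequency count."""
    return Counter(document)

def update_symbol_dictionary(dictionary, new_document):
    """Fold an appended document into the dictionary.

    Returns the words not previously present (in first-seen order) and the
    full vocabulary ranked by descending frequency, ties broken alphabetically.
    """
    # dict.fromkeys de-duplicates while preserving first-seen order.
    new_words = [w for w in dict.fromkeys(new_document) if w not in dictionary]
    # Counter.update adds the counts of the new words to the existing counts.
    dictionary.update(new_document)
    ranked = [w for w, _ in sorted(dictionary.items(), key=lambda kv: (-kv[1], kv[0]))]
    return new_words, ranked

symbols = build_symbol_dictionary(["a", "b", "a"])
new_words, ranked = update_symbol_dictionary(symbols, ["c", "b", "c", "c"])
print(new_words)  # ['c']
print(ranked)     # ['c', 'a', 'b']
```

Ranking by frequency is useful in a scheme like this because the most common symbols can then be given the shortest codes; whether the patent uses the ordering for that purpose is not stated in the text shown here.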
Information retrieval; Database structures therefor; File system structures therefor · CPC title
Trees, e.g. B+trees · CPC title
Indexing; Web crawling techniques · CPC title
Methods or arrangements to increase the throughput · CPC title
employing the use of a dictionary, e.g. LZ78 · CPC title