US10090857B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10090857-B2 |
| Application number | US-201213492505-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 8, 2012 |
| Priority date | Apr 26, 2010 |
| Publication date | Oct 2, 2018 |
| Grant date | Oct 2, 2018 |
Official abstract text for this publication.
A method of compressing sequence data in a text-based format, the method involving parsing text of the sequence data into a plurality of fields, identifying encoding algorithms that achieve greatest compression gains with respect to the plurality of fields based on collected statistics, and generating a bitstream, compressed from the sequence data, by encoding the sequence data using the identified encoding algorithms.
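The parsing and statistics steps in the abstract can be illustrated with a minimal Python sketch. It assumes plain four-line FASTQ records, and the names `parse_fastq` and `entropy_per_symbol` are hypothetical helpers introduced here, not taken from the patent. The zero-order entropy it computes is only one possible statistic; the patent's actual selection among encoders is not reproduced.

```python
import math
from collections import Counter


def parse_fastq(text):
    """Split FASTQ text into three field streams: title, sequence, quality.

    Assumes well-formed four-line records (an assumption of this sketch).
    """
    lines = [ln for ln in text.splitlines() if ln]
    fields = {"title": [], "sequence": [], "quality": []}
    for i in range(0, len(lines) - 3, 4):
        fields["title"].append(lines[i][1:])    # drop the leading '@'
        fields["sequence"].append(lines[i + 1])
        fields["quality"].append(lines[i + 3])  # lines[i + 2] is the '+' line
    return fields


def entropy_per_symbol(strings):
    """Zero-order entropy in bits per symbol over all strings of one field."""
    counts = Counter("".join(strings))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n"
    for name, values in parse_fastq(sample).items():
        print(name, round(entropy_per_symbol(values), 3))
```

A field with low entropy per symbol (here the quality stream, which is a single repeated character) is a better candidate for an aggressive statistical coder than a field whose symbols are close to uniformly distributed.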
Opening claim text (preview).
What is claimed is:
1. A method of compressing genetic sequencing data in a text-based format, the method, implemented in a processor, comprising the steps of: receiving genetic sequencing data obtained using a high throughput genetic sequencing instrument; parsing information included in text of the genetic sequencing data into a plurality of fields, wherein the information comprises title information, sequence data and quality data and wherein the plurality of fields comprises a title information field, a sequence data field and a quality data field; collecting statistics with respect to a symbol represented by strings that are included in each of the plurality of fields; for each of the plurality of fields, identifying an encoding algorithm that achieves greatest compression gains with respect to the field based on the collected statistics, by determining, for each field of the genetic sequencing data, an optimized encoding algorithm selected from the group consisting of an arithmetic encoding algorithm, a Markov encoding algorithm, and a Huffman encoding algorithm; generating bitstreams, compressed from the genetic sequencing data, by encoding each of the plurality of fields of the genetic sequencing data using the respective identified encoding algorithm; and outputting a unified bitstream by merging the generated bitstreams encoded for each of the plurality of fields.
2. The method of claim 1, wherein the text of the genetic sequencing data includes a title line, and the method comprises parsing the title line information to identify constant fields, variable fields, and delimiters.
3. The method of claim 2, wherein the variable fields are further parsed to identify a numeric variable field and an alphanumeric variable field.
4. The method of claim 3, wherein the optimized encoding algorithm for a field is identified by computing an entropy calculation for the numeric variable field.
5. The method of claim 1, wherein the text-based format is a FASTQ format.
6. The method of claim 1 further comprising, before the parsing the text, determining an origin of the genetic sequencing data in the text-based format, said origin comprising a sequencing system type selected from the group consisting of SoLiD, Illumina, 454 and Helicos.
7. The method of claim 1, wherein, if the genetic sequencing data includes a length field representing a length of a DNA sequence read comprised in the text, the method comprises discarding a value of the field length before collecting the statistics.
8. The method of claim 1, wherein collecting the statistics comprises checking for inconsistencies between title lines included in the text.
9. The method of claim 1, wherein collecting the statistics comprises identifying a quality value (Qmax) with a maximum occurrence in the text.
10. The method of claim 9, wherein a quality stream for each read included in the text is represented as an offset, a quality symbol, and a run length.
11. The method of claim 9, wherein the quality value in a quality stream is represented as offset and <quality symbol, run length>.
12. The method of claim 1, further comprising detecting ambiguous symbols in quality scores included in the text, wherein generating the bitstreams comprises encoding the occurrence once.
13. The method of claim 12, further comprising allocating a lowest quality value to each position corresponding to an ambiguous symbol in the text, and wherein the generating the bitstreams includes using a result of the allocation.
14. The method of claim 1, wherein the bitstreams are generated using lossless compression or near-lossless compression.
15. The method of claim 1, wherein the parsing text of the genetic sequencing data comprises parsing a DNA sequence read included in the text by identifying repeats and non-repeats.
16. A non-transitory computer-readable medium having recorded thereon a program for executing a method of claim 1.
17. An apparatus for compressing genetic sequencing data in a text-based format, the apparatus including a processor and a non-transitory processor-readable medium having processor-executable instructions stored thereon, the processor-executable instructions comprising instructions for: receiving genetic sequencing data obtained using a high throughput genetic sequencing instrument; parsing information included in text of the genetic sequencing data into a plurality of fields, wherein the information includes title information, sequence data and quality data and wherein the plurality of fields includes a title information field, a sequence data field and a quality data field; collecting statistics with respect to one or more symbols represented by strings that are included in each of the plurality of fields; identifying, for each of the plurality of fields, an encoding algorithm that achieves greatest compression gains with respect to the field based on the collected statistics, by determining, for each field of the genetic sequencing data, an optimized encoding algorithm selected from the group consisting of an arithmetic encoding algorithm, a Markov encoding algorithm, and a Huffman encoding algorithm; generating bitstreams, compressed from the genetic sequencing data, by encoding each of the plurality of fields of the genetic sequencing data using the respective identified encoding algorithm; and outputting a unified bitstream by merging the generated bitstreams encoded for each of the plurality of fields.
18. The apparatus of claim 17, wherein the instructions for parsing include instructions for parsing the title information line included in the text so as to identify constant fields, variable fields, and delimiters.
19. The apparatus of claim 18, wherein the instructions for parsing include instructions for parsing the variable fields to identify a numeric variable field and an alphanumeric variable field.
20. The apparatus of claim 19, wherein the instructions for coding include instructions for identifying the optimized encoding algorithm for a field by computing an entropy calculation for the numeric variable field.
21. The apparatus of claim 17, wherein the instructions for collecting statistics includes instructions for collecting the statistics by identifying a quality value (Qmax) with a maximum occurrence in the text.
22. The apparatus of claim 21, wherein a quality stream for each read included in the text is represented as an offset, a quality symbol, and a run length, and wherein the quality value in the quality stream is represented as offset and <quality symbol, run length>.
23. The apparatus of claim 17, wherein the instructions for collecting statistics includes instructions for determining an occurrence of ambiguous symbols in quality scores included in the text, and wherein the instructions for generating include generating the bitstreams by encoding the occurrence once.
24. The apparatus of claim 23, wherein the instructions for collecting statistics includes instructions for allocating a lowest quality value to all positions corresponding to ambiguous bases in the text, and wherein the instructions for generating include instructions for generating the bitstreams by using a result of the allocating.
25. The apparatus of claim 17, wherein the instructions for generating include instructions for generating the bitstreams using lossless compression or near-lo
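The title-line parsing described in claims 2, 3, 18 and 19 (constant fields, variable fields split into numeric and alphanumeric, and delimiters) might be sketched as follows. The delimiter set, the function names `tokenize` and `classify_fields`, and the rule that every title must tokenize to the same number of tokens are all assumptions of this sketch; the claims do not specify them.

```python
import re

# Assumed delimiter set; the patent excerpt does not list the delimiters.
DELIMS = re.compile(r"([:_\s/#-])")


def tokenize(title):
    """Split a title into alternating field and delimiter tokens.

    Because the pattern has a capturing group, re.split keeps the
    delimiters: even positions are fields, odd positions are delimiters.
    """
    return DELIMS.split(title)


def classify_fields(titles):
    """Label each token position across all titles."""
    rows = [tokenize(t) for t in titles]
    if len({len(r) for r in rows}) != 1:
        # Loosely mirrors the consistency check between title lines.
        raise ValueError("title lines do not share one layout")
    layout = []
    for pos, column in enumerate(zip(*rows)):
        values = set(column)
        if pos % 2 == 1:
            layout.append("delimiter")
        elif len(values) == 1:
            layout.append("constant")      # same value in every title
        elif all(v.isdigit() for v in values):
            layout.append("numeric")       # variable, digits only
        else:
            layout.append("alphanumeric")  # variable, mixed characters
    return layout


if __name__ == "__main__":
    titles = [
        "HWUSI-EAS100R:6:73:941:1973#0/1",
        "HWUSI-EAS100R:6:73:942:20a#0/2",
    ]
    print(classify_fields(titles))
```

Constant fields need to be stored only once, while numeric variable fields lend themselves to the entropy calculation mentioned in claims 4 and 20.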
Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code · CPC title
Compression of genetic data · CPC title
Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind · CPC title
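As an illustration of the run-length classification above and of the quality-stream claims (9 to 11, 21 and 22), the following Python sketch finds the most frequent quality symbol and encodes a quality string as (quality symbol, run length) pairs. The function names are hypothetical, and the "offset" component named in the claims is deliberately left out because this excerpt does not define what it is measured from.

```python
from collections import Counter
from itertools import groupby


def most_frequent_quality(quality_lines):
    """Return the quality symbol with the highest count (Qmax in the claims)."""
    counts = Counter("".join(quality_lines))
    return counts.most_common(1)[0][0]


def run_length_encode(quality):
    """Encode one quality string as (symbol, run length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(quality)]


def run_length_decode(pairs):
    """Invert run_length_encode, showing that the representation is lossless."""
    return "".join(sym * n for sym, n in pairs)


if __name__ == "__main__":
    quals = ["IIIIHH#", "III"]
    print(most_frequent_quality(quals))
    for q in quals:
        pairs = run_length_encode(q)
        print(pairs, run_length_decode(pairs) == q)
```

Run-length pairs pay off when quality strings contain long runs of one symbol; for strings that change value at almost every position, the pairs can be larger than the original text.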