Who is the assignee on this patent?

Bhola Vishal, Bopardikar Shyamsunder Ajit, Narayanan Rangavittal, and 3 more

What technology area does this patent fall under?

Primary CPC classification G16B50/50. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 2, 2018 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for compressing genetic data

US10090857B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10090857-B2
Application number	US-201213492505-A
Country	US
Kind code	B2
Filing date	Jun 8, 2012
Priority date	Apr 26, 2010
Publication date	Oct 2, 2018
Grant date	Oct 2, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of compressing sequence data in a text-based format, the method involving parsing text of the sequence data into a plurality of fields, identifying encoding algorithms that achieve greatest compression gains with respect to the plurality of fields based on collected statistics, and generating a bitstream, compressed from the sequence data, by encoding the sequence data using the identified encoding algorithms.
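The field-parsing and statistics steps named in the abstract can be illustrated with a minimal sketch. It assumes four-line FASTQ records (the format named in claim 5) and uses zeroth-order Shannon entropy as the collected statistic. The function names `parse_fastq` and `entropy_bits_per_symbol` are hypothetical, and the sketch is not the patented implementation: it only reports a per-field statistic and does not choose or run an encoder.

```python
import math
from collections import Counter


def parse_fastq(text):
    """Split FASTQ text into per-field lists: title, sequence, quality.

    Assumes each record is exactly four non-blank lines:
    '@title', sequence, '+...', quality.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fields = {"title": [], "sequence": [], "quality": []}
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        fields["title"].append(title[1:])  # drop the leading '@'
        fields["sequence"].append(seq)
        fields["quality"].append(qual)
    return fields


def entropy_bits_per_symbol(strings):
    """Zeroth-order Shannon entropy (bits per character) over all strings."""
    counts = Counter("".join(strings))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Illustrative input only; not data from the patent.
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nAACC\n+\nII##\n"

if __name__ == "__main__":
    for name, values in parse_fastq(SAMPLE).items():
        print(name, round(entropy_bits_per_symbol(values), 3))
```

A lower entropy figure for a field suggests that a simple statistical coder could compress it well. How the claimed method weighs such statistics when choosing among arithmetic, Markov, and Huffman coding is not shown on this page.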

First claim

Opening claim text (preview).

What is claimed is:
1. A method of compressing genetic sequencing data in a text-based format, the method, implemented in a processor, comprising the steps of: receiving genetic sequencing data obtained using a high throughput genetic sequencing instrument; parsing information included in text of the genetic sequencing data into a plurality of fields, wherein the information comprises title information, sequence data and quality data and wherein the plurality of fields comprises a title information field, a sequence data field and a quality data field; collecting statistics with respect to a symbol represented by strings that are included in each of the plurality of fields; for each of the plurality of fields, identifying an encoding algorithm that achieves greatest compression gains with respect to the field based on the collected statistics, by determining, for each field of the genetic sequencing data, an optimized encoding algorithm selected from the group consisting of an arithmetic encoding algorithm, a Markov encoding algorithm, and a Huffman encoding algorithm; generating bitstreams, compressed from the genetic sequencing data, by encoding each of the plurality of fields of the genetic sequencing data using the respective identified encoding algorithm; and outputting a unified bitstream by merging the generated bitstreams encoded for each of the plurality of fields.
2. The method of claim 1, wherein the text of the genetic sequencing data includes a title line, and the method comprises parsing the title line information to identify constant fields, variable fields, and delimiters.
3. The method of claim 2, wherein the variable fields are further parsed to identify a numeric variable field and an alphanumeric variable field.
4. The method of claim 3, wherein the optimized encoding algorithm for a field is algorithms are identified by employing computing an entropy calculations for the numeric variable field.
5. The method of claim 1, wherein the text-based format is an FASTQ format.
6. The method of claim 1 further comprising, before the parsing the text, determining an origin of the genetic sequencing data in the text-based format, said origin comprising a sequencing system type selected from the group consisting of SoLiD, illumine, 454 and Helicos.
7. The method of claim 1, wherein, if the genetic sequencing data includes a length field representing a length of a DNA sequence read comprised in the text, the method comprises discarding a value of the field length before collecting the statistics.
8. The method of claim 1, wherein collecting the statistics comprises checking for inconsistencies between title lines included in the text.
9. The method of claim 1, wherein collecting the statistics comprises identifying a quality value (Qmax) with a maximum occurrence in the text.
10. The method of claim 9, wherein a quality stream for each read included in the text is represented as an offset, a quality symbol, and a run length.
11. The method of claim 9, wherein the quality value in a quality stream is represented as offset and <quality symbol, run length>.
12. The method of claim 1, further comprising detecting ambiguous symbols in quality scores included in the text, wherein generating the bitstreams comprises encoding the occurrence once.
13. The method of claim 12, further comprising allocating a lowest quality value to each position corresponding to an ambiguous symbol in the text, and wherein the generating the bitstreams includes using a result of the allocation.
14. The method of claim 1, wherein the bitstreams are generated using lossless compression or near-lossless compression.
15. The method of claim 1, wherein the parsing text of the genetic sequencing data comprises parsing a DNA sequence read included in the text by identifying repeats and non-repeats.
16. A non-transitory computer-readable medium having recorded thereon a program for executing a method of claim 1.
17. An apparatus for compressing genetic sequencing data in a text-based format, the apparatus including a processor and a non-transitory processor-readable medium having processor-executable instructions stored thereon, the processor-executable instructions comprising instructions for: receiving genetic sequencing data obtained using a high throughput genetic sequencing instrument; parsing information included in text of the genetic sequencing data into a plurality of fields, wherein the information includes title information, sequence data and quality data and wherein the plurality of fields includes a title information field, a sequence data field and a quality data field; collecting statistics with respect to one or more symbols represented by strings that are included in each of the plurality of fields; identifying, for each of the plurality of fields, an encoding algorithm that achieves greatest compression gains with respect to the field based on the collected statistics, by determining, for each field of the genetic sequencing data, an optimized encoding algorithm selected from the group consisting of an arithmetic encoding algorithm, a Markov encoding algorithm, and a Huffman encoding algorithm; generating bitstreams, compressed from the genetic sequencing data, by encoding each of the plurality of fields of the genetic sequencing data using the respective identified encoding algorithm; and outputting a unified bitstream by merging the generated bitstreams encoded for each of the plurality of fields.
18. The apparatus of claim 17, wherein the instructions for parsing include instructions for parsing the title information line included in the text so as to identify constant fields, variable fields, and delimiters.
19. The apparatus of claim 18, wherein the instructions for parsing include instructions for parsing the variable fields to identify a numeric variable field and an alphanumeric variable field.
20. The apparatus of claim 19, wherein the instructions for coding include instructions for identifying the optimized encoding algorithm for a field algorithms by employing computing an entropy calculation for the numeric variable field.
21. The apparatus of claim 17, wherein the instructions for collecting statistics includes instructions for collecting the statistics by identifying a quality value (Qmax) with a maximum occurrence in the text.
22. The apparatus of claim 21, wherein a quality stream for each read included in the text is represented as an offset, a quality symbol, and a run length, and wherein the quality value in the quality stream is represented as offset and <quality symbol, run length>.
23. The apparatus of claim 17, wherein the instructions for collecting statistics includes instructions for determining an occurrence of ambiguous symbols in quality scores included in the text, and wherein the instructions for generating include generating the bitstreams by encoding the occurrence once.
24. The apparatus of claim 23, wherein the instructions for collecting statistics includes instructions for allocating a lowest quality value to all positions corresponding to ambiguous bases in the text, and wherein the instructions for generating include instructions for generating the bitstreams by using a result of the allocating.
25. The apparatus of claim 17, wherein the instructions for generating include instructions for generating the bitstreams using lossless compression or near-lo
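Claims 9 to 11 and 21 to 22 describe finding the most frequent quality value (Qmax) and representing a quality stream with quality symbols and run lengths. The following is a rough sketch of those two ideas only, under stated assumptions: the function names are hypothetical, the sample strings are illustrative, and the "offset" component is omitted because the claim preview does not define it. It should not be read as the patented encoding.

```python
from collections import Counter
from itertools import groupby


def most_frequent_quality(quality_lines):
    """Return the quality symbol that occurs most often across all reads."""
    counts = Counter("".join(quality_lines))
    if not counts:
        raise ValueError("no quality symbols found")
    return counts.most_common(1)[0][0]


def run_length_encode(quality):
    """Encode one quality string as (quality symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def run_length_decode(pairs):
    """Invert run_length_encode, restoring the original quality string."""
    return "".join(symbol * length for symbol, length in pairs)


if __name__ == "__main__":
    # Illustrative quality strings only; not data from the patent.
    qualities = ["IIIIHHH#", "IIIIIIII"]
    print("most frequent:", most_frequent_quality(qualities))
    for q in qualities:
        print(q, "->", run_length_encode(q))
```

Run-length pairs pay off when quality strings contain long stretches of one symbol, which is why knowing the dominant symbol in advance can be useful to an encoder.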

Classifications

H03M7/40
Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code · CPC title
G16B50/50 · Primary
Compression of genetic data · CPC title
H03M7/46 · Primary
Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind · CPC title

Patent family

Related publications grouped by family.

View patent family 47598127
