Quality score compression

US11527307B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11527307-B2
Application number: US-202117520615-A
Country: US
Kind code: B2
Filing date: Nov 5, 2021
Priority date: Nov 5, 2020
Publication date: Dec 13, 2022
Grant date: Dec 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and computer programs for compressing nucleic acid sequence data. A method can include obtaining nucleic acid sequence data representing: (i) a read sequence, and (ii) a plurality of quality scores, determining whether the read sequence includes at least one “N” base, based on a determination that the read sequence does not include at least one “N” base, generating a first encoded data set by using a first encoding process to encode each of the quality scores of the read sequence using a base-(x minus 1) number, where x is an integer representing a number of different quality scores used by the nucleic acid sequencing device, and using a second encoding process to encode the first encoded data set, thereby compressing the data to be compressed.
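As a rough illustration of the branch the abstract describes, the sketch below picks a radix for the first encoding stage depending on whether the read contains an "N" base call, and computes how many digits of that radix fit in one byte. The function names and the `num_levels` parameter are hypothetical. The code also assumes that a read without "N" needs one fewer distinct quality value than the device can emit, which the abstract implies but does not explain.

```python
def scores_per_byte(base: int) -> int:
    """Return the largest k with base**k <= 256, i.e. how many
    base-`base` digits can share a single 8-bit byte."""
    if base < 2:
        raise ValueError("base must be at least 2")
    k = 0
    while base ** (k + 1) <= 256:
        k += 1
    return k


def choose_base(read: str, num_levels: int) -> int:
    """Pick the radix for the first encoding stage (hypothetical helper).

    Assumption: a read with no 'N' base call uses one fewer distinct
    quality value, so its scores fit in a smaller radix.
    """
    return num_levels if "N" in read else num_levels - 1


if __name__ == "__main__":
    for read in ("ACGTACGT", "ACGNACGT"):
        base = choose_base(read, num_levels=4)
        print(read, "-> base", base, "with", scores_per_byte(base), "scores per byte")
```

Under these assumptions and four quality levels, an N-free read would use base 3 and fit five scores per byte (3^5 = 243, which is at most 256), while a read containing "N" would use base 4 and fit four scores per byte (4^4 = 256).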

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for compressing nucleic acid sequence data, the method comprising: obtaining, by one or more computers, nucleic acid sequence data representing: (i) a read sequence comprising data that corresponds to a plurality of base calls generated by a nucleic acid sequencing device, and (ii) a plurality of quality scores, wherein each quality score of the plurality of quality scores indicates a likelihood that a particular base call of the read sequence was correctly generated by a nucleic acid sequencing device; determining, by one or more computers, whether the read sequence includes at least one “N” base; based on a determination that the read sequence does not include at least one “N” base, generating, by one or more computers, a first encoded data set by using a first encoding process to encode each of the quality scores of the read sequence using a base-(x minus 1) number, where x is an integer representing a number of different quality scores used by the nucleic acid sequencing device; and using, by one or more computers, a second encoding process to encode the first encoded data set, thereby compressing the data to be compressed.

2. The method of claim 1, wherein x is equal to 3.

3. The method of claim 2, wherein the first encoding process comprises encoding, by one or more computers, each set of five quality scores of the plurality of quality scores of the read sequence into a single byte by representing each quality score of the set of five quality scores as a base-3 number.

4. The method of claim 1, further comprising: based on a determination that the read sequence includes at least one “N” base, generating, by one or more computers, a second encoding data set by using a third encoding process to encode each set of four quality scores of the read sequence into a single byte of memory; and using, by one or more computers, a fourth encoding process to encode the second encoding data.

5. The method of claim 4, wherein the second encoding process and the fourth encoding process are the same.

6. The method of claim 1, wherein the obtained data includes a FASTQ file.

7. The method of claim 1, wherein the first encoded data set is a compressed version of the plurality of quality scores.

8. The method of claim 1, wherein the second encoding process is a compression process.

9. The method of claim 8, wherein the compression process comprises a Prediction by Partial Matching (PPMD) implementation of a range encoder.

10. The method of claim 9, wherein, for a given value of the first encoded data set, the given value is compressed according to a 4-bit context relative to the position of the given value within the first encoded data set.

11. A system for compressing nucleic acid sequence data, the system comprising: one or more data processing apparatus; and one or more non-transitory computer-readable storage devices having stored thereon instructions that, when executed by the one or more data processing apparatus, cause the one or more data processing apparatus to perform operations, the operations comprising: obtaining, by the one or more computers, nucleic acid sequence data representing: (i) a read sequence comprising data that corresponds to a plurality of base calls generated by a nucleic acid sequencing device, and (ii) a plurality of quality scores, wherein each quality score of the plurality of quality scores indicates a likelihood that a particular base call of the read sequence was correctly generated by a nucleic acid sequencing device; determining, by the one or more computers, whether the read sequence includes at least one “N” base; based on a determination that the read sequence does not include at least one “N” base, generating, by the one or more computers, a first encoded data set by using a first encoding process to encode each of the quality scores of the read sequence using a base-(x minus 1) number, where x is an integer representing a number of different quality scores used by the nucleic acid sequencing device; and using, by the one or more computers, a second encoding process to encode the first encoded data set, thereby compressing the data to be compressed.

12. The system of claim 11, wherein x is equal to 3.

13. The system of claim 12, wherein the first encoding process comprises encoding, by the one or more computers, each set of five quality scores of the plurality of quality scores of the read sequence into a single byte by representing each quality score of the set of five quality scores as a base-3 number.

14. The system of claim 11, the operations further comprising: based on a determination that the read sequence includes at least one “N” base, generating, by the one or more computers, a second encoding data set by using a third encoding process to encode each set of four quality scores of the read sequence into a single byte of memory; and using, by the one or more computers, a fourth encoding process to encode the second encoding data.

15. The system of claim 14, wherein the second encoding process and the fourth encoding process are the same.

16. The system of claim 11, wherein the obtained data includes a FASTQ file.

17. The system of claim 11, wherein the first encoded data set is a compressed version of the plurality of quality scores.

18. The system of claim 11, wherein the second encoding process is a compression process.

19. The system of claim 18, wherein the compression process comprises a Prediction by Partial Matching (PPMD) implementation of a range encoder.

20. The system of claim 19, wherein, for a given value of the first encoded data set, the given value is compressed according to a 4-bit context relative to the position of the given value within the first encoded data set.

21. A non-transitory computer-readable storage device having stored thereon instructions, which, when executed by a data processing apparatus, cause the data processing apparatus to perform operations, the operations comprising: obtaining nucleic acid sequence data representing: (i) a read sequence comprising data that corresponds to a plurality of base calls generated by a nucleic acid sequencing device, and (ii) a plurality of quality scores, wherein each quality score of the plurality of quality scores indicates a likelihood that a particular base call of the read sequence was correctly generated by a nucleic acid sequencing device; determining, by one or more computers, whether the read sequence includes at least one “N” base; based on a determination that the read sequence does not include at least one “N” base, generating a first encoded data set by using a first encoding process to encode each of the quality scores of the read sequence using a base-(x minus 1) number, where x is an integer representing a number of different quality scores used by the nucleic acid sequencing device; and using a second encoding process to encode the first encoded data set, thereby compressing the data to be compressed.

22. The computer-readable storage device of claim 21, wherein x is equal to 3.

23. The computer-readable storage device of claim 22, wherein the first encoding process comprises encoding each set of five quality scores of the plurality of quality scores of the read sequence into a single byte by representing each quality score of the set of five quality scores as a base-3 number.

24. The computer-readable storage device of claim 21,
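The five-per-byte and four-per-byte packings recited in the dependent claims can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: `LEVELS`, `pack`, `unpack`, and `encode_read` are hypothetical names, the quality values in `LEVELS` are illustrative, the digit order within a byte is an arbitrary choice, and the code assumes the lowest level occurs only in reads that contain "N". Python's standard library has no PPMd range coder, so `zlib` stands in for the second encoding process.

```python
import zlib

# Illustrative quality levels (assumption, not taken from the claims).
LEVELS = [2, 12, 23, 37]


def pack(digits, base, group):
    """Pack `group` base-`base` digits into each output byte.
    A short final group is padded with zero digits."""
    if base ** group > 256:
        raise ValueError("group does not fit in one byte")
    out = bytearray()
    for i in range(0, len(digits), group):
        chunk = digits[i:i + group]
        value = 0
        for d in chunk:
            if not 0 <= d < base:
                raise ValueError("digit out of range for base")
            value = value * base + d
        value *= base ** (group - len(chunk))  # zero-pad the tail
        out.append(value)
    return bytes(out)


def unpack(data, base, group, count):
    """Inverse of pack(); `count` trims the zero padding."""
    digits = []
    for byte in data:
        chunk = []
        for _ in range(group):
            byte, d = divmod(byte, base)
            chunk.append(d)
        digits.extend(reversed(chunk))
    return digits[:count]


def encode_read(bases, quals):
    """Two-stage encoding of one read's quality scores.
    A decoder can re-derive the branch from the read sequence itself."""
    if "N" in bases:
        table = {q: i for i, q in enumerate(LEVELS)}       # 4 symbols
        base, group = 4, 4                                 # 4**4 == 256
    else:
        table = {q: i for i, q in enumerate(LEVELS[1:])}   # 3 symbols
        base, group = 3, 5                                 # 3**5 == 243
    first = pack([table[q] for q in quals], base, group)
    return zlib.compress(first)  # stand-in for the PPMd-based stage


if __name__ == "__main__":
    quals = [37, 37, 23, 12, 37, 23, 37]
    blob = encode_read("ACGTACG", quals)
    first = zlib.decompress(blob)
    print(len(quals), "scores ->", len(first), "packed bytes")
    print(unpack(first, 3, 5, len(quals)))
```

On very short inputs the `zlib` output is larger than the packed bytes because of its fixed header; any second-stage gain would only show up across many reads.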

Assignees

Illumina Inc

Inventors

Classifications

Primary CPC: G16B50/50

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11527307B2 cover?
Methods, systems, and computer programs for compressing nucleic acid sequence data. A method can include obtaining nucleic acid sequence data representing: (i) a read sequence, and (ii) a plurality of quality scores, determining whether the read sequence includes at least one “N” base, based on a determination that the read sequence does not include at least one “N” base, generating a first enc…
Who is the assignee on this patent?
Illumina Inc
What technology area does this patent fall under?
Primary CPC classification G16B50/50. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 13, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).