Methods and systems for data analysis and compression

US9929746B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9929746-B2
Application number: US-201515501804-A
Country: US
Kind code: B2
Filing date: Aug 5, 2015
Priority date: Aug 5, 2014
Publication date: Mar 27, 2018
Grant date: Mar 27, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure provides computer implemented methods and systems for analyzing datasets, such as large data sets output from nucleic acid sequencing technologies. In particular, the present disclosure provides for data analysis comprising computing the BWT of a collection of strings in an incremental, character by character, manner. The present disclosure also provides compression boosting strategies resulting in a BWT of a reordered collection of data that is more compressible by second stage compression methods compared to non-reordered computational analysis.

First claim

Opening claim text (preview).

The invention claimed is:

1. A nucleic acid sequencing system for compressing sequencing data, comprising: a) a processor; and b) a memory coupled with the processor and having instructions that when executed by the processor perform a method comprising: i) receiving a collection of data strings corresponding to a first set of nucleotide data for a first nucleic acid fragment being sequenced in the system; ii) identifying a first character representing a first nucleotide in each of the data strings in the collection; iii) generating a first Burrows Wheeler transform index for a compressed data string containing the first characters corresponding to a first nucleotide of each data string; iv) identifying an additional character representing an additional nucleotide in each of the data strings; and v) updating the first Burrows Wheeler transform index with the additional characters corresponding to each additional nucleotide of the received collection of data strings to form compressed sequencing data.

2. The nucleic acid sequencing system of claim 1, wherein receiving a collection of data strings corresponding to a first set of nucleotide data comprises receiving a collection of nucleic acid reads from a target sequence in the nucleic acid sequencing system.

3. The nucleic acid sequencing system of claim 2, wherein the first set of nucleotide data comprises the first nucleotide from each data string corresponding to the target sequence.

4. The nucleic acid sequencing system of claim 2, wherein the target sequence is a genomic DNA sequence.

5. The nucleic acid sequencing system of claim 1, wherein the system repeats steps iv) and v) for each nucleotide in the collection of data strings to update the Burrows Wheeler transform index with all of the nucleotides in the collection of data strings.

6. The nucleic acid sequencing system of claim 1, further comprising a server comprising a copy of the collection of data strings and the first Burrows Wheeler transform index.

7. The nucleic acid sequencing system of claim 6, wherein the memory has instructions that when executed by the processor perform a further method comprising: vi) determining a predicted next nucleotide for each of the data strings; vii) determining a confirmed nucleotide by receiving a second set of nucleotide data that confirms the identity of the next nucleotide in the nucleic acid sequence; viii) creating a file of difference information comprising the differences between the predicted nucleotide and the confirmed nucleotide; and ix) compressing the file of difference information to form a compressed sequence data file.

8. The nucleic acid sequencing system of claim 7, wherein the memory has instructions that when executed by the processor determine the predicted next nucleotide, at least partly, on the Burrows Wheeler transform index.

9. The nucleic acid sequencing system of claim 7, wherein the instructions, when executed by the processor, perform a further method comprising sending the compressed file of difference information to a server having a copy of the first set of nucleotide data.

10. The nucleic acid sequencing system of claim 7, wherein creating a file of difference information comprises creating a file with a zero for each confirmed nucleotide that is the same as the predicted nucleotide, and a character representing the confirmed nucleotide for each confirmed nucleotide that is different from the predicted nucleotide.

11. The nucleic acid sequencing system of claim 7, wherein compressing the file of difference information comprises replacing the zeros in the file of difference information with a reference to the number of zeros being replaced.

12. A nucleic acid sequencing system for compressing sequencing data, comprising: a) a processor; and b) a memory coupled with the processor and having instructions that when executed by the processor perform a method comprising: i) receiving a collection of data strings corresponding to a first set of nucleotide data for a first nucleic acid fragment being sequenced in the system; ii) identifying a first character representing a first nucleotide in each of the data strings in the collection; iii) determining a predicted next nucleotide for each of the data strings; iv) determining a confirmed nucleotide by receiving a second set of nucleotide data that confirms the identity of the next nucleotide in the nucleic acid sequence; v) creating a file of difference information comprising the differences between the predicted nucleotide and the confirmed nucleotide; and vi) compressing the file of difference information to form a compressed sequence data file.

13. The nucleic acid sequencing system of claim 12, further comprising a server having a processor with instructions that when executed perform a method comprising: receiving the compressed file of difference information; comparing the compressed file of difference information to a data string in a collection of data strings; and replacing predicted nucleotides in the data string with confirmed nucleotides from the compressed file of difference information to form an updated data string.

14. The nucleic acid sequencing system of claim 12, wherein determining the predicted next nucleotide for each of the data strings comprises performing a Burrows Wheeler transform.

15. The nucleic acid sequencing system of claim 12, wherein creating a file of difference information comprises creating a file with a zero for each confirmed nucleotide that is the same as the predicted nucleotide, and a character representing the confirmed nucleotide for each confirmed nucleotide that is different from the predicted nucleotide.

16. A method of compressing sequencing data, comprising: a. receiving a collection of data strings corresponding to a first set of nucleotide data for a first nucleic acid fragment being sequenced in the system; b. identifying a first character representing a first nucleotide in each of the data strings in the collection; c. determining a predicted next nucleotide for each of the data strings; d. determining a confirmed nucleotide by receiving a second set of nucleotide data that confirms the identity of the next nucleotide in the nucleic acid sequence; e. creating a file of difference information comprising the differences between the predicted nucleotide and the confirmed nucleotide; and f. compressing the file of difference information to form a compressed sequence data file.

17. The method of claim 16, wherein determining the predicted next nucleotide for each of the data strings comprises performing a Burrows Wheeler transform.

18. The method of claim 16, wherein creating a file of difference information comprises creating a file with a zero for each confirmed nucleotide that is the same as the predicted nucleotide, and a character representing the confirmed nucleotide for each confirmed nucleotide that is different from the predicted nucleotide.

19. The method of claim 16, wherein the first set of nucleotide data comprises the first nucleotide from each data string corresponding to the target sequence.
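
The difference-file scheme in claims 10, 11 and 13 can be sketched roughly as follows. This is an illustrative sketch only: the function names are hypothetical, and writing each run of zeros as its decimal length (for example `000A0000` becoming `3A4`) is an assumed encoding, since the claims say only that the zeros are replaced "with a reference to the number of zeros being replaced".

```python
import re


def make_diff(predicted, confirmed):
    """Write '0' where the prediction was right, else the confirmed nucleotide."""
    if len(predicted) != len(confirmed):
        raise ValueError("predicted and confirmed must have the same length")
    return "".join("0" if p == c else c for p, c in zip(predicted, confirmed))


def compress_zero_runs(diff):
    """Replace each run of zeros with its length in decimal (assumed encoding)."""
    return re.sub(r"0+", lambda m: str(len(m.group())), diff)


def expand_zero_runs(compressed):
    """Inverse of compress_zero_runs: turn each decimal count back into zeros."""
    return re.sub(r"\d+", lambda m: "0" * int(m.group()), compressed)


def apply_diff(predicted, diff):
    """Server side: overwrite wrong predictions with the confirmed nucleotides."""
    return "".join(p if d == "0" else d for p, d in zip(predicted, diff))


predicted = "ACGTACGT"
confirmed = "ACGAACGT"

diff = make_diff(predicted, confirmed)        # "000A0000"
packed = compress_zero_runs(diff)             # "3A4"
restored = apply_diff(predicted, expand_zero_runs(packed))
print(diff, packed, restored)
```

Because nucleotides are letters and run lengths are digits, the packed form stays unambiguous in this sketch. The better the predictor, the longer the zero runs and the smaller the packed file.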

Assignees

Illumina Cambridge Ltd

Inventors

Classifications

  • Physics · mapped topic

  • H03M7/3068 · Primary

    Precoding preceding compression, e.g. Burrows-Wheeler transformation · CPC title

  • Compression of genetic data · CPC title

  • ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • G16B50/00 · Primary

    ICT programming tools or database systems specially adapted for bioinformatics · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Illumina Cambridge Ltd
What technology area does this patent fall under?
Primary CPC classification H03M7/3068. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Mar 27, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).