Processing and analysis of complex nucleic acid sequence data

US2016378916A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2016378916-A1
Application number: US-201615195741-A
Country: US
Kind code: A1
Filing date: Jun 28, 2016
Priority date: Jun 15, 2009
Publication date: Dec 29, 2016
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present invention is directed to logic for analysis of nucleic acid sequence data that employs algorithms that lead to a substantial improvement in sequence accuracy and that can be used to phase sequence variations, e.g., in connection with the use of the long fragment read (LFR) process.

First claim

Opening claim text (preview).

What is claimed is:

1. A method of determining a sequence of at least a portion of a genome of an organism from a sample comprising genomic DNA of the organism, the method comprising: aliquoting the sample to produce a plurality of aliquots, each aliquot comprising less than a genome equivalent of genomic DNA fragments of the organism, the sample including genomic DNA that is not in a cell at the time of aliquoting; tagging the DNA fragments in each aliquot with an aliquot-specific tag sequence to produce tagged fragments; for each aliquot, sequencing the tagged fragments from the aliquot to obtain signals for bases at positions of the tagged fragments; analyzing, by a computer system, the signals to produce a plurality of reads, the analysis including a basecalling process that determines base calls at positions of the tagged fragments, each read comprising an aliquot-specific tag sequence; counting, by the computer system, aliquots that include a particular base call on a read at a particular position in the genome using the aliquot-specific tag sequences, wherein one or more reads from a first number of the aliquots comprise a first base call at a first position in the genome and reads from a second number of aliquots comprise a different second base call at the first position in the genome; identifying, by the computer system, the first base call as a false base call when the first number of the aliquots in which the first base call appears at the first position is less than a first threshold number of aliquots, the first threshold number being two or greater than two; and assembling, by the computer system, the plurality of reads to produce an assembled sequence, wherein the assembled sequence excludes the first base call at the first position when the first base call is identified as a false base call, the assembled sequence corresponding to at least a portion of the genome of the organism.
2. The method of claim 1 wherein the genome is a mammalian genome and the assembled sequence has a genome call rate of 70 percent or greater and an exome call rate of 70 percent or greater, wherein the assembled sequence comprises no more than one false single nucleotide variant per megabase.
3. The method of claim 1 wherein the genome comprises at least one gigabase.
4. The method of claim 1 wherein the genome is double stranded, the method comprising separating single strands of the double stranded genomic DNA before aliquoting.
5. The method of claim 1 comprising amplifying the DNA fragments in each aliquot.
6. The method of claim 5, wherein the amplification uses adapters or random primers.
7. The method of claim 5 comprising amplifying the DNA fragments in each aliquot by multiple displacement amplification.
8. The method of claim 5 comprising amplifying the DNA fragments in each aliquot at least 1000-fold.
9. The method of claim 5 wherein the sample comprises 1 to 20 cells of the organism.
10. The method of claim 9 wherein the sample comprises cellular contaminants, the method comprising amplifying the DNA fragments in each aliquot in the presence of the cellular contaminants.
11. The method of claim 9 wherein the cells are circulating non-blood cells from blood of the higher organism.
12. The method of claim 1 wherein the assembled sequence has a call rate of at least 70 percent of the genome.
13. The method of claim 1 wherein the sample comprises from 1 pg to 10 ng of the genome.
14. The method of claim 13 wherein the assembled sequence has fewer than one false single nucleotide variant per megabase.
15. The method of claim 1 comprising: receiving a plurality of intact cells of the organism; and disrupting the intact cells to release the genomic DNA, thereby producing the sample comprising genomic DNA of the organism.
16. The method of claim 1 wherein the sample is aliquoted into wells of a multiwell plate.
17. The method of claim 1 wherein the sample is aliquoted into droplets.
18. The method of claim 1 comprising: identifying, by the computer system, the first base call as a false base call at the first position in the genome when the first base call appears in at least a second threshold amount of aliquots that also include the second base call at the first position, where the second number of aliquots is greater than the first number of aliquots.
19. The method of claim 18, wherein the second threshold amount is a percentage of aliquots that include the false base call.
20. The method of claim 1, wherein the fragments are 50-2000 nucleotides in length.
21. The method of claim 1, further comprising: determining, by the computer system, the first number of the aliquots by: identifying a first set of reads that align to the first position of the genome and have the first base call; and counting unique aliquot-specific tag sequences in the first set.
22. The method of claim 21 comprising: determining, by the computer system, reads that align to the first position by aligning reads to each other.
23. The method of claim 21 comprising: determining, by the computer system, reads that align to the first position by aligning reads to a reference genome.
24. The method of claim 1 comprising: determining, by the computer system, a third number of aliquots including a particular base call at a second position in the genome; and using, by the computer system, the third number of aliquots to determine whether the particular base call is accepted at the second position.
25. The method of claim 24 comprising: determining, by the computer system, a score for the particular base call being at the second position in the genome, the score based on the third number of aliquots including the particular base call; and comparing, by the computer system, the score to a first threshold; and identifying, by the computer system, whether the particular base call is accepted or an error based on whether the score is greater than or less than the threshold.
26. The method of claim 25 comprising: determining, by the computer system, one or more other scores for other base calls at the second position, wherein the second position is determined to be a no call when all of the scores are below a threshold.
27. The method of claim 25, wherein the score is a percentage of expected aliquots.
28. The method of claim 27 comprising: determining, by the computer system, that the second position is heterozygous in the genome when two scores are above a second threshold.
29. The method of claim 28 wherein a third score for a third other base call is below a third threshold.
30. The method of claim 1, wherein the aliquot-specific tag sequence includes an aliquot-specific set of tags.
31. The method of claim 1, wherein the signals obtained from the sequencing correspond to intensities of color dyes.
32. The method of claim 1, wherein the organism is a human.
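The counting and thresholding steps recited in claims 1, 21, and 25 to 28 can be illustrated with a minimal Python sketch. It is not the patent's implementation. The read representation (tuples of aliquot tag, aligned position, base call), the function names, the example data, and every threshold value other than the claimed minimum of two aliquots are assumptions made for illustration. Alignment is assumed to have been done already.

```python
from collections import defaultdict

def count_aliquots_per_call(reads):
    """Count distinct aliquot tags supporting each base call at each position.

    reads: iterable of (aliquot_tag, position, base_call) tuples (hypothetical format).
    Returns {position: {base_call: number_of_distinct_aliquots}}.
    """
    tags = defaultdict(lambda: defaultdict(set))
    for aliquot_tag, position, base_call in reads:
        # A set ensures repeated reads from one aliquot are counted once.
        tags[position][base_call].add(aliquot_tag)
    return {pos: {base: len(seen) for base, seen in calls.items()}
            for pos, calls in tags.items()}

def find_false_calls(counts, min_aliquots=2):
    """Flag (position, base_call) pairs seen in fewer than min_aliquots aliquots.

    Claim 1 requires the threshold to be two or more; this sketch flags any
    call below the threshold without checking for a competing call.
    """
    return {(pos, base)
            for pos, calls in counts.items()
            for base, n in calls.items()
            if n < min_aliquots}

def classify_position(calls, expected_aliquots, accept_fraction=0.4):
    """Score each call as a fraction of expected aliquots and classify the site.

    accept_fraction is an illustrative value, not one taken from the patent.
    """
    accepted = sorted(base for base, n in calls.items()
                      if n / expected_aliquots >= accept_fraction)
    if not accepted:
        return "no-call"          # all scores below threshold (claim 26)
    if len(accepted) == 1:
        return "hom:" + accepted[0]
    if len(accepted) == 2:
        return "het:" + "/".join(accepted)  # two scores above threshold (claim 28)
    return "ambiguous"

# Made-up example reads: (aliquot tag, position, base call).
reads = [
    ("tag01", 100, "A"), ("tag02", 100, "A"), ("tag03", 100, "A"),
    ("tag03", 100, "A"),                       # duplicate read, same aliquot
    ("tag04", 100, "G"),                       # supported by one aliquot only
    ("tag01", 200, "C"), ("tag02", 200, "C"),
    ("tag05", 200, "T"), ("tag06", 200, "T"),
]
counts = count_aliquots_per_call(reads)
false_calls = find_false_calls(counts, min_aliquots=2)
site_100 = classify_position(counts[100], expected_aliquots=4)
site_200 = classify_position(counts[200], expected_aliquots=4)
```

In this example the G at position 100 is supported by a single aliquot and is flagged, while position 200 has two calls each supported by half of the expected aliquots and is classified as heterozygous. Counting distinct tags rather than reads is the design point: an amplification or sequencing error confined to one aliquot cannot pass the threshold no matter how many reads it produces.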

Assignees

Complete Genomics Inc

Inventors

Classifications

  • G06F19/22 (Primary)

    Physics · mapped topic

  • C12Q1/6869 (Primary)

    Methods for sequencing · CPC title

  • G16B30/00 (Primary)

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Sequence assembly · CPC title

  • Sequence alignment; Homology search · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016378916A1 cover?
The present invention is directed to logic for analysis of nucleic acid sequence data that employs algorithms that lead to a substantial improvement in sequence accuracy and that can be used to phase sequence variations, e.g., in connection with the use of the long fragment read (LFR) process.
Who is the assignee on this patent?
Complete Genomics Inc
What technology area does this patent fall under?
Primary CPC classification G06F19/22. Mapped technology areas include Physics.
When was this patent published?
Published Dec 29, 2016 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).