Methods and Systems for Processing Polynucleotides
US-2015218633-A1 · Aug 6, 2015 · US
US11081208B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11081208-B2 |
| Application number | US-201615242256-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 19, 2016 |
| Priority date | Feb 11, 2016 |
| Publication date | Aug 3, 2021 |
| Grant date | Aug 3, 2021 |
Official abstract text for this publication.
Described are computer-implemented methods, systems, and media for de novo phased diploid assembly of nucleic acid sequence data generated from a nucleic acid sample of an individual utilizing nucleic acid tags to preserve long-range sequence context for the individual such that a subset of short-read sequence data derived from a common starting sequence shares a common tag. The phased diploid assembly is achieved without alignment to a reference sequence derived from organisms other than the individual. The methods, systems, and media described are computer-resource efficient, allowing scale-up.
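The tagging idea in the abstract can be illustrated with a minimal sketch: reads carrying the same tag are grouped together, so each group approximates the short reads derived from one starting sequence. The function name `group_reads_by_tag`, the `(tag, read)` tuple layout, and the toy tag names are assumptions made for illustration only; the publication does not specify a data format.

```python
from collections import defaultdict


def group_reads_by_tag(tagged_reads):
    """Map each tag to the list of read sequences that carry it.

    `tagged_reads` is an iterable of (tag, sequence) pairs. This layout is
    a hypothetical stand-in for real tagged sequencer output.
    """
    groups = defaultdict(list)
    for tag, sequence in tagged_reads:
        groups[tag].append(sequence)
    return dict(groups)


# Toy input: two reads share TAG_A, so they are treated as coming from
# the same starting sequence.
tagged_reads = [
    ("TAG_A", "ACGTAC"),
    ("TAG_B", "TTGACA"),
    ("TAG_A", "GTACGG"),
]
reads_by_tag = group_reads_by_tag(tagged_reads)
```

Grouping by tag is only the bookkeeping step; the assembly stages described in the claims build on this kind of tag-to-read association.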
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method of de novo genome assembly for nucleic acid sequence data generated from a nucleic acid sample of an organism, the method comprising: (a) generating, by one or more computers, an initial assembly based on short-read sequence data, the initial assembly comprising one or more areas of unresolved sequence ambiguity, wherein the short-read sequence data is derived from longer starting sequences from the nucleic acid sequence data and is tagged to preserve long-range sequence context for the organism such that a subset of the short-read sequence data derived from a common starting sequence share one or more common tags; (b) generating, by the one or more computers, a plurality of local assemblies based on the initial assembly by utilizing the one or more common tags to resolve the one or more areas of unresolved sequence ambiguity, wherein the plurality of local assemblies is generated by: (i) using the initial assembly as an interim reference; (ii) identifying edges of an unambiguous sequence; (iii) identifying neighboring edges sharing a number of the one or more common tags above a threshold number of common tags; and (iv) bringing together edges of the unambiguous sequence with the neighboring edges identified; (c) generating, by the one or more computers, a global assembly based on the plurality of local assemblies; (d) cleaning, by the one or more computers, the global assembly by removing sequence data inconsistent with the long-range sequence context indicated by the one or more common tags; and (e) generating, by the one or more computers, a phased genome assembly based on the global assembly by utilizing the one or more common tags to separate a phased nucleotide sequence; wherein the phased genome assembly is achieved without alignment to a reference sequence or any independently generated genome sequence.
2. The method of claim 1, wherein the phased genome assembly is for a diploid genome.
3. The method of claim 1, wherein the short-read sequence data is generated from a single library.
4. The method of claim 1, wherein the short-read sequence data results in 50× or less coverage of a genome of the organism.
5. The method of claim 1, wherein the short-read sequence data is tagged to preserve context over a starting sequence 2×-1000× longer than reads from the short-read sequence data.
6. The method of claim 1, wherein the initial assembly is an initial assembly graph.
7. The method of claim 6, wherein the initial assembly graph is generated by: (a) identifying a plurality of k-mers that have a high probability of being present in a genome of the organism; (b) using the one or more common tags to filter the plurality of k-mers based on a number of starting sequences each k-mer occurs in; and (c) bringing together k-mers in the plurality of k-mers sharing a common 1-mer to form an initial assembly, wherein 1<k.
8. The method of claim 7, further comprising applying, by the one or more computers, a preliminary filter prior to generating the initial assembly, wherein the preliminary filter comprises: (a) utilization of base quality scores from a sequencer used to generate the short-read sequence data, and (b) utilization of k-mers that occur more than once and the one or more common tags, such that each k-mer must be seen arising from two distinct common tags.
9. The method of claim 8, further comprising applying, by the one or more computers, lossless random access compression to each record of the base quality scores and paths through the graph.
10. The method of claim 7, wherein method further comprises revising, by the one or more computers, the initial assembly graph by: (a) eliminating the one or more areas of unresolved sequence ambiguity based on a number of reads available for each option within an area of unresolved sequence ambiguity; and (b) filling in gaps in the initial assembly graph by consulting the short-read sequence data.
11. The method of claim 7, wherein k is between 24 and 96.
12. The method of claim 7, wherein the global assembly is generated by: (a) identifying a plurality of z-mers in the plurality of local assemblies that have a high probability of being present in a genome of the organism, wherein z>k; and (b) bringing together z-mers of the plurality of z-mers in the plurality of local assemblies.
13. The method of claim 12, wherein z is between 100 and 300.
14. The method of claim 1, wherein the short-read sequence data is generated from less than 10 ng of DNA input material.
15. The method of claim 14, wherein the short-read sequence data is generated from less than 2 ng of DNA input material.
16. The method of claim 1, wherein the phased genome assembly is completed in less than 60 minutes.
17. The method of claim 1, wherein the phased genome assembly is completed in less than 20 minutes.
18. The method of claim 1, wherein the nucleic acid sequence data is whole genome sequence data and the phased genome assembly is a whole genome assembly, wherein the nucleic acid sequence is a deoxyribonucleic acid (DNA) sequence.
19. The method of claim 1, where the short-read sequence data is tagged to preserve long-range sequence context over a starting sequence of 10 kilobases (kb) to 5 megabases (Mb).
20. A computer-implemented system comprising: a digital processing device comprising: at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create a de novo genome assembly application for nucleic acid sequence data generated from a nucleic acid sample of an organism, wherein the de novo genome assembly application is programmed to: (a) generate an initial assembly based on short-read sequence data, the initial assembly comprising one or more areas of unresolved sequence ambiguity, wherein the short-read sequence data is derived from longer starting sequences from the nucleic acid sequence data and is tagged to preserve long-range sequence context for the organism such that a subset of the short-read sequence data derived from a common starting sequence share one or more common tags; (b) generate a plurality of local assemblies based on the initial assembly by utilizing the one or more common tags to resolve the one or more areas of unresolved sequence ambiguity, wherein the plurality of local assemblies is generated by: (i) using the initial assembly as an interim reference; (ii) identifying edges of an unambiguous sequence; (iii) identifying neighboring edges sharing a number of the one or more common tags above a threshold number of common tags; and (iv) bringing together edges of the unambiguous sequence with the neighboring edges identified; (c) generate a global assembly based on the plurality of local assemblies; (d) clean the global assembly by removing sequence data inconsistent with the long-range sequence context indicated by the one or more common tags; and (e) generate a phased genome assembly based on the global assembly by utilizing the one or more common tags to separate a homologous phased nucleotide sequence, wherein the de novo genome assembly application is programmed to achieve the phased genome assembly without alignment to a reference sequence or any independently generated genome sequence.
21. The system of claim 20, wherein the memory comprises less than 512 GB of storage. 2
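Step (b)(iii) of claims 1 and 20, identifying neighboring edges that share more than a threshold number of common tags, can be sketched as a set-intersection count. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `neighboring_edges`, the dictionary mapping edge identifiers to tag sets, the all-pairs comparison, and the threshold value are all hypothetical.

```python
from itertools import combinations


def neighboring_edges(edge_tags, threshold):
    """Return (edge_a, edge_b, shared_count) for every pair of edges whose
    number of shared tags is strictly above `threshold`.

    `edge_tags` maps a hypothetical edge identifier to the set of tags seen
    on reads placed on that edge. The all-pairs loop is quadratic and is
    used here only for clarity.
    """
    pairs = []
    for edge_a, edge_b in combinations(sorted(edge_tags), 2):
        shared = len(edge_tags[edge_a] & edge_tags[edge_b])
        if shared > threshold:
            pairs.append((edge_a, edge_b, shared))
    return pairs


# Toy graph: e1 and e2 share three tags, e3 shares none with either.
edge_tags = {
    "e1": {"t1", "t2", "t3", "t4"},
    "e2": {"t2", "t3", "t4", "t9"},
    "e3": {"t7", "t8"},
}
neighbors = neighboring_edges(edge_tags, threshold=2)
```

With a threshold of 2, only the `e1`/`e2` pair qualifies, since those edges share three tags. Raising the threshold to 3 excludes it, which shows how the threshold controls which edges are brought together in step (b)(iv).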