Systems, methods, and media for de novo assembly of whole genome sequence data

US11081208B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11081208-B2
Application number	US-201615242256-A
Country	US
Kind code	B2
Filing date	Aug 19, 2016
Priority date	Feb 11, 2016
Publication date	Aug 3, 2021
Grant date	Aug 3, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described are computer-implemented methods, systems, and media for de novo phased diploid assembly of nucleic acid sequence data generated from a nucleic acid sample of an individual utilizing nucleic acid tags to preserve long-range sequence context for the individual such that a subset of short-read sequence data derived from a common starting sequence shares a common tag. The phased diploid assembly is achieved without alignment to a reference sequence derived from organisms other than the individual. The methods, systems, and media described are computer-resource efficient, allowing scale-up.
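As a rough illustration of the tagging idea in the abstract, the sketch below groups short reads by their tag, so that reads derived from the same starting sequence end up together. The read representation (a `(tag, sequence)` tuple), the tag names, and the function name `group_reads_by_tag` are assumptions made for this example and do not come from the patent.

```python
from collections import defaultdict


def group_reads_by_tag(reads):
    """Map each tag to the list of read sequences that carry it."""
    groups = defaultdict(list)
    for tag, sequence in reads:
        groups[tag].append(sequence)
    return dict(groups)


# Hypothetical toy input: two reads share TAG_A, one read carries TAG_B.
reads = [
    ("TAG_A", "ACGTAC"),
    ("TAG_B", "TTGACC"),
    ("TAG_A", "GTACGG"),
]

groups = group_reads_by_tag(reads)
print(groups["TAG_A"])
```

Reads in the same group are candidates for having come from one long starting molecule, which is the long-range context the abstract refers to.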

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of de novo genome assembly for nucleic acid sequence data generated from a nucleic acid sample of an organism, the method comprising:
(a) generating, by one or more computers, an initial assembly based on short-read sequence data, the initial assembly comprising one or more areas of unresolved sequence ambiguity, wherein the short-read sequence data is derived from longer starting sequences from the nucleic acid sequence data and is tagged to preserve long-range sequence context for the organism such that a subset of the short-read sequence data derived from a common starting sequence share one or more common tags;
(b) generating, by the one or more computers, a plurality of local assemblies based on the initial assembly by utilizing the one or more common tags to resolve the one or more areas of unresolved sequence ambiguity, wherein the plurality of local assemblies is generated by:
(i) using the initial assembly as an interim reference;
(ii) identifying edges of an unambiguous sequence;
(iii) identifying neighboring edges sharing a number of the one or more common tags above a threshold number of common tags; and
(iv) bringing together edges of the unambiguous sequence with the neighboring edges identified;
(c) generating, by the one or more computers, a global assembly based on the plurality of local assemblies;
(d) cleaning, by the one or more computers, the global assembly by removing sequence data inconsistent with the long-range sequence context indicated by the one or more common tags; and
(e) generating, by the one or more computers, a phased genome assembly based on the global assembly by utilizing the one or more common tags to separate a phased nucleotide sequence;
wherein the phased genome assembly is achieved without alignment to a reference sequence or any independently generated genome sequence.

2. The method of claim 1, wherein the phased genome assembly is for a diploid genome.

3. The method of claim 1, wherein the short-read sequence data is generated from a single library.

4. The method of claim 1, wherein the short-read sequence data results in 50× or less coverage of a genome of the organism.

5. The method of claim 1, wherein the short-read sequence data is tagged to preserve context over a starting sequence 2×-1000× longer than reads from the short-read sequence data.

6. The method of claim 1, wherein the initial assembly is an initial assembly graph.

7. The method of claim 6, wherein the initial assembly graph is generated by:
(a) identifying a plurality of k-mers that have a high probability of being present in a genome of the organism;
(b) using the one or more common tags to filter the plurality of k-mers based on a number of starting sequences each k-mer occurs in; and
(c) bringing together k-mers in the plurality of k-mers sharing a common 1-mer to form an initial assembly, wherein 1<k.

8. The method of claim 7, further comprising applying, by the one or more computers, a preliminary filter prior to generating the initial assembly, wherein the preliminary filter comprises:
(a) utilization of base quality scores from a sequencer used to generate the short-read sequence data, and
(b) utilization of k-mers that occur more than once and the one or more common tags, such that each k-mer must be seen arising from two distinct common tags.

9. The method of claim 8, further comprising applying, by the one or more computers, lossless random access compression to each record of the base quality scores and paths through the graph.

10. The method of claim 7, wherein method further comprises revising, by the one or more computers, the initial assembly graph by:
(a) eliminating the one or more areas of unresolved sequence ambiguity based on a number of reads available for each option within an area of unresolved sequence ambiguity; and
(b) filling in gaps in the initial assembly graph by consulting the short-read sequence data.

11. The method of claim 7, wherein k is between 24 and 96.

12. The method of claim 7, wherein the global assembly is generated by:
(a) identifying a plurality of z-mers in the plurality of local assemblies that have a high probability of being present in a genome of the organism, wherein z>k; and
(b) bringing together z-mers of the plurality of z-mers in the plurality of local assemblies.

13. The method of claim 12, wherein z is between 100 and 300.

14. The method of claim 1, wherein the short-read sequence data is generated from less than 10 ng of DNA input material.

15. The method of claim 14, wherein the short-read sequence data is generated from less than 2 ng of DNA input material.

16. The method of claim 1, wherein the phased genome assembly is completed in less than 60 minutes.

17. The method of claim 1, wherein the phased genome assembly is completed in less than 20 minutes.

18. The method of claim 1, wherein the nucleic acid sequence data is whole genome sequence data and the phased genome assembly is a whole genome assembly, wherein the nucleic acid sequence is a deoxyribonucleic acid (DNA) sequence.

19. The method of claim 1, where the short-read sequence data is tagged to preserve long-range sequence context over a starting sequence of 10 kilobases (kb) to 5 megabases (Mb).

20. A computer-implemented system comprising: a digital processing device comprising: at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create a de novo genome assembly application for nucleic acid sequence data generated from a nucleic acid sample of an organism, wherein the de novo genome assembly application is programmed to:
(a) generate an initial assembly based on short-read sequence data, the initial assembly comprising one or more areas of unresolved sequence ambiguity, wherein the short-read sequence data is derived from longer starting sequences from the nucleic acid sequence data and is tagged to preserve long-range sequence context for the organism such that a subset of the short-read sequence data derived from a common starting sequence share one or more common tags;
(b) generate a plurality of local assemblies based on the initial assembly by utilizing the one or more common tags to resolve the one or more areas of unresolved sequence ambiguity, wherein the plurality of local assemblies is generated by:
(i) using the initial assembly as an interim reference;
(ii) identifying edges of an unambiguous sequence;
(iii) identifying neighboring edges sharing a number of the one or more common tags above a threshold number of common tags; and
(iv) bringing together edges of the unambiguous sequence with the neighboring edges identified;
(c) generate a global assembly based on the plurality of local assemblies;
(d) clean the global assembly by removing sequence data inconsistent with the long-range sequence context indicated by the one or more common tags; and
(e) generate a phased genome assembly based on the global assembly by utilizing the one or more common tags to separate a homologous phased nucleotide sequence, wherein the de novo genome assembly application is programmed to achieve the phased genome assembly without alignment to a reference sequence or any independently generated genome sequence.

21. The system of claim 20, wherein the memory comprises less than 512 GB of storage.

2…
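Step (b)(iii) of claim 1 selects neighboring edges that share more than a threshold number of tags with a given edge. A minimal sketch of that selection follows, assuming each edge of the assembly graph has already been annotated with the set of tags seen on its reads. The dictionary layout, the edge and tag names, and the function `neighboring_edges` are hypothetical and are not taken from the patent, which does not specify a data structure in the text shown here.

```python
def neighboring_edges(edge_tags, edge, threshold):
    """Return the edges that share more than `threshold` tags with `edge`.

    `edge_tags` maps an edge name to the set of tags observed on that edge.
    The result is sorted by edge name so the output is deterministic.
    """
    own_tags = edge_tags[edge]
    return sorted(
        other
        for other, tags in edge_tags.items()
        if other != edge and len(own_tags & tags) > threshold
    )


# Hypothetical toy graph annotation: e1 and e2 share three tags.
edge_tags = {
    "e1": {"t1", "t2", "t3", "t4"},
    "e2": {"t2", "t3", "t4", "t9"},
    "e3": {"t4", "t7"},
    "e4": {"t5", "t6"},
}

neighbors = neighboring_edges(edge_tags, "e1", threshold=2)
print(neighbors)  # ['e2']
```

The edges returned here would be the ones brought together with the unambiguous edge in step (b)(iv). How the threshold is chosen is not stated in the claim text shown on this page.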

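Claim 8(b) requires that each retained k-mer be seen arising from two distinct tags. The sketch below shows one way such a filter could look, assuming reads are given as `(tag, sequence)` tuples. The function names, the `min_tags` parameter, and the toy value `k=4` are illustrative assumptions (claim 11 recites k between 24 and 96), and the base-quality part of the filter in claim 8(a) is not modeled.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every length-k substring of `sequence`."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def kmers_seen_in_distinct_tags(reads, k, min_tags=2):
    """Keep only k-mers observed under at least `min_tags` different tags."""
    tags_per_kmer = defaultdict(set)
    for tag, sequence in reads:
        for kmer in kmers(sequence, k):
            tags_per_kmer[kmer].add(tag)
    return {kmer for kmer, tags in tags_per_kmer.items() if len(tags) >= min_tags}


# Hypothetical toy reads: only "CGTA" occurs under both TAG_A and TAG_B.
reads = [
    ("TAG_A", "ACGTA"),
    ("TAG_B", "CGTAC"),
    ("TAG_B", "GGGTT"),
]

kept = kmers_seen_in_distinct_tags(reads, k=4)
print(kept)  # {'CGTA'}
```

Counting distinct tags rather than raw occurrences means a k-mer repeated many times within a single tag group is still dropped, which is the distinction the claim draws.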
Assignees

10X Genomics Inc

Inventors

Classifications

G16B30/10
Sequence alignment; Homology search · CPC title
G16B30/20 · Primary
Sequence assembly · CPC title
G16B30/00 · Primary
ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

Patent family

Related publications grouped by family.

View patent family 59561717

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11081208B2 cover?: Described are computer-implemented methods, systems, and media for de novo phased diploid assembly of nucleic acid sequence data generated from a nucleic acid sample of an individual utilizing nucleic acid tags to preserve long-range sequence context for the individual such that a subset of short-read sequence data derived from a common starting sequence shares a common tag. The phased diploid …
Who is the assignee on this patent?: 10X Genomics Inc
What technology area does this patent fall under?: Primary CPC classification G16B30/20. Mapped technology areas include Physics.
When was this patent published?: Published Aug 3, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).