Set membership testers for aligning nucleic acid samples

US9845552B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9845552-B2
Application number  US-201214354528-A
Country             US
Kind code           B2
Filing date         Oct 18, 2012
Priority date       Oct 27, 2011
Publication date    Dec 19, 2017
Grant date          Dec 19, 2017

Abstract

Official abstract text for this publication.

Disclosed are methods and tools for rapidly aligning reads to a reference sequence. These methods and tools employ Bloom filters or similar set membership testers to perform the alignment. The reads may be short sequences of nucleic acids or other biological molecules and the reference sequences may be sequences of genomes, chromosomes, etc. The Bloom filters include a collection of hash functions, a bit array, and associated logic for applying reads to the filter. Each filter, and there may be multiple of these used in a particular application, is used to determine whether an applied read is present in a reference sequence. Each Bloom filter is associated with a single reference sequence such as the sequence of a particular chromosome. In one example, chromosomal abundance is determined by aligning reads from a sequencer to multiple chromosomes, each having an associated Bloom filter or other set membership tester.
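The abstract describes each filter as a bit array, a collection of hash functions, and logic for applying reads. The following is a minimal Python sketch of that idea, not the patented implementation. The class `BloomFilter`, the helper `build_reference_filter`, the use of SHA-256 with double hashing to derive bit positions, and the default sizes are all assumptions made for illustration; the patent text shown here does not specify them.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Assumption: derive the bit positions from one SHA-256 digest via
        # double hashing. The patent does not name its hash functions.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set the bit at every position produced by the hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Member only if every associated bit is set (false positives possible).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_reference_filter(reference, read_length, num_bits=1 << 16, num_hashes=9):
    """Hypothetical helper: add every read-sized window of one reference."""
    bloom = BloomFilter(num_bits, num_hashes)
    for start in range(len(reference) - read_length + 1):
        bloom.add(reference[start:start + read_length])
    return bloom


reference = "ACGTACGTTAGCCGATAGGCTA"  # toy sequence, not real genomic data
chrom_filter = build_reference_filter(reference, read_length=8)
print("ACGTTAGC" in chrom_filter)  # True: a Bloom filter has no false negatives
print("GGGGGGGG" in chrom_filter)  # almost certainly False at this tiny load
```

A read that occurs in the reference always tests as present. A read that does not occur can still test as present with a small probability, which is why the filter size and the number of hash functions matter.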

First claim

Opening claim text (preview).

What is claimed is:

1. A method, implemented on a computer system comprising one or more processors and system memory, for detecting copy number variations, the method comprising:
(a) receiving, by the computer system, a plurality of reads obtained from a sample;
(b) providing, on the computer system, a plurality of One Read Bloom filters corresponding to a plurality of regions of a genome and a plurality of Multiple Read Bloom filters corresponding to the plurality of regions of the genome, wherein each Bloom filter comprises a bit array, one or more hash functions, and logic for applying reads to the Bloom filter, each One Read Bloom filter was constructed using approximately read-sized sequences in its corresponding region of the genome, and each Multiple Read Bloom filter was constructed using approximately read-sized sequences found more than once in its corresponding region of the genome;
(c) applying, by the one or more processors, each read of the plurality of reads to each One Read Bloom filter to determine a membership of each read in each One Read Bloom filter, wherein applying a read to a Bloom filter comprises: providing the read as an input to each hash function of the one or more hash functions of the Bloom filter, obtaining an output value from each hash function and the read, wherein each output value is associated with a bit position in the bit array of the Bloom filter, and determining that the read is a member of the Bloom filter based on bit values of the bit array at bit positions associated with output values obtained from the one or more hash functions and the read;
(d) applying, by the one or more processors, each read of the plurality of reads to each Multiple Read Bloom filter to determine a membership of each read in each Multiple Read Bloom filter;
(e) determining, based on the memberships determined in (c) and (d) and by the one or more processors, which one or more regions of the plurality of regions the reads are aligned to;
(f) determining, by the one or more processors, from a number of reads aligned to each of the plurality of regions of the genome, read abundance values of the plurality of regions of the genome;
(g) comparing, on a region-to-region basis and by the one or more processors, a read abundance value of each region of the plurality of regions of the genome to a threshold number to produce one or more statistical values indicating aberrations of read abundance in one or more regions of the plurality of regions; and
(h) making, based on the one or more statistical values, one or more detection calls of copy number variation in one or more of the plurality of regions of the genome.

2. The method of claim 1, wherein determining the read abundance values of the plurality of regions comprises excluding a read from any of the plurality of regions when the read is a member of two or more filters of the plurality of One Read Bloom filters.

3. The method of claim 1, wherein the plurality of regions of the genome corresponds to a plurality of chromosomes of an organism, and the copy number variation comprises a chromosomal aneuploidy.

4. The method of claim 1, wherein the sample comprises a mixture of genomes.

5. The method of claim 4, wherein the sample comprises cells taken from a pregnant individual.

6. The method of claim 1, wherein at least one of the Bloom filters comprises 9 or 10 hash functions.

7. The method of claim 6, wherein the hash functions require at most about 5 machine instructions to hash a character.

8. The method of claim 1, wherein at least one of the Bloom filters comprises a bit array having between about 1.5×10¹⁰ to 8.5×10¹¹ bit positions.

9. The method of claim 1, wherein at least one of the Bloom filters has a false positive probability of at most about 0.00001.

10. The method of claim 1, wherein the plurality of regions of a genome are portions of chromosomes, and the copy number variation comprises a partial chromosomal aneuploidy.

11. The method of claim 1, further comprising applying the plurality of reads to an exclusion region Bloom filter to determine whether any reads should be excluded from alignment to any regions.

12. The method of claim 1, wherein at least one Multiple Read Bloom filter of the plurality of Multiple Read Bloom filters was constructed using repeated sequences.

13. The method of claim 12, wherein the repeated sequences are located in the at least one Multiple Read Bloom filter's corresponding region of the genome.

14. The method of claim 1, wherein at least one filter of the plurality of One Read Bloom filters was constructed using approximately read-sized sequences in its corresponding region of the genome but not in one or more exclusion regions in its corresponding region of the genome.

15. The method of claim 14, wherein at least one Multiple Read Bloom filter of the plurality of Multiple Read Bloom filters was constructed using approximately read-sized sequences in the one or more exclusion regions, as well as approximately read-sized sequences found more than once in its corresponding region of the genome.

16. The method of claim 1, wherein (e) comprises determining a read is aligned to a region when the read is a member of a One Read Bloom filter of the region and is not a member of any of the plurality of Multiple Read Bloom filters.

17. The method of claim 1, wherein (e) comprises determining a read is aligned to a region when the read is a member of a One Read Bloom filter of the region and is not a member of any of the plurality of Multiple Read Bloom filters or any other One Read Bloom filters.

18. The method of claim 1, wherein the plurality of regions of the genome corresponds to a plurality chromosome strands, sub-strand regions, or custom regions.

19. The method of claim 1, wherein the approximately read-sized sequences fit into one or more read sizes of the plurality of reads.

20. A computer program product for detecting copy number variations, the computer program product comprising a non-transitory computer readable medium on which is provided program instructions comprising:
(a) code for receiving a plurality of reads obtained from a sample;
(b) code for providing a plurality of One Read Bloom filters corresponding to a plurality of regions of a genome and a plurality of Multiple Read Bloom filters corresponding to the plurality of regions of the genome, wherein each Bloom filter comprises a bit array, one or more hash functions, and logic for applying reads to the Bloom filter, each One Read Bloom filter was constructed using approximately read-sized sequences in its corresponding region of the genome, and each Multiple Read Bloom filter was constructed using approximately read-sized sequences found more than once in its corresponding region of the genome;
(c) code for applying each read of the plurality of reads to each One Read Bloom filter to determine a membership of each read in each One Read Bloom filter, wherein applying a read to a Bloom filter comprises: providing the read as an input to each hash function of the one or more hash functions of the Bloom filter, obtaining an output value from each hash function and the read, wherein each output value is associated with a bit position in the bit array of the Bloom filter, and determining that the read is a member of the Bloom filter based on bit values of the bit array at bit positions associated with output values obtained from the one or more hash functions and the read;
(d) code for applying each read of t
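Steps (b) through (f) of claim 1, together with the alignment rule of claim 17, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the claimed implementation: plain Python sets stand in for the Bloom filters (they are exact set membership testers, so there are no false positives), the function names are hypothetical, and the region sequences are toy data. Steps (g) and (h), the threshold comparison and the detection call, are not shown.

```python
from collections import Counter


def read_sized_windows(sequence, read_length):
    """All read-sized subsequences of a region (hypothetical helper)."""
    return [
        sequence[i:i + read_length]
        for i in range(len(sequence) - read_length + 1)
    ]


def build_filters(regions, read_length):
    """Build One Read and Multiple Read testers for each region.

    Assumption: sets stand in for Bloom filters in this sketch.
    """
    one_read, multiple_read = {}, {}
    for name, sequence in regions.items():
        counts = Counter(read_sized_windows(sequence, read_length))
        one_read[name] = set(counts)  # every read-sized sequence in the region
        multiple_read[name] = {w for w, c in counts.items() if c > 1}  # repeats
    return one_read, multiple_read


def align_read(read, one_read, multiple_read):
    """Claim 17 style rule: exactly one One Read hit and no Multiple Read hit."""
    if any(read in tester for tester in multiple_read.values()):
        return None
    hits = [name for name, tester in one_read.items() if read in tester]
    return hits[0] if len(hits) == 1 else None


def read_abundance(reads, one_read, multiple_read):
    """Count the reads aligned to each region (step (f))."""
    counts = {name: 0 for name in one_read}
    for read in reads:
        region = align_read(read, one_read, multiple_read)
        if region is not None:
            counts[region] += 1
    return counts


regions = {"chrA": "ACGTACGTAC", "chrB": "TTGCAAGGCT"}  # toy regions
one_read, multiple_read = build_filters(regions, read_length=4)
reads = ["TACG", "ACGT", "TTGC", "GGCT", "AAAA"]
print(read_abundance(reads, one_read, multiple_read))
# {'chrA': 1, 'chrB': 2}
```

In this toy run, "ACGT" occurs twice in chrA, so it lands in that region's Multiple Read tester and is not counted, and "AAAA" matches no region. Swapping the sets for real Bloom filters would keep the same control flow while trading exactness for memory.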

Assignees

Verinata Health Inc

Inventors

Classifications

  • Physics · mapped topic

  • C40B30/02 · Primary

    Chemistry & Metallurgy · mapped topic

  • ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides · CPC title

  • G16B30/10 · Primary

    Sequence alignment; Homology search · CPC title

  • Ploidy or copy number detection · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9845552B2 cover?
Disclosed are methods and tools for rapidly aligning reads to a reference sequence. These methods and tools employ Bloom filters or similar set membership testers to perform the alignment. The reads may be short sequences of nucleic acids or other biological molecules and the reference sequences may be sequences of genomes, chromosomes, etc. The Bloom filters include a collection of hash functi…
Who is the assignee on this patent?
Verinata Health Inc
What technology area does this patent fall under?
Primary CPC classification C40B30/02. Mapped technology areas include Chemistry & Metallurgy.
When was this patent published?
Publication date Dec 19, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).