Selection device for candidate sequence information for similarity determination, selection method, and use for such device and method
US-2015379197-A1 · Dec 31, 2015 · US
US10192029B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10192029-B2 |
| Application number | US-201514984109-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 30, 2015 |
| Priority date | May 13, 2011 |
| Publication date | Jan 29, 2019 |
| Grant date | Jan 29, 2019 |
Official abstract text for this publication.
System and methods are provided for performing privacy-preserving, high-performance, and scalable DNA read mapping on hybrid clouds including a public cloud and a private cloud. The systems and methods offer strong privacy protection and have the capacity to process millions of reads and allocate most of the workload to the public cloud at a small overall cost. The systems and methods perform seeding on the public cloud using keyed hash values of individual sequencing reads' seeds and then extend matched seeds on the private cloud. The systems and methods are designed to move the workload of read mapping from the extension stage to the seeding stage, thereby ensuring that the dominant portion of the overhead is shouldered by the public cloud.
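The seeding step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes HMAC-SHA256 as the keyed hash, and the function names (`keyed_hash`, `build_reference_index`, `public_cloud_lookup`) and the hash-to-positions dictionary are hypothetical choices made for the example. The private side holds the key and hashes both the reference substrings and the read seeds; the public side only compares hash values.

```python
import hashlib
import hmac
from collections import defaultdict


def keyed_hash(key: bytes, seed: str) -> str:
    """Private cloud: keyed hash of one seed (HMAC-SHA256 is an assumption)."""
    return hmac.new(key, seed.encode("ascii"), hashlib.sha256).hexdigest()


def build_reference_index(key: bytes, reference: str, seed_len: int) -> dict:
    """Private cloud: hash every seed_len-long substring of the reference.

    Returns a mapping from keyed-hash value to the reference positions
    where that substring occurs (a simplification for this sketch).
    """
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[keyed_hash(key, reference[pos:pos + seed_len])].append(pos)
    return dict(index)


def public_cloud_lookup(index: dict, seed_hashes: list) -> dict:
    """Public cloud: match hashed seeds against the hashed reference.

    Only hash values are compared here; no plaintext sequence is needed.
    """
    return {h: index.get(h, []) for h in seed_hashes}


if __name__ == "__main__":
    key = b"secret-key-held-by-private-cloud"
    index = build_reference_index(key, "ACGTACGTTAGC", seed_len=4)
    hits = public_cloud_lookup(index, [keyed_hash(key, "ACGT")])
    print(hits)
```

Because the hash is keyed, a party without the key cannot precompute hash values for known sequences, which is why the lookup can be delegated to the public side in this sketch.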
Opening claim text (preview).
What is claimed is:
1. A method of mapping a plurality of DNA sequence reads to a reference genome, the method comprising: partitioning each of the plurality of DNA sequence reads into a plurality of seeds using computing resources of a private cloud; combining at least two seeds of the plurality of seeds to generate a combined seed using the private cloud computing resources; encrypting, by the private cloud computing resources, the combined seed using a keyed encryption algorithm to produce a keyed-hash value of the combined seed; transmitting the keyed hash value representing the combined seed from the private cloud computing resources to computing resources of a public cloud, wherein the keyed hash value is usable to search against a plurality of keyed hash values derived from a reference genome; receiving, by the private cloud computing resources, from the public cloud computing resources, data indicating positions where the reference genome matches the at least two seeds of the combined seed; and extending, using the private cloud computing resources, each of the at least two seeds at each of the positions where the reference genome matches the at least two seeds of the combined seed to determine whether the DNA sequence read corresponding to each of the at least two seeds aligns with the reference genome at that position.
2. The method of claim 1, further comprising: dividing the reference genome into a plurality of substrings using the private cloud computing resources, each of the plurality of substrings and each of the plurality of seeds being of equal length; encrypting, by the private cloud computing resources, each unique substring of the plurality of substrings using a keyed encryption algorithm to produce a corresponding keyed-hash value for each unique substring of the plurality of substrings; and transmitting each of the corresponding keyed-hash values representing each of the unique substrings from the private cloud computing resources to the public cloud computing resources.
3. The method of claim 2, further comprising: comparing the encrypted data representing the combined seed to groups of the encrypted data representing the unique substrings using the public cloud computing resources; and transmitting data indicating which group of the substrings matches the at least two seeds of the combined seed from the public cloud computing resources to the private cloud computing resources.
4. The method of claim 1, wherein extending each of the at least two seeds at each of the positions where the reference genome matches the at least two seeds of the combined seed comprises determining whether the DNA sequence read corresponding to each of the at least two seeds matches the reference genome at the corresponding position with an edit distance less than or equal to an integer d.
5. The method of claim 4, wherein partitioning each of the plurality of DNA sequence reads comprises partitioning each of the plurality of DNA sequence reads into (d+1) seeds.
6. The method of claim 5, wherein each of the plurality of seeds is twenty or more base pairs in length.
7. The method of claim 4, wherein partitioning each of the plurality of DNA sequence reads comprises partitioning each of the plurality of DNA sequence reads into (d+2) seeds.
8. The method of claim 7, wherein each of the plurality of seeds is between ten and twenty base pairs in length.
9. One or more non-transitory, computer-readable media comprising a first plurality of instructions that, when executed by a first plurality of processors of a private cloud, causes the processors of the private cloud to: partition each of a plurality of DNA sequence reads into a plurality of seeds; combine at least two seeds of the plurality of seeds to generate a combined seed; encrypt the combined seed using a keyed encryption algorithm to produce a keyed-hash value of the combined seed, wherein the keyed hash value is usable to search against a plurality of keyed hash values derived from a reference genome; transmit the keyed hash value representing the combined seed to computing resources of a public cloud; receive, from the computing resources of the public cloud, data indicating positions where a reference genome matches the at least two seeds of the combined seed; and extend each of the at least two seeds at each of the positions where the reference genome matches the at least two seeds of the combined seed to determine whether the DNA sequence read corresponding to each of the at least two seeds aligns with the reference genome at that position.
10. The one or more non-transitory, computer-readable media of claim 9, wherein the first plurality of instructions, when executed by the first plurality of processors, further causes the processors of the private cloud to: divide the reference genome into a plurality of substrings using the processors of the private cloud computing resources, each of the plurality of substrings and each of the plurality of seeds being of equal length; encrypt, by the processors of the private cloud computing, each unique substring of the plurality of substrings using a keyed encryption algorithm to produce a corresponding keyed-hash value for each unique substring of the plurality of substrings; and transmit each of the corresponding keyed-hash values representing each of the unique substrings from the processors of the private cloud computing to the public cloud computing resources.
11. The one or more non-transitory, computer-readable media of claim 10, further comprising a second plurality of instructions that, when executed by a second plurality of processors of the public cloud, causes the processors of the public cloud to: compare the encrypted data representing the combined seed to groups of the encrypted data representing the unique substrings using the public cloud computing resources; and transmit data indicating which group of the substrings matches the at least two seeds of the combined seed from the public cloud computing resources to the processors of the private cloud computing.
12. The one or more non-transitory, computer-readable media of claim 9, wherein the first plurality of instructions, when executed by the first plurality of processors, causes the processors of the private cloud to determine whether the DNA sequence read corresponding to each of the at least two seeds matches the reference genome at the corresponding position with an edit distance less than or equal to an integer d.
13. The one or more non-transitory, computer-readable media of claim 12, wherein the first plurality of instructions, when executed by the first plurality of processors, causes the processors of the private cloud to partition each of the plurality of DNA sequence reads into (d+1) seeds.
14. The one or more non-transitory, computer-readable media of claim 13, wherein the first plurality of instructions, when executed by the first plurality of processors, causes the processors of the private cloud to partition each of the plurality of DNA sequence reads into (d+1) seeds that are each twenty or more base pairs in length.
15. One or more non-transitory, computer-readable media comprising a plurality of instructions that, when executed by a plurality of processors of a private cloud, causes the processors of the private cloud to: partition a DNA sequence read into (d+2) seeds, where d is an integer; combine at least two seeds of the (d+2) seeds to generate a combined seed; encrypt the combined seed using a keyed encryption algorithm to produce a keyed-hash value of the combined seed, wherein the keyed hash v
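Claims 4 and 5 pair an edit-distance bound d with a partition into (d+1) seeds. The reasoning is a pigeonhole argument: at most d edits can touch at most d of the (d+1) seeds, so at least one seed must match the reference exactly. The sketch below illustrates that partition and a private-side extension check. It is an assumption-laden example, not the claimed method: `partition_read`, `best_fit_distance`, and `extend` are hypothetical names, the extension uses a standard approximate-matching dynamic program, and the combined-seed step of claim 1 is not modeled.

```python
def partition_read(read: str, d: int) -> list:
    """Split a read into d+1 contiguous seeds; returns (offset, seed) pairs.

    Seeds are equal length when len(read) is divisible by d+1; otherwise
    the first few seeds are one base longer (a choice made for this sketch).
    """
    n = d + 1
    base, extra = divmod(len(read), n)
    seeds, offset = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        seeds.append((offset, read[offset:offset + length]))
        offset += length
    return seeds


def best_fit_distance(read: str, window: str) -> int:
    """Minimum edit distance between read and any substring of window."""
    prev = [0] * (len(window) + 1)  # row 0 is all zeros: free start in window
    for i, rc in enumerate(read, 1):
        cur = [i]
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j] + 1,                # skip a read base
                           cur[j - 1] + 1,             # skip a window base
                           prev[j - 1] + (rc != wc)))  # match or substitute
        prev = cur
    return min(prev)  # free end in window


def extend(read: str, reference: str, seed_offset: int, ref_pos: int, d: int) -> bool:
    """Check whether the read aligns around a seed hit within edit distance d.

    ref_pos is where the seed matched the reference; seed_offset is where
    the seed sits inside the read. The window is padded by d on each side.
    """
    start = max(0, ref_pos - seed_offset - d)
    end = min(len(reference), ref_pos - seed_offset + len(read) + d)
    return best_fit_distance(read, reference[start:end]) <= d


if __name__ == "__main__":
    reference = "TTTTACGTACGGATCCAAAA"
    read = "ACCTACGGATCC"  # one substitution relative to reference[4:16]
    print(partition_read(read, d=1))
    print(extend(read, reference, seed_offset=6, ref_pos=10, d=1))
```

In the example, the second seed of the read matches the reference exactly even though the first seed carries the substitution, which is the situation the (d+1) partition is meant to guarantee.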
Physics · mapped topic
wherein the data content is protected, e.g. by encrypting or encapsulating the payload · CPC title
Modes of operation, e.g. cipher block chaining [CBC], electronic codebook [ECB] or Galois/counter mode [GCM] · CPC title
involving non-keyed hash functions, e.g. modification detection codes [MDCs], MD5, SHA or RIPEMD · CPC title