Method, computer-accessible medium and systems for score-driven whole-genome shotgun sequence assemble

US10839940B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10839940-B2
Application number: US-200913139809-A
Country: US
Kind code: B2
Filing date: Dec 23, 2009
Priority date: Dec 24, 2008
Publication date: Nov 17, 2020
Grant date: Nov 17, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Exemplary embodiments of the present disclosure relate generally to methods, computer-accessible medium and systems for assembling haplotype and/or genotype sequences of at least one genome, which can be based upon, e.g., consistent layouts of short sequence reads and long-range genome related data. For example, a processing arrangement can be configured to perform a procedure including, e.g., obtaining randomly located short sequence reads, using at least one score function in combination with constraints based on, e.g., the long range data, generating a layout of randomly located short sequence reads such that the layout is globally optimal with respect to the score function, obtained through searching coupled with score and constraint dependent pruning to determine the globally optimal layout substantially satisfying the constraints, generating a whole and/or a part of a genome wide haplotype sequence and/or genotype sequence, and converting a globally optimal layout into one or more consensus sequences.
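The abstract's "searching coupled with score and constraint dependent pruning" can be illustrated with a small sketch. This is not the patented procedure: it is a simplified beam-style search, which does not guarantee a globally optimal layout. The score (sum of suffix/prefix overlap lengths), the `satisfies_constraints` callback standing in for long-range data, and the parameters `max_queue` and `keep_fraction` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Layout = Tuple[int, Tuple[int, ...]]  # (score, read indices in layout order)


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Longest suffix of `a` that is a prefix of `b`; 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def pruned_layout_search(
    reads: Sequence[str],
    satisfies_constraints: Callable[[Tuple[int, ...]], bool],
    max_queue: int = 4,
    keep_fraction: float = 0.5,
) -> Layout:
    """Extend partial layouts read by read, pruning by constraints and by score."""
    queue: List[Layout] = [(0, (i,)) for i in range(len(reads))]
    best = max(queue)
    while queue:
        extended: List[Layout] = []
        for score, path in queue:
            last = path[-1]
            for j in range(len(reads)):
                if j in path:
                    continue
                k = overlap_len(reads[last], reads[j])
                if k == 0:
                    continue
                new_path = path + (j,)
                # Constraint-dependent pruning (hypothetical long-range check).
                if not satisfies_constraints(new_path):
                    continue
                extended.append((score + k, new_path))
        if not extended:
            break
        extended.sort(reverse=True)
        top_score = extended[0][0]
        # Score-dependent pruning: drop weak layouts, then cap the queue size.
        queue = [e for e in extended if e[0] >= keep_fraction * top_score]
        queue = queue[:max_queue]
        best = max(best, queue[0])
    return best
```

For example, with the toy reads `["ATGGCGT", "GCGTACC", "TACCTTA"]` and a constraint that accepts every layout, the search returns the layout `(0, 1, 2)` with score 8. A constraint that rejects read 2 directly after read 1 stops the search at `(0, 1)`.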

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer-accessible medium having stored thereon computer executable instructions for assembling at least one part of at least one of at least one haplotype sequence or at least one genotype sequence of at least one genome, wherein, when the executable instructions are executed by a computer processing arrangement, the processing arrangement is configured to perform at least one procedure comprising:
    (a) obtaining (i) a plurality of randomly located short sequence reads, and (ii) overlap information about overlaps between the randomly located short sequence reads;
    (b) obtaining long range information for the randomly located short sequence reads, wherein the long range information includes optical map data and mate-pair data;
    (c) automatically randomly selecting a first read from the randomly located short sequence reads;
    (d) automatically identifying one or more overlapping second reads of the randomly located short sequence reads that overlap with the first read;
    (e) automatically generating one or more scores for the one or more overlapping second reads using the overlap information and the long range information;
    (f) selecting a particular read of the one or more second overlapping reads based on the one or more scores;
    (g) automatically generating a path through the plurality of randomly located short sequence reads by repeating procedures (e) and (f); and
    (h) automatically assembling the at least one part of the at least one of the at least one haplotype sequence or the at least one genotype sequence of the genome based on the path.

2. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the one or more scores based on at least one of a containment or an overhang among a single pair of the randomly located short sequence reads.

3. The computer-accessible medium of claim 2, wherein the processing arrangement is further configured to evaluate the at least one of the containment or the overhang using at least one of (i) an orientation of the randomly located short sequence reads, (ii) a location of the randomly located short sequence reads, or (iii) a haplotypic identity of the randomly located short sequence reads.

4. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the one or more scores using a weighted transitivity score.

5. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the one or more scores using a Bayesian likelihood.

6. The computer-accessible medium of claim 5, wherein the Bayesian likelihood is based on at least one penalty function.

7. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the one or more scores based on a plurality of homologous reference sequences.

8. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the one or more scores based on short range information.

9. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to prune at least one of the paths.

10. The computer-accessible medium of claim 9, wherein the processing arrangement is configured to prune the at least one of the paths based on the one or more scores.

11. The computer-accessible medium of claim 9, wherein the processing arrangement is configured to prune the at least one of the paths based on the overlap information.

12. The computer-accessible medium of claim 9, wherein the processing arrangement is configured to prune the at least one of the paths based on a maximum number of candidate paths allowed in a queue.

13. The computer-accessible medium of claim 12, wherein the maximum number of candidate paths allowed in the queue is fixed.

14. The computer-accessible medium of claim 9, wherein the processing arrangement is configured to prune the at least one of the paths based on a percentage of top ranking paths compared to an optimum score.

15. The computer-accessible medium of claim 14, wherein the percentage of top ranking paths compared to an optimum score dynamically changes over time.

16. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to obtain the randomly located short sequence reads using at least one of (i) Sanger chemistry, (ii) sequencing-by-synthesis, (iii) sequencing-by-hybridization, or (iv) sequencing-by-ligation.

17. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to obtain the randomly located short sequence reads using at least one method having at least one error, wherein the at least one error is at least one of: (i) incorrect base-calls, (ii) missing bases, (iii) inserted bases, or (iv) homopolymeric compression.

18. The computer-accessible medium of claim 1, wherein the long-range information further includes a physical map that is at least one of (i) an ordered restriction map, (ii) a probe map, or (iii) a base-distribution map.

19. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to evaluate the scoring procedure based on a consistency of the one or more scores with respect to the long-range information by determining a local alignment with an alignment score.

20. The computer-accessible medium of claim 1, wherein the randomly located short sequence reads are generated using at least one procedure having at least one error, and wherein the at least one error is at least one of: (i) incorrect base-calls, (ii) missing bases, (iii) inserted bases, (iv) homopolymeric compression or (v) expansion.

21. The computer-accessible medium of claim 1, wherein the long-range comprises approximately 10 Kb-200 mb of information associated with the at least one genome.

22. A method for assembling at least one part of at least one of at least one haplotype sequence or at least one genotype sequence of at least one genome, comprising:
    (a) obtaining (i) a plurality of randomly located short sequence reads, and (ii) overlap information about overlaps between the randomly located short sequence reads;
    (b) obtaining long range information for the randomly located short sequence reads, wherein the long range information includes optical map data and mate-pair data;
    (c) automatically randomly selecting a first read from the randomly located short sequence reads;
    (d) automatically identifying one or more overlapping second reads of the randomly located short sequence reads that overlap with the first read;
    (e) automatically generating one or more scores regarding the one or more overlapping second reads using the overlap information and the long range information;
    (f) selecting a particular read of the one or more second overlapping reads based on the one or more scores;
    (g) automatically generating a path through the plurality of randomly located short sequence reads by repeating procedures (e) and (f); and
    (h) using a computer hardware arrangement, automatically assembling the at least one part of the at least one of the at least one haplotype sequence or the at least one genotype sequence of the genome based on the path.

23. The method of claim 22, further comprising generating the one or more scores based on at least one of a containment or an overhang among a single pair of the randomly located
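Steps (c) through (h) of the independent claims describe a score-guided walk over overlapping reads. The sketch below is a toy illustration under stated assumptions, not the claimed implementation: overlaps are exact suffix/prefix matches, the score is the overlap length plus a hypothetical `bonus` when a pair appears in a `mate_pairs` set, optical map data is not modeled, and the path is only extended to the right. All function and parameter names are invented for this example.

```python
import random
from typing import AbstractSet, List, Optional, Sequence, Tuple


def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Longest suffix of `a` that is a prefix of `b`; 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_path(
    reads: Sequence[str],
    mate_pairs: AbstractSet[Tuple[int, int]] = frozenset(),
    bonus: int = 2,
    start: Optional[int] = None,
    seed: int = 0,
) -> List[int]:
    """Build a path of read indices by repeatedly taking the best-scoring overlap."""
    rng = random.Random(seed)
    # Step (c): pick the first read at random unless a start index is given.
    cur = rng.randrange(len(reads)) if start is None else start
    path = [cur]
    while True:
        candidates = []
        for j in range(len(reads)):
            if j in path:
                continue
            k = overlap_len(reads[cur], reads[j])  # step (d): overlapping reads
            if k > 0:
                # Step (e): toy score from overlap plus long-range support.
                score = k + (bonus if (cur, j) in mate_pairs else 0)
                candidates.append((score, j))
        if not candidates:
            return path
        _, cur = max(candidates)  # step (f): select the top-scoring read
        path.append(cur)          # step (g): repeat to grow the path


def assemble(reads: Sequence[str], path: Sequence[int]) -> str:
    """Step (h): merge the reads along the path into one sequence."""
    seq = reads[path[0]]
    for prev, nxt in zip(path, path[1:]):
        seq += reads[nxt][overlap_len(reads[prev], reads[nxt]):]
    return seq
```

With the toy reads `["ATGGCGT", "GCGTACC", "TACCTTA"]` and `start=0`, the path is `[0, 1, 2]` and the merged sequence is `ATGGCGTACCTTA`. Passing a `mate_pairs` set can change which overlap wins when a shorter overlap has long-range support.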

Assignees

Inventors

Classifications

  • G16B30/20 · Primary

    Sequence assembly · CPC title

  • Sequence alignment; Homology search · CPC title

  • G16B30/00 · Primary

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10839940B2 cover?
Exemplary embodiments of the present disclosure relate generally to methods, computer-accessible medium and systems for assembling haplotype and/or genotype sequences of at least one genome, which can be based upon, e.g., consistent layouts of short sequence reads and long-range genome related data. For example, a processing arrangement can be configured to perform a procedure including, e.g., …
Who is the assignee on this patent?
Mishra Bhubaneswar, Narzisi Giuseppe, Univ New York
What technology area does this patent fall under?
Primary CPC classification G16B30/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 17, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).