Biological sequence variant characterization
US-2016103954-A1 · Apr 14, 2016 · US
US10600499B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10600499-B2 |
| Application number | US-201615208656-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 13, 2016 |
| Priority date | Jul 13, 2016 |
| Publication date | Mar 24, 2020 |
| Grant date | Mar 24, 2020 |
Official abstract text for this publication.
Techniques for identifying variations in sequence data relative to reference sequence data. The techniques include accessing information specifying multiple sets of variants in the sequence data relative to reference sequence data, each of the multiple sets of variants being generated by using a respective variant identification technique; and determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data, the determining comprising: determining whether a first variant is present at a first position in the sequence data based, at least in part, on one or more variants at one or more other positions in the sequence data.
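As a rough illustration of the reconciliation idea in the abstract, the sketch below treats the true variant type at each position as a hidden state and the calls from several variant identification techniques as observations. It then decodes the most likely state sequence with the Viterbi algorithm, which the claims name. Everything concrete here is an assumption made for illustration and does not come from the patent: the four-state alphabet `STATES`, the single per-caller accuracy used as an emission model, the start and transition probabilities, the treatment of consecutive list entries as adjacent positions, and the helper names `emission_logp`, `viterbi`, and `transition_row`.

```python
import math

# Hypothetical state alphabet: variant type (or no variant) at a position.
STATES = ["none", "snp", "ins", "del"]


def emission_logp(state, calls, accuracies):
    """Log P(calls | true state), assuming callers err independently."""
    logp = 0.0
    for call, acc in zip(calls, accuracies):
        # Assumed error model: a caller reports the true state with
        # probability `acc` and each wrong state with equal probability.
        p = acc if call == state else (1.0 - acc) / (len(STATES) - 1)
        logp += math.log(p)
    return logp


def viterbi(observations, start, trans, accuracies):
    """Return the most likely hidden state for each position."""
    best = {s: math.log(start[s]) + emission_logp(s, observations[0], accuracies)
            for s in STATES}
    backpointers = []
    for calls in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            q = max(STATES, key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers[s] = q
            best[s] = (prev[q] + math.log(trans[q][s])
                       + emission_logp(s, calls, accuracies))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def transition_row(state):
    """Toy transition probabilities out of `state` (illustrative values)."""
    if state == "none":
        return {"none": 0.97, "snp": 0.01, "ins": 0.01, "del": 0.01}
    row = {s: 0.05 for s in STATES}
    row["none"] = 0.7
    row[state] = 0.2
    return row


start = transition_row("none")
trans = {s: transition_row(s) for s in STATES}
accuracies = (0.9, 0.8)  # assumed per-caller accuracy, two callers

# One tuple per position: (call from caller 1, call from caller 2).
observations = [
    ("none", "none"),
    ("snp", "snp"),
    ("none", "none"),
    ("ins", "none"),
    ("none", "none"),
]

path = viterbi(observations, start, trans, accuracies)
reconciled = {pos: s for pos, s in enumerate(path) if s != "none"}
print(path)        # ['none', 'snp', 'none', 'none', 'none']
print(reconciled)  # {1: 'snp'}
```

In this toy run the SNP reported by both callers is kept, while the insertion reported by only one caller is dropped because, under these assumed probabilities, a single unsupported call does not outweigh the prior toward no variation.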
Opening claim text (preview).
What is claimed is:
1. A system for identifying variations in sequence data relative to reference sequence data specifying a reference genome, the system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: aligning the sequence data to the reference sequence data specifying the reference genome to obtain aligned sequence data; determining information specifying multiple sets of variants in the sequence data relative to the reference sequence data specifying the reference genome at least in part by: applying a first variant identification technique to the aligned sequence data to obtain a first set of variants of the multiple sets of variants; and applying a second variant identification technique to the aligned sequence data to obtain a second set of variants of the multiple sets of variants, wherein the first variant identification technique is different from the second variant identification technique; determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data specifying the reference genome, the determining comprising: accessing a statistical model of variant dynamics across positions of the reference sequence data specifying the reference genome, wherein the statistical model encodes information indicating a probability that a first variant associated with a first set of characteristics is present at a first position in the reference sequence data specifying the reference genome based on a second variant associated with a second set of characteristics being present at a second position in the reference sequence data specifying the reference genome; using a Viterbi algorithm or a forward-backward algorithm for hidden Markov models to determine whether a first variant of the multiple sets of variants is present at a first position in the sequence data based, at least in part, on the statistical model and one or more variants at one or more other positions in the sequence data.
2. The system of claim 1, wherein determining whether the first variant of the multiple sets of variants is present at the first position comprises determining whether there is an insertion, a deletion, a single nucleotide polymorphism, or an inversion present at the first position in the sequence data or whether there is no variation at the first position in the sequence data relative to the reference sequence data specifying the reference genome.
3. The system of claim 1, wherein the determining comprises determining information specifying a third set of variants in the sequence data generated by using a third respective variant identification technique, wherein the third variant identification technique is different from the first and second variant identification techniques.
4. The system of claim 3, wherein the determining comprises: determining information specifying a fourth set of variants in the sequence data generated by using a fourth variant identification technique; and determining information specifying a fifth set of variants in the sequence data generated by using a fifth variant identification technique.
5. The system of claim 1, wherein determining the reconciled set of variants comprises selecting each variant in the reconciled set of variants from the multiple sets of variants.
6. The system of claim 1, wherein determining the reconciled set of variants comprises identifying no more than one variant for each position in the sequence data.
7. The system of claim 1, wherein the processor-executable instructions further cause the at least one computer hardware processor to perform: estimating, using training sequence data, the statistical model of variant dynamics, comprising estimating the probability that the first variant associated with the first set of characteristics is present at the first position in the reference sequence data specifying the reference genome based on the second variant associated with the second set of characteristics being present at the second position in the reference sequence data specifying the reference genome.
8. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed by using a likelihood that some variant is present at the first position in the sequence data given that a variant is present at a second position in the sequence data, wherein the second position precedes the first position in the sequence data.
9. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed by using a likelihood that a variant of a first type is present at the first position in the sequence data given that a variant of a second type is present at a second position in the sequence data, wherein the second position precedes the first position in the sequence data.
10. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed based, at least in part, on a measure of a true positive rate and/or a false negative rate of the first variant identification technique for a particular type of variant.
11. The system of claim 10, wherein the processor-executable instructions further cause the at least one computer hardware processor to perform: estimating the true positive rate and/or the false negative rate of the first variant identification technique for the particular type of variant.
12. The system of claim 1, wherein the statistical model encodes information indicating, for each position of a set of all positions in the reference sequence data specifying the reference genome, a probability of a first type of variant being present at the position based on a second type of variant being present at a different position in the set of all positions in the reference sequence data specifying the reference genome.
13. The system of claim 12, further comprising estimating the statistical model of variant dynamics from sequence data, comprising estimating, for each position of the set of all positions in the reference sequence data specifying the reference genome, the probability of the first type of variant being present at the position based on the second type of variant being present at the different position.
14. The system of claim 1, wherein using the Viterbi algorithm or forward-backward algorithm for hidden Markov models comprises: determining the first variant of the multiple sets of variants is present at the first position; determining a second variant of the multiple sets of variants is present at a second position based, at least in part, on the statistical model and the first variant.
15. The system of claim 14, further comprising determining a third variant of the multiple sets of variants is present at a third position based, at least in part, on the statistical model and the second variant.
16. The system of claim 1, wherein using the Viterbi algorithm or forward-backward algorithm for hidden Markov models comprises: determining a first set of probabilities for the first position, comprising determining each probability of the first set based on (a) an associated possible variant in a set of possible variants and (b) the one or more variants at the one or more other positions in the sequence
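Claim 1 names the forward-backward algorithm as an alternative to Viterbi decoding. Where Viterbi returns a single best labeling, forward-backward yields a probability for each candidate variant type at each position, given the calls at all positions. The sketch below shows that computation on a toy hidden Markov model. The following are assumptions for illustration and are not taken from the patent: the state alphabet `STATES`, the single per-caller accuracy used as an emission model, the numeric start and transition probabilities, and the helper names `emission_p`, `posteriors`, and `transition_row`.

```python
# Hypothetical state alphabet: variant type (or no variant) at a position.
STATES = ["none", "snp", "ins", "del"]


def emission_p(state, calls, accuracies):
    """P(calls | true state), assuming callers err independently."""
    p = 1.0
    for call, acc in zip(calls, accuracies):
        # Assumed error model: correct with probability `acc`,
        # otherwise uniform over the wrong states.
        p *= acc if call == state else (1.0 - acc) / (len(STATES) - 1)
    return p


def normalize(row):
    total = sum(row.values())
    return {s: v / total for s, v in row.items()}


def posteriors(observations, start, trans, accuracies):
    """Per-position posterior P(state | all observations)."""
    n = len(observations)
    # Forward pass, rescaled at each position to avoid underflow.
    fwd = []
    for t, calls in enumerate(observations):
        row = {}
        for s in STATES:
            if t == 0:
                prior = start[s]
            else:
                prior = sum(fwd[-1][q] * trans[q][s] for q in STATES)
            row[s] = prior * emission_p(s, calls, accuracies)
        fwd.append(normalize(row))
    # Backward pass, also rescaled.
    bwd = [None] * n
    bwd[-1] = {s: 1.0 for s in STATES}
    for t in range(n - 2, -1, -1):
        nxt = observations[t + 1]
        bwd[t] = normalize({
            s: sum(trans[s][q] * emission_p(q, nxt, accuracies) * bwd[t + 1][q]
                   for q in STATES)
            for s in STATES
        })
    # Combine; the scaling constants cancel when each row is renormalized.
    return [normalize({s: f[s] * b[s] for s in STATES})
            for f, b in zip(fwd, bwd)]


def transition_row(state):
    """Toy transition probabilities out of `state` (illustrative values)."""
    if state == "none":
        return {"none": 0.97, "snp": 0.01, "ins": 0.01, "del": 0.01}
    row = {s: 0.05 for s in STATES}
    row["none"] = 0.7
    row[state] = 0.2
    return row


start = transition_row("none")
trans = {s: transition_row(s) for s in STATES}
accuracies = (0.9, 0.8)  # assumed per-caller accuracy, two callers

# One tuple per position: (call from caller 1, call from caller 2).
observations = [
    ("none", "none"),
    ("snp", "snp"),
    ("none", "none"),
    ("ins", "none"),
    ("none", "none"),
]

post = posteriors(observations, start, trans, accuracies)
for pos, row in enumerate(post):
    top = max(row, key=row.get)
    print(pos, top, round(row[top], 3))
```

A per-position probability is useful when a downstream step wants to apply its own confidence threshold, for example keeping only variants whose posterior exceeds some cutoff, rather than accepting a single decoded path.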