Systems and methods for reconciling variants in sequence data relative to reference sequence data

US10600499B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10600499-B2
Application number  US-201615208656-A
Country             US
Kind code           B2
Filing date         Jul 13, 2016
Priority date       Jul 13, 2016
Publication date    Mar 24, 2020
Grant date          Mar 24, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for identifying variations in sequence data relative to reference sequence data. The techniques include accessing information specifying multiple sets of variants in the sequence data relative to reference sequence data, each of the multiple sets of variants being generated by using a respective variant identification technique; and determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data, the determining comprising: determining whether a first variant is present at a first position in the sequence data based, at least in part, on one or more variants at one or more other positions in the sequence data.
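
As a rough illustration of the input the abstract describes (several variant sets, one per identification technique), the sketch below groups calls by position so that agreement and disagreement between techniques is visible per site. The `Call` record, the caller names, and the coordinates are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Call(NamedTuple):
    """One reported variant (hypothetical record layout)."""
    position: int      # coordinate on the reference sequence
    variant_type: str  # e.g. "SNP", "INS", "DEL"


def group_calls_by_position(
    call_sets: Dict[str, List[Call]],
) -> Dict[int, Dict[str, str]]:
    """Return {position: {caller name: variant type}} sorted by position."""
    by_position: Dict[int, Dict[str, str]] = defaultdict(dict)
    for caller, calls in call_sets.items():
        for call in calls:
            by_position[call.position][caller] = call.variant_type
    return dict(sorted(by_position.items()))


# Two made-up call sets that agree at 101, conflict at 250,
# and have a call from only one technique at 300.
call_sets = {
    "caller_a": [Call(101, "SNP"), Call(250, "DEL")],
    "caller_b": [Call(101, "SNP"), Call(250, "INS"), Call(300, "SNP")],
}
observations = group_calls_by_position(call_sets)
```

Grouping by position is only a preprocessing step. The reconciliation itself, as claimed, decides each site using a statistical model and the variants at other positions.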

First claim

Opening claim text (preview).

What is claimed is:

1. A system for identifying variations in sequence data relative to reference sequence data specifying a reference genome, the system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: aligning the sequence data to the reference sequence data specifying the reference genome to obtain aligned sequence data; determining information specifying multiple sets of variants in the sequence data relative to the reference sequence data specifying the reference genome at least in part by: applying a first variant identification technique to the aligned sequence data to obtain a first set of variants of the multiple sets of variants; and applying a second variant identification technique to the aligned sequence data to obtain a second set of variants of the multiple sets of variants, wherein the first variant identification technique is different from the second variant identification technique; determining, using the information specifying the multiple sets of variants in the sequence data, a reconciled set of variants in the sequence data relative to the reference sequence data specifying the reference genome, the determining comprising: accessing a statistical model of variant dynamics across positions of the reference sequence data specifying the reference genome, wherein the statistical model encodes information indicating a probability that a first variant associated with a first set of characteristics is present at a first position in the reference sequence data specifying the reference genome based on a second variant associated with a second set of characteristics being present at a second position in the reference sequence data specifying the reference genome; using a Viterbi algorithm or a forward-backward algorithm for hidden Markov models to determine whether a first variant of the multiple sets of variants is present at a first position in the sequence data based, at least in part, on the statistical model and one or more variants at one or more other positions in the sequence data.

2. The system of claim 1, wherein determining whether the first variant of the multiple sets of variants is present at the first position comprises determining whether there is an insertion, a deletion, a single nucleotide polymorphism, or an inversion present at the first position in the sequence data or whether there is no variation at the first position in the sequence data relative to the reference sequence data specifying the reference genome.

3. The system of claim 1, wherein the determining comprises determining information specifying a third set of variants in the sequence data generated by using a third respective variant identification technique, wherein the third variant identification technique is different from the first and second variant identification techniques.

4. The system of claim 3, wherein the determining comprises: determining information specifying a fourth set of variants in the sequence data generated by using a fourth variant identification technique; and determining information specifying a fifth set of variants in the sequence data generated by using a fifth variant identification technique.

5. The system of claim 1, wherein determining the reconciled set of variants comprises selecting each variant in the reconciled set of variants from the multiple sets of variants.

6. The system of claim 1, wherein determining the reconciled set of variants comprises identifying no more than one variant for each position in the sequence data.

7. The system of claim 1, wherein the processor-executable instructions further cause the at least one computer hardware processor to perform: estimating, using training sequence data, the statistical model of variant dynamics, comprising estimating the probability that the first variant associated with the first set of characteristics is present at the first position in the reference sequence data specifying the reference genome based on the second variant associated with the second set of characteristics being present at the second position in the reference sequence data specifying the reference genome.

8. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed by using a likelihood that some variant is present at the first position in the sequence data given that a variant is present at a second position in the sequence data, wherein the second position precedes the first position in the sequence data.

9. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed by using a likelihood that a variant of a first type is present at the first position in the sequence data given that a variant of a second type is present at a second position in the sequence data, wherein the second position precedes the first position in the sequence data.

10. The system of claim 1, wherein determining the first variant of the multiple sets of variants at the first position is performed based, at least in part, on a measure of a true positive rate and/or a false negative rate of the first variant identification technique for a particular type of variant.

11. The system of claim 10, wherein the processor-executable instructions further cause the at least one computer hardware processor to perform: estimating the true positive rate and/or the false negative rate of the first variant identification technique for the particular type of variant.

12. The system of claim 1, wherein the statistical model encodes information indicating, for each position of a set of all positions in the reference sequence data specifying the reference genome, a probability of a first type of variant being present at the position based on a second type of variant being present at a different position in the set of all positions in the reference sequence data specifying the reference genome.

13. The system of claim 12, further comprising estimating the statistical model of variant dynamics from sequence data, comprising estimating, for each position of the set of all positions in the reference sequence data specifying the reference genome, the probability of the first type of variant being present at the position based on the second type of variant being present at the different position.

14. The system of claim 1, wherein using the Viterbi algorithm or forward-backward algorithm for hidden Markov models comprises: determining the first variant of the multiple sets of variants is present at the first position; determining a second variant of the multiple sets of variants is present at a second position based, at least in part, on the statistical model and the first variant.

15. The system of claim 14, further comprising determining a third variant of the multiple sets of variants is present at a third position based, at least in part, on the statistical model and the second variant.

16. The system of claim 1, wherein using the Viterbi algorithm or forward-backward algorithm for hidden Markov models comprises: determining a first set of probabilities for the first position, comprising determining each probability of the first set based on (a) an associated possible variant in a set of possible variants and (b) the one or more variants at the one or more other positions in the sequence
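
The claims name the Viterbi algorithm for hidden Markov models as one way to decide which variant, if any, is present at each position. A minimal sketch of that idea follows. It makes several assumptions that are not in the patent: the hidden state is the true variant type at a site, only sites with at least one call form the chain, each technique has a single hypothetical accuracy value instead of per-type true positive and false negative rates, and inversions are left out for brevity. All probabilities are invented for illustration.

```python
import math
from typing import Dict, List

STATES = ["NONE", "SNP", "INS", "DEL"]  # assumed hidden states (inversion omitted)


def make_transitions() -> Dict[str, Dict[str, float]]:
    """Illustrative P(state at next site | state at previous site)."""
    trans = {}
    for r in STATES:
        if r == "NONE":
            trans[r] = {s: (0.7 if s == "NONE" else 0.1) for s in STATES}
        else:
            trans[r] = {s: (0.4 if s in ("NONE", r) else 0.1) for s in STATES}
    return trans


def emission_logp(state: str, calls: Dict[str, str], accuracy: Dict[str, float]) -> float:
    """log P(calls from all techniques | true state), techniques assumed independent."""
    logp = 0.0
    for caller, acc in accuracy.items():
        called = calls.get(caller, "NONE")  # no call is read as "NONE"
        p = acc if called == state else (1.0 - acc) / (len(STATES) - 1)
        logp += math.log(p)
    return logp


def viterbi(observations: List[Dict[str, str]], start: Dict[str, float],
            trans: Dict[str, Dict[str, float]], accuracy: Dict[str, float]) -> List[str]:
    """Return the most probable sequence of true states, one per site."""
    # best[t][s]: log-probability of the best path that ends in state s at site t
    best = [{s: math.log(start[s]) + emission_logp(s, observations[0], accuracy)
             for s in STATES}]
    back = []
    for calls in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: best[-1][r] + math.log(trans[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans[prev][s])
                         + emission_logp(s, calls, accuracy))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


accuracy = {"caller_a": 0.95, "caller_b": 0.80}  # hypothetical per-technique accuracy
start = {"NONE": 0.7, "SNP": 0.1, "INS": 0.1, "DEL": 0.1}
observations = [
    {"caller_a": "SNP", "caller_b": "SNP"},  # both agree
    {"caller_a": "DEL", "caller_b": "INS"},  # conflict
    {"caller_b": "SNP"},                     # only the less accurate technique calls
]
reconciled = viterbi(observations, start, make_transitions(), accuracy)
print(reconciled)  # ['SNP', 'DEL', 'NONE']
```

With these made-up numbers, the conflict at the second site resolves to the more accurate technique's call, and the unsupported call at the third site is dropped. Each site receives exactly one state, which matches the "no more than one variant for each position" limitation of claim 6. A forward-backward pass over the same model would instead give a posterior probability for each state at each site.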

Assignees

Seven Bridges Genomics Inc

Inventors

Classifications

  • Physics · mapped topic

  • G16B30/00 · Primary

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Sequence alignment; Homology search · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10600499B2 cover?
Techniques for identifying variations in sequence data relative to reference sequence data. The techniques include accessing information specifying multiple sets of variants in the sequence data relative to reference sequence data, each of the multiple sets of variants being generated by using a respective variant identification technique; and determining, using the information specifying the m…
Who is the assignee on this patent?
Seven Bridges Genomics Inc
What technology area does this patent fall under?
Primary CPC classification G16B30/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 24, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).