Systems and methods for determining structural variation and phasing using variant call data

US2016232291A1 · US · A1

Patent metadata
Publication number: US-2016232291-A1
Application number: US-201615019928-A
Country: US
Kind code: A1
Filing date: Feb 9, 2016
Priority date: Feb 9, 2015
Publication date: Aug 11, 2016
Grant date: (not listed)

Abstract

Official abstract text for this publication.

Systems and methods for determining structural variation and phasing using variant call data obtained from nucleic acid of a biological sample are provided. Sequence reads are obtained, each comprising a portion corresponding to a subset of the test nucleic acid and a portion encoding a barcode independent of the sequencing data. Bin information is obtained. Each bin represents a different portion of the sample nucleic acid. Each bin corresponds to a set of sequence reads in a plurality of sets of sequence reads formed from the sequence reads such that each sequence read in a respective set of sequence reads corresponds to a subset of the nucleic acid represented by the bin corresponding to the respective set. Binomial tests identify bin pairs having more sequence reads with the same barcode in common than expected by chance. Probabilistic models determine structural variation likelihood from the sequence reads of these bin pairs.
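The binomial screening step described in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the function names are hypothetical, and it assumes that each of the B barcodes falls in both bins independently with probability (n1/B)(n2/B), so that the expected overlap is n1·n2/B. The p-value is taken as the upper tail P(X > n), which is one reading of "1 − P_Binom" as one minus a cumulative probability.

```python
import math


def binomial_upper_tail(n, trials, prob):
    """Return P(X > n) for X ~ Binomial(trials, prob).

    The tail is summed term by term in log space, which avoids the
    precision loss of computing 1 - CDF when the p-value is tiny.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    total = 0.0
    for k in range(max(n + 1, 0), trials + 1):
        log_pmf = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * math.log(prob)
            + (trials - k) * math.log1p(-prob)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def barcode_overlap_p_value(n, n1, n2, total_barcodes):
    """p-value for seeing more than n shared barcodes between two bins.

    Assumption (not stated in the source): barcodes occur in the two
    bins independently, so a given barcode is shared with probability
    (n1 / B) * (n2 / B), giving an expected overlap of n1 * n2 / B.
    """
    prob = (n1 / total_barcodes) * (n2 / total_barcodes)
    return binomial_upper_tail(n, total_barcodes, prob)


if __name__ == "__main__":
    # Two bins with 50 barcodes each out of 1000: about 2.5 shared by chance.
    print(barcode_overlap_p_value(2, 50, 50, 1000))
    print(barcode_overlap_p_value(20, 50, 50, 1000))
```

A bin pair would be kept for further modelling only when this value falls below the chosen cut-off. The linear loop is adequate for small barcode counts; a production version would likely use a library survival function instead.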

First claim

Opening claim text (preview).

What is claimed is:

1. A method of determining a likelihood of a structural variation occurring in a test nucleic acid obtained from a single biological sample, the method comprising: at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors:

(A) obtaining a plurality of sequence reads from a plurality of sequencing reactions in which the test nucleic acid is fragmented, wherein each respective sequence read in the plurality of sequence reads comprises a first portion that corresponds to a subset of the test nucleic acid and a second portion that encodes a respective barcode for the respective sequence read in a plurality of barcodes, and each respective barcode is independent of the sequencing data of the test nucleic acid, and the plurality of sequence reads collectively include the plurality of barcodes;

(B) obtaining bin information for a plurality of bins, wherein each respective bin in the plurality of bins represents a different portion of the test nucleic acid, the bin information identifies, for each respective bin in the plurality of bins, a set of sequence reads in a plurality of sets of sequence reads that are in the plurality of sequence reads, and the respective first portion of each respective sequence read in each respective set of sequence reads in the plurality of sets of sequence reads corresponds to a subset of the test nucleic acid that at least partially overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads;

(C) identifying, from among the plurality of bins, a first bin and a second bin that correspond to portions of the test nucleic acid that are nonoverlapping, wherein the first bin is represented by a first set of sequence reads in the plurality of sequence reads and the second bin is represented by a second set of sequence reads in the plurality of sequence reads;

(D) determining a first value that represents a numeric probability or likelihood that the number of barcodes common to the first set and the second set is attributable to chance;

(E) responsive to a determination that the first value satisfies a predetermined cut-off value, for each barcode that is common to the first bin and the second bin, obtaining a fragment pair thereby obtaining one or more fragment pairs, each fragment pair in the one or more fragment pairs (i) corresponding to a different barcode that is common to the first bin and the second bin and (ii) consisting of a different first calculated fragment and a different second calculated fragment, wherein, for each respective fragment pair in the one or more fragment pairs:

the different first calculated fragment consists of a respective first subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective first subset of sequence reads is within a predefined genetic distance of another sequence read in the respective first subset of sequence reads, the different first calculated fragment of the respective fragment pair originates with a first sequence read having the barcode corresponding to the respective fragment pair in the first bin, and each sequence read in the respective first subset of sequence reads is from the first bin, and

the different second calculated fragment consists of a respective second subset of sequence reads in the plurality of sequence reads having the barcode corresponding to the respective fragment pair, wherein each sequence read in the respective second subset of sequence reads is within a predefined genetic distance of another sequence read in the respective second subset of sequence reads, the different second calculated fragment of the respective fragment pair originates with a second sequence read having the barcode corresponding to the respective fragment pair in the second bin, and each sequence read in the respective second subset of sequence reads is from the second bin; and

(F) computing a respective likelihood based upon a probability of occurrence of a first model and a probability of occurrence of a second model regarding the one or more fragment pairs to thereby provide a likelihood of a structural variation in the test nucleic acid, wherein (i) the first model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given no structural variation in the target nucleic acid sequence and are part of a common molecule, and (ii) the second model specifies that the respective first calculated fragments and the respective second calculated fragments of the one or more fragment pairs are observed given structural variation in the target nucleic acid sequence.

2. The method of claim 1, wherein the first bin and the second bin are at least 50 kilobases apart on the test nucleic acid.

3. The method of claim 1, wherein the determining (D) uses a binomial test to compute the first value of the form:

$p = 1 - P_{\mathrm{Binom}}(n;\ n_1 n_2 / B)$

wherein $p$ is the first value, expressed as a p-value, $n$ is the number of unique barcodes that is found in both the first and second set of sequence reads, $n_1$ is the number of unique barcodes in the first set of sequence reads, $n_2$ is the number of unique barcodes in the second set of sequence reads, and $B$ is the total number of unique barcodes across the plurality of bins.

4. The method of claim 1, wherein the single biological sample is human, the test nucleic acid is the genome of the biological sample, and the first value satisfies the predetermined cut-off value when the first value is $10^{-14}$ or less or when the first value is $10^{-15}$ or less.

5. The method of claim 1, wherein each bin in the plurality of bins represents at least 20 kilobases of the test nucleic acid, at least 50 kilobases of the test nucleic acid, at least 100 kilobases of the test nucleic acid, at least 250 kilobases of the test nucleic acid, or at least 500 kilobases of the test nucleic acid.

6. The method of claim 1, wherein each respective sequence read in each respective set of sequence reads in the plurality of sequence reads has a respective first portion that corresponds to a subset of the test nucleic acid that fully overlaps the different portion of the test nucleic acid that is represented by the bin corresponding to the respective set of sequence reads.

7. The method of claim 1, wherein the barcode in the second portion of each respective sequence read in the plurality of sequence reads encodes a unique predetermined value selected from the set {1, ..., 1024}, selected from the set {1, ..., 4096}, selected from the set {1, ..., 16384}, selected from the set {1, ..., 65536}, selected from the set {1, ..., 262144}, selected from the set {1, ..., 1048576}, selected from the set {1, ..., 4194304}, selected from the set {1, ..., 16777216}, selected from the set {1, ..., 67108864}, or selected from the set {1, ..., $1 \times 10^{12}$}.

8. The method of claim 1, wherein the structural variation is deemed to have occurred, the method further comprising treating a subject that originated the biological sample with a treatment regimen responsive to the structural variation.

9. The method of claim 1, wherein an identity of the first and second bin is determined by the identifying (C) using sparse matrix multiplication of the form: $V = A_1^{T} A_2$, wherein $A_1$ is a first $B \times N_1$ matrix of barcodes that includes the first bin, $A_2$ is a second $B \times N_2$ matrix of barcodes that inc
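The sparse product named in claim 9 can be sketched as follows. Because the claim preview is cut off, the matrix layout here is an assumption: each matrix is taken to have one row per barcode and one column per bin, with a 1 where the barcode occurs in the bin, so that entry (i, j) of the product counts barcodes shared by bin i of the first set and bin j of the second. The function name and the dictionary-of-sets input are hypothetical, and the product is computed through an inverted index rather than a matrix library.

```python
from collections import defaultdict


def shared_barcode_counts(bins_a, bins_b):
    """Count barcodes shared by every (bin in bins_a, bin in bins_b) pair.

    bins_a and bins_b map a bin identifier to the set of barcodes seen
    in that bin. Only pairs with at least one shared barcode appear in
    the result, mirroring the nonzero entries of a sparse A1^T A2.
    """
    # Inverted indexes: barcode -> bins containing it (the matrix rows).
    barcode_to_a = defaultdict(list)
    for bin_id, barcodes in bins_a.items():
        for barcode in barcodes:
            barcode_to_a[barcode].append(bin_id)

    barcode_to_b = defaultdict(list)
    for bin_id, barcodes in bins_b.items():
        for barcode in barcodes:
            barcode_to_b[barcode].append(bin_id)

    # Each barcode row contributes 1 to every (i, j) pair it touches.
    counts = defaultdict(int)
    for barcode, a_ids in barcode_to_a.items():
        for i in a_ids:
            for j in barcode_to_b.get(barcode, ()):
                counts[(i, j)] += 1
    return dict(counts)


if __name__ == "__main__":
    bins_a = {"a0": {"AAC", "GGT", "TTA"}, "a1": {"CCG"}}
    bins_b = {"b0": {"GGT", "TTA", "CAT"}, "b1": {"AAC"}}
    print(shared_barcode_counts(bins_a, bins_b))
```

Working per barcode keeps the cost proportional to the number of nonzero contributions, which is the point of using a sparse product when most bin pairs share no barcodes.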

Assignees

10X Genomics Inc

Inventors

Classifications

  • G06F19/22 · Primary

    Physics · mapped topic

  • ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title

  • Drugs for disorders of the metabolism (of the blood or the extracellular fluid A61P7/00) · CPC title

  • C12Q1/6837 · Primary

    using probe arrays or probe chips (C12Q1/6874 takes precedence) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
10X Genomics Inc
What technology area does this patent fall under?
Primary CPC classification G06F19/22. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 11, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).