Models for targeted sequencing of RNA

US2020105375A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2020105375-A1
Application number: US-201916584936-A
Country: US
Kind code: A1
Filing date: Sep 26, 2019
Priority date: Sep 28, 2018
Publication date: Apr 2, 2020
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for processing sequencing data of ribonucleic acid (RNA) molecules from a test sample include obtaining a plurality of sequence reads each derived from a RNA molecule obtained from the test sample, filtering the plurality of sequence reads, identifying one or more candidate variants from the filtered plurality of sequence reads, determining a quality score for each of the identified one or more candidate variants, the quality score indicating a likelihood that the candidate variant is a false positive detection of a mutation in the RNA molecule, and outputting the one or more candidate variants having a quality score greater than a threshold quality score.
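The flow in the abstract (filter reads, collect candidate variants, keep those scoring above a threshold) can be sketched roughly as below. This is a minimal illustration, not the patented implementation: the function names, the ungapped `(start, sequence)` read representation, and the reading of "filtering" as discarding or trimming reads are all assumptions. The default values of three consecutive mismatches and six leading bases echo the figures given in the claims on this page.

```python
from collections import Counter

def longest_mismatch_run(seq, ref_window):
    """Length of the longest run of consecutive mismatches against the reference."""
    run = best = 0
    for base, ref_base in zip(seq, ref_window):
        run = run + 1 if base != ref_base else 0
        best = max(best, run)
    return best

def filter_reads(reads, ref, max_run=3, trim=6):
    """Drop reads with a long mismatch run, then trim leading bases.

    reads: list of (start, sequence) pairs, assumed aligned to ref without gaps
    (a simplification made for this sketch).
    """
    kept = []
    for start, seq in reads:
        window = ref[start:start + len(seq)]
        if longest_mismatch_run(seq, window) >= max_run:
            continue  # assumed meaning of "filtering": discard the read
        kept.append((start + trim, seq[trim:]))
    return kept

def candidate_counts(reads, ref):
    """Per-position depth and per-(position, base) alternate allele counts."""
    depth, alt = Counter(), Counter()
    for start, seq in reads:
        for i, base in enumerate(seq):
            pos = start + i
            depth[pos] += 1
            if base != ref[pos]:
                alt[(pos, base)] += 1
    return depth, alt

def report(scores, threshold):
    """Return candidates whose quality score is strictly greater than the threshold."""
    return [variant for variant, q in scores.items() if q > threshold]
```

A caller would pass the filtered reads to `candidate_counts`, score each `(position, base)` key with whatever noise model is available, and hand the resulting dictionary to `report`.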

First claim

Opening claim text (preview).

What is claimed is:

1. A method for processing sequencing data of ribonucleic acid (RNA) molecules from a test sample, the method comprising: obtaining a plurality of sequence reads each derived from a RNA molecule obtained from the test sample; filtering the plurality of sequence reads; identifying one or more candidate variants from the filtered plurality of sequence reads; determining a quality score for each of the identified one or more candidate variants, the quality score indicating a likelihood that the candidate variant is a false positive detection of a mutation in the RNA molecule; and outputting the one or more candidate variants having a quality score greater than a threshold quality score.
2. The method of claim 1, wherein obtaining the plurality of sequence reads comprises: obtaining the test sample from an individual, the test sample comprising a plurality of RNA molecules; preparing a RNA sequencing library from the plurality of RNA molecules; and generating the plurality of sequence reads from the RNA sequencing library.
3. The method of claim 2, wherein the sequencing library is enriched for one or more targeted RNA molecules prior to obtaining the plurality of sequence reads.
4. The method of claim 2, wherein the plurality of sequence reads are obtained using next-generation sequencing of the RNA sequencing library.
5. The method of claim 2, wherein the plurality of RNA molecules are RNA transcripts, wherein the RNA transcripts are messenger RNA, transfer RNA, or ribosomal RNA.
6. The method of claim 1, wherein filtering the plurality of sequence reads comprises: filtering at least one sequence read of the plurality of sequence reads having a least a threshold number of continuous nucleotide base mutations; filtering at least one sequence read of the plurality of sequence reads having at least a threshold depth; and/or filtering out a number of leading nucleotide bases of at least one sequence read of the plurality of sequence reads.
7. The method of claim 6, wherein the threshold number of continuous nucleotide base mutations is at least three, the threshold depth is at least 50,000, or the number of leading nucleotide bases is six.
8. The method of claim 1, wherein the threshold quality score is determined by performing calibration using a plurality of calibration samples, each calibration sample including one or more control RNA molecules and a plurality of RNA molecules from one or more individuals.
9. The method of claim 8, wherein the one or more control RNA molecules are associated with External RNA Controls Consortium (ERCC) Spike-In Control Mixes, and wherein the one or more individuals are healthy.
10. The method of claim 8, wherein performing the calibration using calibration samples comprises: for each of the plurality of calibration samples: determining a depth of the calibration sample; and determining a sensitivity of the calibration sample, the sensitivity indicating a likelihood of detecting false positive mutations in the calibration sample.
11. The method of claim 1, wherein determining the quality score for a candidate variant comprises: accessing a plurality of parameters including a dispersion parameter r and a mean rate parameter m specific to the candidate variant, the r and m having been derived using a model; inputting read information of the plurality of sequence reads into a function parameterized by the plurality of parameters; and determining the quality score for the candidate variant using an output of the function based on the input read information.
12. The method of claim 11, wherein the plurality of parameters represent mean and shape parameters of a gamma distribution, and wherein the function is a negative binomial based on the plurality of sequence reads and the plurality of parameters.
13. The method of claim 11, wherein the plurality of parameters represent parameters of a distribution that encodes an uncertainty level of nucleotide mutations with respect to a given position of a sequence read.
14. The method of claim 13, wherein a gamma distribution is one component of a mixture of the distribution.
15. The method of claim 11, wherein the plurality of parameters are derived from a training sample of sequence reads from a plurality of healthy individuals.
16. The method of claim 15, wherein the training sample excludes a subset of the sequence reads from the plurality of healthy individuals based on filtering criteria when the sequence reads that have (i) a depth less than a threshold value or (ii) an allele frequency greater than a threshold frequency.
17. The method of claim 11, wherein the plurality of parameters are derived using a Bayesian Hierarchical model.
18. The method of claim 17, wherein the Bayesian Hierarchical model includes a multinomial distribution grouping positions of sequence reads into latent classes.
19. The method of claim 17, wherein the Bayesian Hierarchical model includes fixed covariates unrelated to training samples from healthy individuals, wherein the covariates are based on a plurality of nucleotides adjacent to a given position of a sequence read, or wherein the covariates are based on a level of uniqueness of a given sequence read relative to a target region of a genome.
20. The method of claim 17, wherein the Bayesian Hierarchical model is estimated using a Markov chain Monte Carlo method.
21. The method of claim 20, wherein the Markov chain Monte Carlo method uses a Metropolis-Hastings algorithm, a Gibbs sampling algorithm, or Hamiltonian mechanics.
22. The method of claim 11, wherein the sequence read information includes a depth d of the plurality of sequence reads, the function parameterized by m·d.
23. The method of claim 11, wherein the quality score is a Phred-scaled likelihood.
24. The method of claim 11, further comprising: determining that the candidate variant is a false positive mutation by comparing the quality score to a threshold quality score.
25. The method of claim 24, wherein the candidate variant is a single nucleotide variant.
26. The method of claim 25, wherein the model encodes noise levels of nucleotide mutations for one base of A, U, C, and G to each of the other three bases.
27. The method of claim 11, wherein the candidate variant is an insertion or deletion of at least one nucleotide.
28. The method of claim 27, wherein the model includes a distribution of lengths of insertions or deletions.
29. The method of claim 28, wherein the model separates an inference for determining a likelihood of an alternate allele from an inference for determining a length of the alternate allele using the distribution of lengths.
30. The method of claim 28, wherein the distribution of lengths comprises a multinomial with Dirichlet prior, wherein the Dirichlet prior on the multinomial distribution of lengths is determined by covariates of anchor positions of a genome.
31. The method of claim 27, wherein the model includes a distribution ω determined based on covariates.
32. The method of claim 27, wherein the model includes a distribution φ determined based on covariates and anchor positions of a genome.
33
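Claims 11, 12, 22, and 23 together describe a score built from a negative binomial with dispersion r and mean rate m, parameterized by m·d for depth d, and reported on a Phred scale. One plausible reading is sketched below. The choice of the upper-tail probability P(X ≥ observed alternate count) as the quantity being Phred-scaled is an assumption of this sketch, as are all function names; the claims do not spell out that detail. The code uses only the Python standard library and assumes m·d > 0 and r > 0.

```python
import math

def nb_log_pmf(x, mean, r):
    """Log-probability of count x under a negative binomial with this mean and dispersion r."""
    p = r / (r + mean)  # success probability in the (r, p) parameterization
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(p) + x * math.log1p(-p))

def nb_upper_tail(k, mean, r, rel_tol=1e-12, max_terms=1_000_000):
    """P(X >= k), summed upward from k until further terms are negligible."""
    total = 0.0
    for x in range(k, k + max_terms):
        term = math.exp(nb_log_pmf(x, mean, r))
        total += term
        # Past the mean the terms only shrink, so a tiny term means we can stop.
        if x > mean and term <= rel_tol * total:
            break
    return min(total, 1.0)

def phred_quality(alt_count, depth, m, r):
    """Phred-scaled probability that noise alone yields at least alt_count alternate reads.

    Assumption: expected noise count is m * depth, following the m·d parameterization.
    """
    p_noise = nb_upper_tail(alt_count, m * depth, r)
    return -10.0 * math.log10(max(p_noise, 1e-300))  # floor avoids log10(0) on underflow
```

As a sanity check on the parameterization, r = 1 reduces the negative binomial to a geometric distribution, so with a mean of 1 the tail P(X ≥ 3) is 0.5³ = 0.125, or about 9.03 on the Phred scale. Summing the tail upward rather than computing one minus the lower tail keeps precision when the probability is very small, which is the regime where high quality scores live.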

Assignees

Grail Inc

Inventors

Classifications

  • Supervised data analysis · CPC title

  • G16B30/10 · Primary

    Sequence alignment; Homology search · CPC title

  • G16B20/20 · Primary

    Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection · CPC title

  • ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

  • Sequence assembly · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2020105375A1 cover?
Systems and methods for processing sequencing data of ribonucleic acid (RNA) molecules from a test sample include obtaining a plurality of sequence reads each derived from a RNA molecule obtained from the test sample, filtering the plurality of sequence reads, identifying one or more candidate variants from the filtered plurality of sequence reads, determining a quality score for each of the id…
Who is the assignee on this patent?
Grail Inc
What technology area does this patent fall under?
Primary CPC classification G16B30/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 2, 2020 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).