Synthetic WGS bioinformatics validation

US10984890B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10984890-B2
Application number  US-201715639819-A
Country             US
Kind code           B2
Filing date         Jun 30, 2017
Priority date       Jun 30, 2016
Publication date    Apr 20, 2021
Grant date          Apr 20, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and devices for generating synthetic genomic datasets and validating bioinformatic pipelines for genomic analysis are disclosed. In preferred embodiments, synthetic maternal and paternal datasets with known variants are used with matched normal synthetic datasets to validate various bioinformatic pipelines. Bioinformatic pipelines are evaluated using the synthetic datasets to assess design changes and improvements. Accuracy, PPV, specificity, sensitivity, reproducibility, and limit of detection of the pipelines in calling variants in synthetic datasets is reported.
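
The accuracy, PPV, specificity, and sensitivity figures named in the abstract can be computed by comparing a pipeline's calls against the variants seeded into the synthetic dataset. The following is a minimal Python sketch, not the patent's implementation: the function name `variant_call_metrics`, the (chromosome, position, alt) tuple representation, and the `n_sites` parameter are assumptions made for illustration. It uses the standard confusion-matrix definitions and assumes at most one variant record per site. Reproducibility and limit of detection are not covered here.

```python
def variant_call_metrics(seeded, called, n_sites):
    """Score a variant caller against the known (seeded) variants.

    seeded, called: sets of (chrom, pos, alt) tuples (hypothetical representation).
    n_sites: number of positions assessed; needed to count true negatives.
    Assumes at most one variant record per site.
    """
    tp = len(seeded & called)   # seeded variants that were called
    fp = len(called - seeded)   # calls that were never seeded
    fn = len(seeded - called)   # seeded variants that were missed
    tn = n_sites - tp - fp - fn # sites correctly left uncalled

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, n_sites),
    }


# Toy example: four seeded SNPs, three recovered, one spurious call.
seeded = {("chr1", 10, "A"), ("chr1", 20, "C"), ("chr1", 30, "G"), ("chr1", 40, "T")}
called = {("chr1", 10, "A"), ("chr1", 20, "C"), ("chr1", 30, "G"), ("chr1", 55, "A")}
metrics = variant_call_metrics(seeded, called, n_sites=100)
```

Because the seeded variants are known exactly, every count in the confusion matrix is exact, which is the practical advantage of a synthetic truth set over a sequenced sample.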

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of testing or validating an algorithm associated with genomic analysis, comprising:
    introducing at least 500 single nucleotide polymorphism (SNP) sequences at a predetermined frequency and distribution into at least one set of autosome sequence data and a set of X-chromosome sequence data from a reference genome sequence to prepare a set of synthetic maternal autosome and X-Chromosome genome sequence data;
    introducing at least 500 SNP sequences at a predetermined frequency and distribution into at least one set of autosome sequence data of the reference genome sequence and a set of X- or Y-chromosome sequence data of the reference genome sequence to prepare a set of synthetic paternal genome sequence data; and
    merging the maternal and paternal synthetic genome sequences data into a combined synthetic genomic dataset;
    inputting the combined synthetic genomic dataset into the algorithm; and
    preparing a performance report listing deviations from the combined synthetic genomic dataset and
    wherein the maternal and paternal genome sequence data and synthetic genomic dataset are being manipulated on a non-transitory computer readable storage medium.

2. The method of claim 1 further comprising a step of sampling the combined synthetic genomic dataset to thereby produce a plurality of simulated reads.

3. The method of claim 2 wherein the step of sampling is performed to simulate a read coverage of at least 25×.

4. The method of claim 2 wherein the step of sampling is performed using a read error and base quality profile representative of a frozen tissue sample.

5. The method of claim 2 wherein the step of sampling is performed to produce simulated reads having a length of between 100 and 400 bases.

6. The method of claim 1 further comprising a step of including into the combined synthetic genomic dataset a list identifying type and position of the SNPs relative to the reference genome sequence data.

7. The method claim 1 further comprising a step of introducing into at least one of the synthetic maternal genome sequence data and the paternal genome sequence data a further genomic change selected from the group consisting of a single nucleotide variant (SNV), an indel, and a copy number alteration to thereby produce a synthetic somatic data set.

8. The method of claim 7 wherein the synthetic somatic data set further comprises a list identifying type and position of the further genomic change relative to the at least one of the synthetic maternal and paternal genome.

9. The method of claim 7 wherein the synthetic somatic data set further comprises a plurality of simulated reads from the synthetic somatic data set.

10. The method of claim 7 wherein the SNVs are based on at least one of COSMIC mutations, somatic TCGA mutations, and random locations in the genome.

11. The method of claim 7 wherein the copy number alteration is selected from the group consisting of
    (i) 25 small deletions, each with a size of 5,000 bp to 500,000 bp;
    (ii) 25 small tandem amplifications, each with a size of 5,000 bp to 500,000 bp and each having a copy number between 2 and 5;
    (iii) 10 small tandem hyperamplifications, with a size of 5,000 to 500,000 bp, and a copy number between 15 and 30; and
    (iv) large arm/chromosome deletions, each with a size between 30% and 100% of a chromosome, anchored to a telomere.

12. The method of claim 1 further comprising a step of including into the combined synthetic genomic dataset a plurality of simulated reads from the combined synthetic genomic dataset.

13. The method of claim 1, wherein the algorithm is an algorithm that groups a plurality of simulated reads from the combined synthetic genomic dataset.

14. The method of claim 1, wherein the algorithm is an algorithm that annotates a plurality or group of simulated reads from the combined synthetic genomic dataset.

15. The method of claim 1, wherein the algorithm is an algorithm that outputs a plurality of simulated reads from the combined synthetic genomic dataset between a sequencing device and an analysis engine.

16. The method of claim 1, wherein the algorithm is an algorithm that assembles and indexes a plurality of simulated reads from the combined synthetic genomic dataset.

17. The method of claim 1, wherein the algorithm is a variant calling algorithm.

18. A method of validating operation of a plurality of computing devices that are informationally coupled to each other, comprising a step of using the combined synthetic genomic dataset of claim 1 as an input into a first of the devices, and using an output of the first of the devices as input into a second of the devices.
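
The workflow in claims 1, 2, 3, 5, and 6 (seed SNPs into a reference, build maternal and paternal genomes, merge them, sample reads, and report deviations) can be sketched in Python as follows. This is an illustrative toy, not the patented implementation. All function names, the random stand-in reference, the chromosome sizes, and the uniform random SNP placement are assumptions; the claims call for a predetermined frequency and distribution, which a fixed seed only approximates. Reads are error-free, so the error and base quality profile of claim 4 is not modelled, and the "called" variants are faked rather than produced by a real pipeline.

```python
import random

BASES = "ACGT"

def introduce_snps(sequence, n_snps, rng):
    """Substitute n_snps bases at distinct random positions.

    Returns the mutated sequence and a truth list of (pos, ref, alt).
    """
    seq = list(sequence)
    truth = []
    for pos in sorted(rng.sample(range(len(seq)), n_snps)):
        ref = seq[pos]
        alt = rng.choice([b for b in BASES if b != ref])
        seq[pos] = alt
        truth.append((pos, ref, alt))
    return "".join(seq), truth

def make_parent(label, reference, chroms, snps_per_chrom, rng):
    """Build one synthetic parental genome plus its list of seeded SNPs."""
    genome, truth = {}, []
    for chrom in chroms:
        genome[chrom], snps = introduce_snps(reference[chrom], snps_per_chrom, rng)
        truth.extend((label, chrom, pos, ref, alt) for pos, ref, alt in snps)
    return genome, truth

def simulate_reads(contigs, read_len, coverage, rng):
    """Sample error-free reads so that total read bases are about
    coverage x total contig length (one possible definition of coverage)."""
    seqs = list(contigs.values())
    total = sum(len(s) for s in seqs)
    n_reads = coverage * total // read_len
    # Longer contigs receive proportionally more reads.
    picks = rng.choices(seqs, weights=[len(s) for s in seqs], k=n_reads)
    reads = []
    for seq in picks:
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append(seq[start:start + read_len])
    return reads

def deviation_report(truth, called):
    """List seeded variants that were missed and calls that were never seeded."""
    truth, called = set(truth), set(called)
    return {"missed": sorted(truth - called), "unexpected": sorted(called - truth)}

rng = random.Random(0)  # fixed seed so the dataset is reproducible

# Hypothetical stand-in for a reference genome: random toy chromosomes.
reference = {}
for name, size in [("chr1", 10_000), ("chrX", 5_000), ("chrY", 5_000)]:
    reference[name] = "".join(rng.choice(BASES) for _ in range(size))

# 250 SNPs per chromosome gives 500 per parent, the minimum in claim 1.
maternal, maternal_truth = make_parent("maternal", reference, ["chr1", "chrX"], 250, rng)
paternal, paternal_truth = make_parent("paternal", reference, ["chr1", "chrY"], 250, rng)

# Merge both parental genomes into one combined dataset with its truth list.
combined = {f"maternal_{c}": s for c, s in maternal.items()}
combined.update({f"paternal_{c}": s for c, s in paternal.items()})
truth = maternal_truth + paternal_truth

# Claims 3 and 5: coverage of at least 25x, read length between 100 and 400.
reads = simulate_reads(combined, read_len=100, coverage=25, rng=rng)

# Placeholder for a pipeline's output: pretend it missed the first seeded SNP.
called = truth[1:]
report = deviation_report(truth, called)
```

Keeping the truth list alongside the sequences, as in claim 6, is what lets the deviation report be generated mechanically: any difference between `called` and `truth` is by construction a pipeline error rather than an unknown property of the sample.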

Assignees

Nantomics Llc

Inventors

Classifications

  • Design of libraries

  • G16B30/00 (Primary)

    ICT specially adapted for sequence analysis involving nucleotides or amino acids

  • Mutagenesis

  • ICT programming tools or database systems specially adapted for bioinformatics

  • Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10984890B2 cover?
Systems, methods, and devices for generating synthetic genomic datasets and validating bioinformatic pipelines for genomic analysis are disclosed. In preferred embodiments, synthetic maternal and paternal datasets with known variants are used with matched normal synthetic datasets to validate various bioinformatic pipelines. Bioinformatic pipelines are evaluated using the synthetic datasets to …
Who is the assignee on this patent?
Nantomics Llc
What technology area does this patent fall under?
Primary CPC classification G16B30/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 20, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).