Basecaller for DNA sequencing using machine learning

US10068053B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10068053-B2
Application number: US-201414571022-A
Country: US
Kind code: B2
Filing date: Dec 15, 2014
Priority date: Dec 16, 2013
Publication date: Sep 4, 2018
Grant date: Sep 4, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.
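The last sentence of the abstract (intensity signals from multiple cycles and from neighboring nucleic acids) can be pictured as assembling one input vector per position. The sketch below is only an illustration under stated assumptions: the four-channel layout, the zero padding at run boundaries, and the names `build_features`, `window`, and `channels` are hypothetical and do not come from the patent text.

```python
from typing import List, Sequence

# Assumed layout: one read is a list of cycles, each cycle a list of
# per-channel intensity values (four channels here, an assumption).
Read = Sequence[Sequence[float]]


def build_features(target: Read, neighbors: Sequence[Read], cycle: int,
                   window: int = 1, channels: int = 4) -> List[float]:
    """Flatten intensities around `cycle` into one model input vector.

    Takes the target read's intensities for cycles
    [cycle - window, cycle + window], then each neighbor's intensities
    at the same cycle. Missing cycles are zero-padded (an assumption).
    """
    features: List[float] = []
    for c in range(cycle - window, cycle + window + 1):
        if 0 <= c < len(target):
            features.extend(target[c])
        else:
            features.extend([0.0] * channels)
    for neighbor in neighbors:
        if 0 <= cycle < len(neighbor):
            features.extend(neighbor[cycle])
        else:
            features.extend([0.0] * channels)
    return features


target_read = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
neighbor_read = [[0.0, 0.0, 0.0, 1.0]] * 3
vector = build_features(target_read, [neighbor_read], cycle=0, window=1)
```

With `window=1` and one neighbor, the vector holds three cycles of the target read plus one cycle of the neighbor, so its length is 16. A trained model would consume such a vector; the model itself is outside this sketch.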

First claim

Opening claim text (preview).

What is claimed is:

1. A method of calling one or more bases for a nucleic acid of an organism, the method comprising: receiving, at a computer system, a basecalling model, the basecalling model configured to: receive inputs of intensity values for bases at one or more positions on a nucleic acid, and output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids; receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the one or more first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid; for each of N positions of the first test nucleic acid: identifying intensity values corresponding to the position; determining, by the computer system, a first base call at a first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer greater than 1, wherein the basecalling model provides scores for each of a plurality of bases, and wherein determining the first base call includes: calculating, by the computer system, scores for each of the plurality of bases at the first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions; and calling, by the computer system, the base corresponding to a highest score for the first position when the highest score satisfies one or more criteria; and calling a base at M positions based on the scores at the N positions, where M is less than or equal to N and greater than one.

2. The method of claim 1, wherein an intensity value corresponds to a plurality of positions, and each score corresponds to the plurality of positions or to a particular base at one of the plurality of positions.

3. The method of claim 1, wherein the basecalling model includes a neural network.

4. The method of claim 3, wherein the neural network outputs raw scores, and wherein the basecalling model includes a post-processing function that modifies the raw scores.

5. The method of claim 3, wherein the basecalling model includes a plurality of neural networks, the method further comprising: for each of the plurality of bases: determining a respective score using each of the plurality of neural networks; calculating a combined score from the respective scores; and using the combined score as the score for the base at the first position.

6. The method of claim 1, wherein each intensity value corresponds to one base, and wherein multiple intensity values corresponds to one base.

7. The method of claim 1, further comprising: performing the second sequencing process on the test nucleic acids.

8. The method of claim 1, wherein the N positions are not sequential.

9. The method of claim 1, wherein the basecalling model includes a plurality of intermediate models, the method further comprising: for each of the intermediate models: making a respective base call; determining a consensus base call from the respective base calls; and using the consensus base call for the first position.

10. The method of claim 1, wherein the basecalling model is further configured to receive inputs of intensity values for one or more neighboring nucleic acids that neighbor the first test nucleic acid.

11. The method of claim 10, wherein the intensity values for one or more neighboring nucleic acids are for a same cycle as the first position of the first test nucleic acid.

12. The method of claim 10, wherein the one or more neighboring nucleic acids are within a specified distance.

13. The method of claim 12, wherein the first nucleic acid and the one or more neighboring nucleic acids are on an ordered lattice, and wherein the specified distance is a number of lattice points separating the first test nucleic acid and the one or more neighboring nucleic acids.

14. The method of claim 12, wherein the first nucleic acid and the one or more neighboring nucleic acids are not ordered, and wherein the specified distance is a length.

15. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a processor to perform the method of claim 1.

16. The method of claim 1, further comprising creating the basecalling model by: receiving sequencing data of training nucleic acids from the one or more first sequencing processes, the sequencing data including intensity values for bases at positions of the training nucleic acids, the training nucleic acids being from one or more training samples; for each of a set of the training nucleic acids: performing an initial base call at positions of the training nucleic acid to obtain an initial sequence based at least on the intensity values at the positions of the training nucleic acid; and determining an assumed sequence corresponding to the initial sequence, wherein the assumed sequence is assumed to be a correct sequence for the positions of the training nucleic acid; and generating the basecalling model using the assumed sequences and the intensity values corresponding to the assumed sequences.

17. A method of calling one or more bases for a nucleic acid of an organism, the method comprising: receiving, at a computer system, a basecalling model, the basecalling model configured to: receive inputs of intensity values for bases at one or more positions on a nucleic acid, and output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids; receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the one or more first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid; for each of N positions of the first test nucleic acid: identifying intensity values corresponding to the position; determining, by the computer system, a first base call at a first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer equal to or greater than 1, wherein the basecalling model provides scores for each of a plurality of bases, and wherein determining the first base call includes: calculating, by the computer system, scores for each of the plurality of bases at the first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions; and calling, by the computer system, the base corresponding to a highest score for the first position when the highest score satisfies one or more criteria, and wherein the one or more criteria include at least one of: the highest score being greater than a first threshold, and a difference between the highest score and a next highest score being greater than a second threshold.

18. The meth
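The scoring steps in claims 1, 5, and 17 (per-base scores, a combined score from several networks, and a call made only when the top score meets the criteria) can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the claims do not say how scores are combined (a plain average is used below), the threshold values are invented, and requiring both criteria at once is just one of the options that "at least one of" allows. The names `combine_scores` and `call_base` are hypothetical.

```python
from typing import Dict, Optional, Sequence

BASES = "ACGT"


def combine_scores(per_model_scores: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-base scores from several models.

    Averaging is an assumption; claim 5 only requires "a combined score".
    """
    count = len(per_model_scores)
    return {base: sum(scores[base] for scores in per_model_scores) / count
            for base in BASES}


def call_base(scores: Dict[str, float],
              min_score: float = 0.5,    # hypothetical "first threshold"
              min_margin: float = 0.2    # hypothetical "second threshold"
              ) -> Optional[str]:
    """Return the top-scoring base, or None (a no-call) if criteria fail.

    This sketch requires both criteria; the claim language would also be
    satisfied by applying only one of them.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_base, top), (_, runner_up) = ranked[0], ranked[1]
    if top > min_score and (top - runner_up) > min_margin:
        return best_base
    return None


network_outputs = [
    {"A": 0.8, "C": 0.2, "G": 0.0, "T": 0.0},
    {"A": 0.6, "C": 0.4, "G": 0.0, "T": 0.0},
]
combined = combine_scores(network_outputs)
confident_call = call_base(combined)
ambiguous_call = call_base({"A": 0.45, "C": 0.40, "G": 0.10, "T": 0.05})
```

In this example the combined score for A is about 0.7 with a margin of about 0.4 over C, so the sketch calls A; the second input fails both thresholds and is left as a no-call.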

Assignees

Complete Genomics Inc

Inventors

Classifications

  • G16B40/00 (Primary)

    ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

  • ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

  • Physics · mapped topic

  • G06F19/24 (Primary)

    Physics · mapped topic

  • G16B40/20 (Primary)

    Supervised data analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10068053B2 cover?
Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as th…
Who is the assignee on this patent?
Complete Genomics Inc
What technology area does this patent fall under?
Primary CPC classification G16B40/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 4, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).