What technology area does this patent fall under?

Primary CPC classification G16B40/00. Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 4, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Basecaller for DNA sequencing using machine learning

US10068053B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10068053-B2
Application number	US-201414571022-A
Country	US
Kind code	B2
Filing date	Dec 15, 2014
Priority date	Dec 16, 2013
Publication date	Sep 4, 2018
Grant date	Sep 4, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.
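As a rough illustration of the abstract's idea of combining intensity signals from multiple cycles and from neighboring nucleic acids, the sketch below assembles one input vector per position and pairs it with the base from an assumed sequence. Everything here is hypothetical and not taken from the patent: the four-channel (A, C, G, T) data layout, the names `build_features` and `make_training_set`, the `half_window` parameter, and the zero-padding at the ends of a read are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed layout: intensities[spot][cycle] holds four channel values (A, C, G, T)
# for one nucleic acid ("spot") at one sequencing cycle.
Channels = Tuple[float, float, float, float]
Intensities = Dict[str, List[Channels]]


def build_features(
    intensities: Intensities,
    spot: str,
    cycle: int,
    neighbors: Sequence[str],
    half_window: int = 1,
) -> List[float]:
    """Flatten intensities from nearby cycles and neighboring spots into one vector."""
    n_cycles = len(intensities[spot])
    features: List[float] = []
    # Cycles around the target cycle; cycles outside the read are zero-padded.
    for c in range(cycle - half_window, cycle + half_window + 1):
        if 0 <= c < n_cycles:
            features.extend(intensities[spot][c])
        else:
            features.extend((0.0, 0.0, 0.0, 0.0))
    # Same-cycle intensities from each neighboring spot.
    for nb in neighbors:
        features.extend(intensities[nb][cycle])
    return features


def make_training_set(
    intensities: Intensities,
    assumed_sequences: Dict[str, str],
    neighbors_of: Dict[str, List[str]],
    half_window: int = 1,
) -> List[Tuple[List[float], str]]:
    """Pair each feature vector with the base taken from the assumed sequence."""
    examples = []
    for spot, sequence in assumed_sequences.items():
        for cycle, base in enumerate(sequence):
            x = build_features(
                intensities, spot, cycle, neighbors_of.get(spot, []), half_window
            )
            examples.append((x, base))
    return examples
```

The resulting `(features, base)` pairs could be fed to any supervised classifier; the sketch stops short of training so that it does not suggest a particular model beyond what the abstract states.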

First claim

Opening claim text (preview).

What is claimed is:

1. A method of calling one or more bases for a nucleic acid of an organism, the method comprising: receiving, at a computer system, a basecalling model, the basecalling model configured to: receive inputs of intensity values for bases at one or more positions on a nucleic acid, and output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids; receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the one or more first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid; for each of N positions of the first test nucleic acid: identifying intensity values corresponding to the position; determining, by the computer system, a first base call at a first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer greater than 1, wherein the basecalling model provides scores for each of a plurality of bases, and wherein determining the first base call includes: calculating, by the computer system, scores for each of the plurality of bases at the first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions; and calling, by the computer system, the base corresponding to a highest score for the first position when the highest score satisfies one or more criteria; and calling a base at M positions based on the scores at the N positions, where M is less than or equal to N and greater than one.
2. The method of claim 1, wherein an intensity value corresponds to a plurality of positions, and each score corresponds to the plurality of positions or to a particular base at one of the plurality of positions.
3. The method of claim 1, wherein the basecalling model includes a neural network.
4. The method of claim 3, wherein the neural network outputs raw scores, and wherein the basecalling model includes a post-processing function that modifies the raw scores.
5. The method of claim 3, wherein the basecalling model includes a plurality of neural networks, the method further comprising: for each of the plurality of bases: determining a respective score using each of the plurality of neural networks; calculating a combined score from the respective scores; and using the combined score as the score for the base at the first position.
6. The method of claim 1, wherein each intensity value corresponds to one base, and wherein multiple intensity values corresponds to one base.
7. The method of claim 1, further comprising: performing the second sequencing process on the test nucleic acids.
8. The method of claim 1, wherein the N positions are not sequential.
9. The method of claim 1, wherein the basecalling model includes a plurality of intermediate models, the method further comprising: for each of the intermediate models: making a respective base call; determining a consensus base call from the respective base calls; and using the consensus base call for the first position.
10. The method of claim 1, wherein the basecalling model is further configured to receive inputs of intensity values for one or more neighboring nucleic acids that neighbor the first test nucleic acid.
11. The method of claim 10, wherein the intensity values for one or more neighboring nucleic acids are for a same cycle as the first position of the first test nucleic acid.
12. The method of claim 10, wherein the one or more neighboring nucleic acids are within a specified distance.
13. The method of claim 12, wherein the first nucleic acid and the one or more neighboring nucleic acids are on an ordered lattice, and wherein the specified distance is a number of lattice points separating the first test nucleic acid and the one or more neighboring nucleic acids.
14. The method of claim 12, wherein the first nucleic acid and the one or more neighboring nucleic acids are not ordered, and wherein the specified distance is a length.
15. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a processor to perform the method of claim 1.
16. The method of claim 1, further comprising creating the basecalling model by: receiving sequencing data of training nucleic acids from the one or more first sequencing processes, the sequencing data including intensity values for bases at positions of the training nucleic acids, the training nucleic acids being from one or more training samples; for each of a set of the training nucleic acids: performing an initial base call at positions of the training nucleic acid to obtain an initial sequence based at least on the intensity values at the positions of the training nucleic acid; and determining an assumed sequence corresponding to the initial sequence, wherein the assumed sequence is assumed to be a correct sequence for the positions of the training nucleic acid; and generating the basecalling model using the assumed sequences and the intensity values corresponding to the assumed sequences.
17. A method of calling one or more bases for a nucleic acid of an organism, the method comprising: receiving, at a computer system, a basecalling model, the basecalling model configured to: receive inputs of intensity values for bases at one or more positions on a nucleic acid, and output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids; receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the one or more first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid; for each of N positions of the first test nucleic acid: identifying intensity values corresponding to the position; determining, by the computer system, a first base call at a first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer equal to or greater than 1, wherein the basecalling model provides scores for each of a plurality of bases, and wherein determining the first base call includes: calculating, by the computer system, scores for each of the plurality of bases at the first position of the N positions using the basecalling model based on inputs of the intensity values for the N positions; and calling, by the computer system, the base corresponding to a highest score for the first position when the highest score satisfies one or more criteria, and wherein the one or more criteria include at least one of: the highest score being greater than a first threshold, and a difference between the highest score and a next highest score being greater than a second threshold.
18. The meth
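The score-based calling step in claims 5 and 17 can be pictured with a small sketch. It is only an illustration under stated assumptions: the claims do not say how scores from several networks are combined (plain averaging is assumed here), the threshold values are made up, the function names `combine_scores` and `call_base` are hypothetical, and returning `None` for a no-call is a choice of this example. Claim 17 requires at least one of the two criteria; this sketch applies both.

```python
from typing import Dict, List, Optional

BASES = ("A", "C", "G", "T")


def combine_scores(per_model_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-base scores from several models (averaging is an assumption)."""
    n = len(per_model_scores)
    return {base: sum(s[base] for s in per_model_scores) / n for base in BASES}


def call_base(
    scores: Dict[str, float],
    first_threshold: float = 0.6,   # illustrative value, not from the patent
    second_threshold: float = 0.2,  # illustrative value, not from the patent
) -> Optional[str]:
    """Return the top-scoring base if it passes both criteria, else None (no-call)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_base, best_score = ranked[0]
    next_score = ranked[1][1]
    passes_absolute = best_score > first_threshold
    passes_margin = (best_score - next_score) > second_threshold
    if passes_absolute and passes_margin:
        return best_base
    return None
```

For example, scores of 0.9 for A and small values for the other bases would yield a call of `"A"`, while scores of 0.5 and 0.45 for the top two bases would yield no call under these example thresholds.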

Assignees

Complete Genomics Inc

Inventors

Classifications

Code	Role	Description
G16B40/00	Primary	ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding (CPC title)
G16B30/00		ICT specially adapted for sequence analysis involving nucleotides or amino acids (CPC title)
G06F19/22		Physics (mapped topic)
G06F19/24	Primary	Physics (mapped topic)
G16B40/20	Primary	Supervised data analysis (CPC title)

Patent family

Related publications grouped by family.

View patent family 53368796

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10068053B2 cover?: Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as th…
Who is the assignee on this patent?: Complete Genomics Inc