US11830580B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11830580-B2 |
| Application number | US-201816147779-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 30, 2018 |
| Priority date | Sep 30, 2018 |
| Publication date | Nov 28, 2023 |
| Grant date | Nov 28, 2023 |
Official abstract text for this publication.
A large collection of sample genomes containing misclassified k-mers and metadata errors from a reference taxonomy was converted to a self-consistent k-mer database comprising a self-consistent taxonomy. The self-consistent taxonomy was based on genetic distances calculated using the MinHash method or the Meier-Kolthoff method. An agglomerative clustering algorithm was used to calculate the self-consistent taxonomy. Each k-mer of the sample genomes was assigned to only one node of the self-consistent taxonomy. In another step, each node of the self-consistent taxonomy was mapped to the reference taxonomy, thereby preserving in the self-consistent taxonomy links to the reference taxonomy while correcting for the misclassification errors therein. The self-consistent k-mer database can be used to taxonomically profile sequenced nucleic acids with greater specificity compared to systems relying on the reference taxonomy.
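The distance and clustering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the SHA-1 based bottom sketch, the Mash-style distance formula, the choice of UPGMA (average linkage) and the toy sequences are all assumptions made for illustration only.

```python
import hashlib
import math
import random
from itertools import combinations

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k, size):
    """Bottom-`size` MinHash sketch: the `size` smallest 64-bit k-mer hashes."""
    hashes = {int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
              for km in kmers(seq, k)}
    return sorted(hashes)[:size]

def jaccard_estimate(a, b, size):
    """Estimate Jaccard similarity from two bottom sketches."""
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate (assumed formula)."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)

def distance_matrix(seqs, k=11, size=200):
    """Pairwise genome-genome distances as a symmetric matrix."""
    sketches = [sketch(s, k, size) for s in seqs]
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = mash_distance(jaccard_estimate(sketches[i], sketches[j], size), k)
        dist[i][j] = dist[j][i] = d
    return dist

def upgma(names, dist):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)              # closest pair of clusters
        a, b = sorted(pair)
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:                    # size-weighted average distance
            d[frozenset((next_id, c))] = (
                na * d.pop(frozenset((a, c))) + nb * d.pop(frozenset((b, c)))
            ) / (na + nb)
        del d[pair]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    (tree, _), = clusters.values()
    return tree

# Toy data (hypothetical): g2 is a lightly mutated copy of g1, g3 is unrelated.
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(2000))
variant = list(base)
for pos in range(100, 2000, 200):
    variant[pos] = "C" if variant[pos] == "A" else "A"
other = "".join(rng.choice("ACGT") for _ in range(2000))

names = ["g1", "g2", "g3"]
dist = distance_matrix([base, "".join(variant), other])
tree = upgma(names, dist)
```

In this toy run the two near-identical sequences are merged first, so the resulting tree groups them by sequence similarity alone, without consulting any reference labels.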
Opening claim text (preview).
What is claimed is:

1. A method, comprising:
providing a reference database comprising reference k-mers, the reference k-mers derived from sequenced nucleic acids of one or more organisms, wherein the reference k-mers are classified to nodes of a reference taxonomy, the reference taxonomy not based on genetic distances, the nodes of the reference taxonomy representing genome classifications, the nodes of the reference taxonomy having unique reference IDs, wherein IDs means identifications;
providing a sample database comprising sample genomes that includes genomes of the one or more organisms;
calculating genetic distances of the sample genomes, thereby forming a distance matrix;
calculating a self-consistent taxonomy using the distance matrix;
constructing a self-consistent k-mer database comprising k-mers of the sample genomes, wherein the k-mers of the sample genomes are assigned to nodes of the self-consistent taxonomy based on genetic distance, the nodes of the self-consistent taxonomy assigned respective unique self-consistent IDs, and each of the k-mers of the sample genomes is linked to a respective one of the self-consistent IDs;
mapping the reference k-mers and associated reference IDs to the self-consistent k-mer database by storing, at each respective node of the self-consistent k-mer database, a 3-tuple comprising a k-mer, one or more respective reference IDs, and a single respective self-consistent ID, thereby linking reference IDs to self-consistent IDs, wherein each of the self-consistent IDs assigned to a leaf node of the self-consistent taxonomy is mapped to exactly one reference ID and each of the self-consistent IDs assigned to an internal node of the self-consistent taxonomy is mapped to each respective reference ID below the respective internal node;
calculating respective weights and/or respective probabilities of the mapped reference IDs based on the number of nodes of the self-consistent taxonomy linked to each of the mapped reference IDs, wherein each of the mapped reference IDs of a given node of the self-consistent taxonomy is assigned a calculated weight and/or a calculated probability;
compressing the self-consistent k-mer database by removing all nodes of the self-consistent k-mer database having one and only one reference ID so long as the respective parent node contains one and only one reference ID;
querying the self-consistent k-mer database for taxonomic profiling of a taxonomically unclassified k-mer of a sequenced nucleic acid; and
classifying the taxonomically unclassified k-mer to a node of the self-consistent k-mer database, wherein classifying the taxonomically unclassified k-mer comprises assigning both a self-consistent ID and a reference ID to the taxonomically unclassified k-mer.

2. The method of claim 1, wherein at least one of the k-mers of the reference database is misclassified in the reference taxonomy.

3. The method of claim 1, wherein at least one of the sample genomes is misclassified in the reference taxonomy.

4. The method of claim 1, wherein the method comprises condensing, into a single node, two or more nodes of the self-consistent taxonomy that share a common reference ID.

5. The method of claim 1, wherein the one or more organisms are prokaryotes.

6. The method of claim 1, wherein the self-consistent taxonomy is based exclusively on the calculated genetic distances.

7. The method of claim 1, wherein no two nodes of the self-consistent taxonomy are linked to an identical k-mer.

8. The method of claim 1, wherein the nodes of the self-consistent taxonomy comprise parent nodes linked to child nodes, and no two child nodes of a common parent node are linked to an identical reference ID.

9. The method of claim 1, wherein the genetic distances are selected from the group consisting of genome-genome distances, gene-gene distances, protein domain-protein domain distances, and protein-protein distances.

10. The method of claim 1, wherein the genetic distances are genome-genome distances calculated using the MinHash algorithm.

11. The method of claim 1, wherein the genetic distances are genome-genome distances calculated using the Meier-Kolthoff algorithm.

12. The method of claim 1, wherein the genetic distances are gene-gene distances calculated using Nei's standard genetic distance.

13. The method of claim 1, wherein the genetic distances are gene-gene distances calculated using pairwise distance method.

14. The method of claim 1, wherein the genetic distances are protein domain-protein domain distances.

15. The method of claim 1, wherein the k-mers are assigned to nodes of the self-consistent taxonomy using an agglomerative hierarchical algorithm.

16. The method of claim 1, wherein the agglomerative hierarchical algorithm is selected from the group consisting of i) single linkage (SLINK), ii) complete linkage (CLINK), iii) unweighted pair-group method using arithmetic averages (UPGMA), iv) weighted arithmetic average clustering (WPGMA), v) Ward method, vi) unweighted centroid clustering (UPGMC) and vii) weighted centroid clustering (WPGMC).

17. A system comprising one or more computer processor circuits configured and arranged to:
access a reference database comprising reference k-mers derived from sequenced nucleic acids of one or more organisms, wherein the reference k-mers are classified to nodes of a reference taxonomy, the reference taxonomy not based on genetic distances, the nodes of the reference taxonomy representing genome classifications, the nodes of the reference taxonomy having unique reference IDs, wherein IDs means identifications;
access a sample database comprising sample genomes that includes genomes of the one or more organisms;
calculate genetic distances of the sample genomes, thereby forming a distance matrix;
calculate a self-consistent taxonomy using the distance matrix;
construct a self-consistent k-mer database comprising k-mers of the sample genomes, wherein the k-mers of the sample genomes are assigned to nodes of the self-consistent taxonomy based on genetic distance, the nodes of the self-consistent taxonomy assigned respective unique self-consistent IDs, and each of the k-mers of the sample genomes is linked to a respective one of the self-consistent IDs;
map the reference k-mers to the k-mers of the self-consistent k-mer database by storing, at each respective node of the self-consistent k-mer database, a 3-tuple comprising a k-mer, one or more respective reference IDs, and a respective self-consistent ID, thereby mapping reference IDs to self-consistent IDs, wherein each of the self-consistent IDs assigned to a leaf node of the self-consistent taxonomy is mapped to exactly one reference ID and each of the self-consistent IDs assigned to an internal node of the self-consistent taxonomy is mapped to each respective reference ID below the respective internal node;
calculate respective weights and/or respective probabilities of the mapped reference IDs based on number of nodes of the self-consistent taxonomy linked to each of the mapped reference IDs, wherein each of the mapped reference IDs of a given node of the self-consistent taxonomy is assigned a calculated weight and/or a calculated probability;
compress the self-consistent k-mer database by removing all nodes of the self-consistent k-mer database having one and only one reference ID so long as the respective parent node contains one and only one reference ID;
query the self-consistent k-mer database for taxonomic profiling of a taxonomically unclassified k-mer of a sequenced nucleic acid; and
classify the taxonomi
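The mapping, weighting and classification steps recited in the claims can be pictured with a small Python sketch. It is one possible reading, not the patented implementation: the `Node` class, the helper names, the reference IDs `R1` and `R2`, the toy k-mers, and the choice to weight each reference ID by its share of leaf nodes below a given node are all assumptions. The compression step is not sketched here.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    sc_id: int                                   # unique self-consistent ID
    children: list = field(default_factory=list)
    ref_id: Optional[str] = None                 # set on leaf nodes only

def ref_counts(node):
    """Count the leaves below `node` (inclusive) per reference ID."""
    if not node.children:
        return Counter({node.ref_id: 1})
    total = Counter()
    for child in node.children:
        total += ref_counts(child)
    return total

def ref_weights(node):
    """Assumed weighting: fraction of leaves below `node` per reference ID."""
    counts = ref_counts(node)
    n = sum(counts.values())
    return {ref: c / n for ref, c in counts.items()}

def index_nodes(node, out=None):
    """Map each self-consistent ID to its node."""
    out = {} if out is None else out
    out[node.sc_id] = node
    for child in node.children:
        index_nodes(child, out)
    return out

def build_records(root, kmer_to_sc):
    """Store one 3-tuple (k-mer, reference IDs, self-consistent ID) per k-mer."""
    nodes = index_nodes(root)
    records = {}
    for kmer, sc_id in kmer_to_sc.items():
        ref_ids = tuple(sorted(ref_counts(nodes[sc_id])))
        records[kmer] = (kmer, ref_ids, sc_id)
    return records

def classify(kmer, records):
    """Return (self-consistent ID, reference IDs) for a query k-mer, or None."""
    rec = records.get(kmer)
    if rec is None:
        return None
    _, ref_ids, sc_id = rec
    return sc_id, ref_ids

# Hypothetical toy taxonomy: leaves 3 and 4 map to R1, leaf 2 maps to R2.
root = Node(0, [Node(1, [Node(3, ref_id="R1"), Node(4, ref_id="R1")]),
                Node(2, ref_id="R2")])
# Each k-mer is linked to exactly one self-consistent ID.
kmer_to_sc = {"ACGTA": 3, "TTGCA": 1, "GGGTT": 0}

records = build_records(root, kmer_to_sc)
root_weights = ref_weights(root)
hit = classify("ACGTA", records)
miss = classify("AAAAA", records)
```

In this toy taxonomy a leaf node carries exactly one reference ID, while an internal node carries every reference ID found below it, so a k-mer stored at the root resolves to both `R1` and `R2` and the weights indicate how the leaves below are split between them.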
ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title
Clustering or classification · CPC title
ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title
Sequence alignment; Homology search · CPC title
Ontologies; Annotations · CPC title