K-mer database for organism identification

US11830580B2 · US · B2

Patent metadata
Publication number: US-11830580-B2
Application number: US-201816147779-A
Country: US
Kind code: B2
Filing date: Sep 30, 2018
Priority date: Sep 30, 2018
Publication date: Nov 28, 2023
Grant date: Nov 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A large collection of sample genomes containing misclassified k-mers and metadata errors from a reference taxonomy was converted to a self-consistent k-mer database comprising a self-consistent taxonomy. The self-consistent taxonomy was based on genetic distances calculated using the MinHash method or the Meier-Koltoff method. An agglomerative clustering algorithm was used to calculate the self-consistent taxonomy. Each k-mer of the sample genomes was assigned to only one node of the self-consistent taxonomy. In another step, each node of the self-consistent taxonomy was mapped to the reference taxonomy, thereby preserving in the self-consistent taxonomy links to the reference taxonomy while correcting for the misclassification errors therein. The self-consistent k-mer database can be used to taxonomically profile sequenced nucleic acids with greater specificity compared to systems relying on the reference taxonomy.
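The distance and clustering steps summarized in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the function names, the toy sequences, the k-mer length, the sketch size, and the use of 1 minus the estimated Jaccard similarity as the distance are all assumptions made for illustration. Average linkage (UPGMA) is used here only because it is one of the agglomerative methods the claims list.

```python
import hashlib
from itertools import combinations

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=4, size=16):
    """Bottom-`size` MinHash sketch: the smallest hash values of the k-mer set."""
    hashes = {int(hashlib.sha1(km.encode()).hexdigest()[:16], 16)
              for km in kmers(seq, k)}
    return sorted(hashes)[:size]

def minhash_distance(a, b, size=16):
    """1 - Jaccard similarity, estimated from two bottom-k sketches (assumed metric)."""
    merged = sorted(set(a) | set(b))[:size]
    shared = len(set(merged) & set(a) & set(b))
    return 1.0 - shared / len(merged)

def average_linkage(names, dist):
    """Naive UPGMA: repeatedly merge the two clusters with the smallest
    mean pairwise distance. Returns the tree as nested tuples."""
    clusters = {name: [name] for name in names}  # subtree -> leaves below it

    def mean_dist(x, y):
        pairs = [frozenset((a, b)) for a in clusters[x] for b in clusters[y]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: mean_dist(*pair))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))

# Hypothetical toy "genomes"; real inputs would be full assemblies.
genomes = {"g1": "ACGTACGTAC", "g2": "ACGTACGTACGT", "g3": "TTTTGGGGTTTT"}
sketches = {name: sketch(seq) for name, seq in genomes.items()}
dist = {frozenset((a, b)): minhash_distance(sketches[a], sketches[b])
        for a, b in combinations(genomes, 2)}
tree = average_linkage(list(genomes), dist)
print(tree)
```

In this toy data g1 and g2 contain the same set of 4-mers and g3 shares none with them, so g1 and g2 are merged first and g3 joins at the root. A production pipeline would more likely use a dedicated MinHash tool and a library clustering routine than this quadratic loop.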

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising:
    providing a reference database comprising reference k-mers, the reference k-mers derived from sequenced nucleic acids of one or more organisms, wherein the reference k-mers are classified to nodes of a reference taxonomy, the reference taxonomy not based on genetic distances, the nodes of the reference taxonomy representing genome classifications, the nodes of the reference taxonomy having unique reference IDs, wherein IDs means identifications;
    providing a sample database comprising sample genomes that includes genomes of the one or more organisms;
    calculating genetic distances of the sample genomes, thereby forming a distance matrix;
    calculating a self-consistent taxonomy using the distance matrix;
    constructing a self-consistent k-mer database comprising k-mers of the sample genomes, wherein the k-mers of the sample genomes are assigned to nodes of the self-consistent taxonomy based on genetic distance, the nodes of the self-consistent taxonomy assigned respective unique self-consistent IDs, and each of the k-mers of the sample genomes is linked to a respective one of the self-consistent IDs;
    mapping the reference k-mers and associated reference IDs to the self-consistent k-mer database by storing, at each respective node of the self-consistent k-mer database, a 3-tuple comprising a k-mer, one or more respective reference IDs, and a single respective self-consistent ID, thereby linking reference IDs to self-consistent IDs, wherein each of the self-consistent IDs assigned to a leaf node of the self-consistent taxonomy is mapped to exactly one reference ID and each of the self-consistent IDs assigned to an internal node of the self-consistent taxonomy is mapped to each respective reference ID below the respective internal node;
    calculating respective weights and/or respective probabilities of the mapped reference IDs based on the number of nodes of the self-consistent taxonomy linked to each of the mapped reference IDs, wherein each of the mapped reference IDs of a given node of the self-consistent taxonomy is assigned a calculated weight and/or a calculated probability;
    compressing the self-consistent k-mer database by removing all nodes of the self-consistent k-mer database having one and only one reference ID so long as the respective parent node contains one and only one reference ID;
    querying the self-consistent k-mer database for taxonomic profiling of a taxonomically unclassified k-mer of a sequenced nucleic acid; and
    classifying the taxonomically unclassified k-mer to a node of the self-consistent k-mer database, wherein classifying the taxonomically unclassified k-mer comprises assigning both a self-consistent ID and a reference ID to the taxonomically unclassified k-mer.

2. The method of claim 1, wherein at least one of the k-mers of the reference database is misclassified in the reference taxonomy.

3. The method of claim 1, wherein at least one of the sample genomes is misclassified in the reference taxonomy.

4. The method of claim 1, wherein the method comprises condensing, into a single node, two or more nodes of the self-consistent taxonomy that share a common reference ID.

5. The method of claim 1, wherein the one or more organisms are prokaryotes.

6. The method of claim 1, wherein the self-consistent taxonomy is based exclusively on the calculated genetic distances.

7. The method of claim 1, wherein no two nodes of the self-consistent taxonomy are linked to an identical k-mer.

8. The method of claim 1, wherein the nodes of the self-consistent taxonomy comprise parent nodes linked to child nodes, and no two child nodes of a common parent node are linked to an identical reference ID.

9. The method of claim 1, wherein the genetic distances are selected from the group consisting of genome-genome distances, gene-gene distances, protein domain-protein domain distances, and protein-protein distances.

10. The method of claim 1, wherein the genetic distances are genome-genome distances calculated using the MinHash algorithm.

11. The method of claim 1, wherein the genetic distances are genome-genome distances calculated using the Meier-Koltoff algorithm.

12. The method of claim 1, wherein the genetic distances are gene-gene distances calculated using Nei's standard genetic distance.

13. The method of claim 1, wherein the genetic distances are gene-gene distances calculated using pairwise distance method.

14. The method of claim 1, wherein the genetic distances are protein domain-protein domain distances.

15. The method of claim 1, wherein the k-mers are assigned to nodes of the self-consistent taxonomy using an agglomerative hierarchical algorithm.

16. The method of claim 1, wherein the agglomerative hierarchical algorithm is selected from the group consisting of i) single linkage (SLINK), ii) complete linkage (CLINK), iii) unweighted pair-group method using arithmetic averages (UPGMA), iv) weighted arithmetic average clustering (WPGMA), v) Ward method, vi) unweighted centroid clustering (UPGMC) and vii) weighted centroid clustering (WPGMC).

17. A system comprising one or more computer processor circuits configured and arranged to:
    access a reference database comprising reference k-mers derived from sequenced nucleic acids of one or more organisms, wherein the reference k-mers are classified to nodes of a reference taxonomy, the reference taxonomy not based on genetic distances, the nodes of the reference taxonomy representing genome classifications, the nodes of the reference taxonomy having unique reference IDs, wherein IDs means identifications;
    access a sample database comprising sample genomes that includes genomes of the one or more organisms;
    calculate genetic distances of the sample genomes, thereby forming a distance matrix;
    calculate a self-consistent taxonomy using the distance matrix;
    construct a self-consistent k-mer database comprising k-mers of the sample genomes, wherein the k-mers of the sample genomes are assigned to nodes of the self-consistent taxonomy based on genetic distance, the nodes of the self-consistent taxonomy assigned respective unique self-consistent IDs, and each of the k-mers of the sample genomes is linked to a respective one of the self-consistent IDs;
    map the reference k-mers to the k-mers of the self-consistent k-mer database by storing, at each respective node of the self-consistent k-mer database, a 3-tuple comprising a k-mer, one or more respective reference IDs, and a respective self-consistent ID, thereby mapping reference IDs to self-consistent IDs, wherein each of the self-consistent IDs assigned to a leaf node of the self-consistent taxonomy is mapped to exactly one reference ID and each of the self-consistent IDs assigned to an internal node of the self-consistent taxonomy is mapped to each respective reference ID below the respective internal node;
    calculate respective weights and/or respective probabilities of the mapped reference IDs based on number of nodes of the self-consistent taxonomy linked to each of the mapped reference IDs, wherein each of the mapped reference IDs of a given node of the self-consistent taxonomy is assigned a calculated weight and/or a calculated probability;
    compress the self-consistent k-mer database by removing all nodes of the self-consistent k-mer database having one and only one reference ID so long as the respective parent node contains one and only one reference ID;
    query the self-consistent k-mer database for taxonomic profiling of a taxonomically unclassified k-mer of a sequenced nucleic acid; and
    classify the taxonomi
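The database-construction, mapping, weighting, and classification steps of claim 1 can be sketched on toy data. This is one possible reading, not the patented implementation: the tree, the k-mer sets, the `ref:`/`sc:` identifiers, and every function name are hypothetical. Placing each k-mer at the lowest node that covers all genomes containing it, and weighting each reference ID by the fraction of leaves below a node that carry it, are both assumptions. The compression step is not shown.

```python
from collections import Counter

# Hypothetical inputs: a tree of genome names (for example from a clustering
# step), each genome's k-mers, and each genome's ID in the reference taxonomy.
tree = (("g1", "g2"), "g3")
genome_kmers = {"g1": {"ACGT", "CGTA"}, "g2": {"ACGT", "GTAC"}, "g3": {"TTTT", "CGTA"}}
reference_id = {"g1": "ref:100", "g2": "ref:100", "g3": "ref:200"}

nodes = {}  # self-consistent ID -> {"parent": ID or None, "leaves": [genomes]}

def number_nodes(subtree, parent=None):
    """Assign sequential self-consistent IDs in pre-order; return this node's ID."""
    sc_id = f"sc:{len(nodes)}"
    nodes[sc_id] = {"parent": parent, "leaves": []}
    if isinstance(subtree, str):
        nodes[sc_id]["leaves"] = [subtree]
    else:
        for child in subtree:
            child_id = number_nodes(child, sc_id)
            nodes[sc_id]["leaves"] += nodes[child_id]["leaves"]
    return sc_id

root = number_nodes(tree)

def lowest_common_node(genomes):
    """Smallest node whose leaves include every genome in `genomes`."""
    covering = [n for n, info in nodes.items() if genomes <= set(info["leaves"])]
    return min(covering, key=lambda n: len(nodes[n]["leaves"]))

def reference_weights(sc_id):
    """Assumed weighting: share of leaves under the node carrying each reference ID."""
    counts = Counter(reference_id[g] for g in nodes[sc_id]["leaves"])
    total = sum(counts.values())
    return {ref: n / total for ref, n in counts.items()}

# One 3-tuple per k-mer: (k-mer, weighted reference IDs, single self-consistent ID).
database = {}
for kmer in set().union(*genome_kmers.values()):
    holders = {g for g, ks in genome_kmers.items() if kmer in ks}
    sc_id = lowest_common_node(holders)
    database[kmer] = (kmer, reference_weights(sc_id), sc_id)

def classify(kmer):
    """Return (self-consistent ID, highest-weight reference ID), or None if unknown."""
    if kmer not in database:
        return None
    _, weights, sc_id = database[kmer]
    return sc_id, max(weights, key=weights.get)

print(classify("ACGT"))  # k-mer shared by g1 and g2
print(classify("CGTA"))  # k-mer shared by g1 and g3, so it lands on the root
```

Because every k-mer is stored under exactly one self-consistent ID, a lookup returns both identifiers at once, which is the dual assignment the final step of claim 1 describes.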

Assignees

IBM, Mars Inc

Inventors

Classifications

  • G16B30/00 (Primary): ICT specially adapted for sequence analysis involving nucleotides or amino acids

  • Clustering or classification

  • ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

  • G16B30/10 (Primary): Sequence alignment; Homology search

  • Ontologies; Annotations

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11830580B2 cover?
A large collection of sample genomes containing misclassified k-mers and metadata errors from a reference taxonomy was converted to a self-consistent k-mer database comprising a self-consistent taxonomy. The self-consistent taxonomy was based on genetic distances calculated using the MinHash method or the Meier-Koltoff method. An agglomerative clustering algorithm was used to calculate the self…
Who is the assignee on this patent?
IBM, Mars Inc
What technology area does this patent fall under?
Primary CPC classification G16B30/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 28, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).