Optimizing k-mer databases by k-mer subtraction

US11809498B2 · US · B2

Patent metadata
Publication number: US-11809498-B2
Application number: US-201916676607-A
Country: US
Kind code: B2
Filing date: Nov 7, 2019
Priority date: Nov 7, 2019
Publication date: Nov 7, 2023
Grant date: Nov 7, 2023

Abstract

Official abstract text for this publication.

Methods are disclosed for reducing the size of a k-mer reference database used for queries and/or taxonomic classifications when available computer storage and/or memory are inadequate. The k-mers of the reference database have been previously classified to a taxonomy, preferably based on genetic distances. In one method, the k-mers are separated into one or more groups followed by removing k-mers common to the groups. In another method, k-mers are removed based on a selected taxonomic threshold level. A third method combines the features of the previous two methods. The methods are adaptable to machine learning.
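The two reduction strategies in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation. The function names (`kmers`, `subtract_shared`, `prune_above`), the in-memory set and dict representations, and the rank ordering are all assumptions made for the example. In particular, "above the threshold" is read here as "closer to the root of the taxonomy than the threshold level".

```python
from collections import Counter

# Assumed rank order, broadest first (the levels named in claim 10).
RANKS = ["family", "genus", "species", "sub-species", "strain"]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_shared(groups):
    """Remove every k-mer that occurs in two or more groups.

    groups: dict mapping a group name to a set of k-mers.
    Returns (modified_groups, removed_kmers).
    """
    counts = Counter(kmer for members in groups.values() for kmer in members)
    shared = {kmer for kmer, n in counts.items() if n > 1}
    modified = {name: members - shared for name, members in groups.items()}
    return modified, shared


def prune_above(db, threshold):
    """Drop k-mers classified to ranks broader than the threshold.

    db: dict mapping a k-mer to the rank it was classified to.
    Keeps k-mers at the threshold rank or at any more specific rank.
    """
    cutoff = RANKS.index(threshold)
    return {kmer: rank for kmer, rank in db.items()
            if RANKS.index(rank) >= cutoff}


# Hypothetical toy data for two organisms, A and B.
groups = {"A": kmers("ACGTTT", 3), "B": kmers("CGTAAA", 3)}
modified, removed = subtract_shared(groups)

db = {"ACG": "family", "GTT": "genus", "TTT": "species", "AAA": "strain"}
pruned = prune_above(db, "genus")
```

In this toy example the only k-mer common to both groups is `CGT`, so it is subtracted from each. The `removed` set is returned rather than discarded because claims 5 and 6 describe storing the removed k-mers separately and using them to confirm an identification.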

First claim

Opening claim text (preview).

What is claimed is:

1. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising: providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system; in response to the database being greater than the available computer memory, dividing, by the computer system, the database into two or more independent groups of k-mers for at least organism A and organism B, wherein each of the groups comprises a unique set of nodes of the taxonomy, wherein all k-mers of a given node of nodes reside in only one of the groups and each of the groups is an independent data file; assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system; providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each of the nodes in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata; removing, by the computer system, k-mers common to two or more of the groups, thereby forming two or more modified groups comprising the organism A and the organism B, each of the modified groups containing a unique set of k-mers for the organism A and the organism B, each of the modified groups being an independent data file; using, by the computer system, the k-mers of the modified groups as reference k-mers for comparison to computer queries and/or taxonomic classifications of k-mers of a sample in order to reduce query time and reduce computer storage on the available computer memory of the computer system, the sample comprising taxonomically unclassified sequenced nucleic acids of one or more organisms, wherein the computer queries and/or taxonomic classifications identifies at least one of the organisms of the sample; generating a matrix M, wherein the matrix M includes genetic distances between genomes; and performing a hash to determine the genetic distances, wherein the database is associated with pointers which point to rows in the database and wherein the k-mers are associated with the genomes.

2. The method of claim 1, wherein the taxonomy is based on calculated genetic distances.

3. The method of claim 2, wherein the genetic distances are genome-genome distances calculated using the MinHash algorithm.

4. The method of claim 1, wherein the modified groups are stored on different computer nodes when used for said computer queries and/or for taxonomic classifications.

5. The method of claim 1, wherein the removed k-mers are stored on a computer node separate from the modified groups.

6. The method of claim 1, wherein the removed k-mers are used to confirm identification of an organism found in the queries and/or the classifications.

7. The method of claim 1, wherein the one or more organisms are microorganisms selected from the group consisting of bacteria, fungi, viruses, protozoans, parasites, and combinations thereof.

8. The method of claim 1, wherein the sample is selected from the group consisting of environmental samples, medical samples, and food samples.

9. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising: providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system; assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system; and in response to the database being greater than the available computer memory, removing, by the computer system, k-mers of the database that are classified to taxonomic levels above the threshold level, thereby forming a modified database having a size in bytes less than the database and suitable for the available computer memory of the computer system; using, by the computer system, the k-mers of the modified database as reference k-mers for comparison to computer queries and/or taxonomic classifications of k-mers of a sample in order to reduce query time and reduce computer storage on the available computer memory of the computer system, the sample comprising taxonomically unclassified sequenced nucleic acids of one or more organisms, wherein the computer queries and/or taxonomic classifications identifies at least one of the organisms of the sample; providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each node in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata; generating a matrix M, wherein the matrix M includes genetic distances between genomes; and performing a hash to determine the genetic distances, wherein the database is associated with pointers which point to rows in the database and wherein the k-mers are associated with the genomes.

10. The method of claim 9, wherein the taxonomic threshold level is selected from the group consisting of family, genus, species, sub-species, and strain.

11. The method of claim 9, wherein the taxonomic threshold level is selected by a machine using artificial intelligence.

12. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising: providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system; assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system; and in response to the database being greater than the available computer memory, removing, by the computer system, k-mers of the database that are classified to taxonomic levels above the threshold level, thereby forming a modified database; in response to the database being greater than the available computer memory, dividing, by the computer system, the modified database into two or more independent groups of k-mers for at least organism A and organism B, wherein each of the two or more groups comprises a unique set of nodes of the taxonomy and all k-mers of a given node of nodes reside in one of the groups, and wherein each of the groups is an independent data file; providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each of the nodes in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata; in response to the database being greater than the available computer memory, removing, by
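The claims recite a matrix M of genetic distances between genomes, and claim 3 names the MinHash algorithm for genome-genome distances. The sketch below shows one common way such a matrix could be built: a bottom-s MinHash sketch per genome and a Jaccard distance estimated from each pair of sketches. It is a hedged illustration, not the method of the patent. The hash function (BLAKE2b truncated to 64 bits), the sketch size, the helper names, and the use of plain Jaccard distance (1 minus the estimated Jaccard index) are all assumptions.

```python
import hashlib
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_sketch(kmer_set, size):
    """Bottom-`size` sketch: the `size` smallest 64-bit hashes of the k-mers."""
    hashes = sorted(
        int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        for kmer in kmer_set
    )
    return set(hashes[:size])


def jaccard_estimate(a, b, size):
    """Estimate the Jaccard index from two bottom-`size` sketches."""
    union_bottom = sorted(a | b)[:size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in a and h in b)
    return shared / len(union_bottom)


def distance_matrix(genomes, k, size):
    """Build a symmetric matrix M of estimated Jaccard distances.

    genomes: dict mapping a genome name to its sequence string.
    Returns (names, M), where M[i][j] is the distance between
    names[i] and names[j].
    """
    names = sorted(genomes)
    sketches = {n: minhash_sketch(kmers(genomes[n], k), size) for n in names}
    M = [[0.0] * len(names) for _ in names]
    for i, j in combinations(range(len(names)), 2):
        d = 1.0 - jaccard_estimate(sketches[names[i]], sketches[names[j]], size)
        M[i][j] = M[j][i] = d
    return names, M


# Hypothetical toy genomes; real inputs would be full assemblies.
genomes = {"g1": "ACGTTT", "g2": "ACGTTT", "g3": "GGGGCC"}
names, M = distance_matrix(genomes, k=3, size=16)
```

When the sketch size exceeds the number of distinct k-mers, as in this toy input, the estimate equals the exact Jaccard index. With real genomes the sketch is much smaller than the k-mer set, and the value is an approximation whose error shrinks as the sketch size grows.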

Assignees

Inventors

Classifications

  • G06F16/906 (Primary)

    Clustering; Classification · CPC title

  • using metadata automatically derived from the content · CPC title

  • Trees · CPC title

  • by searching ordered data, e.g. alpha-numerically ordered data · CPC title

  • Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11809498B2 cover?
Methods are disclosed for reducing the size of a k-mer reference database used for queries and/or taxonomic classifications when available computer storage and/or memory are inadequate. The k-mers of the reference database have been previously classified to a taxonomy, preferably based on genetic distances. In one method, the k-mers are separated into one or more groups followed by removing k-m…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/906. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 7, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).