Methods for comparative metagenomic analysis
US-2021249102-A1 · Aug 12, 2021 · US
US11809498B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11809498-B2 |
| Application number | US-201916676607-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 7, 2019 |
| Priority date | Nov 7, 2019 |
| Publication date | Nov 7, 2023 |
| Grant date | Nov 7, 2023 |
Official abstract text for this publication.
Methods are disclosed for reducing the size of a k-mer reference database used for queries and/or taxonomic classifications when available computer storage and/or memory are inadequate. The k-mers of the reference database have been previously classified to a taxonomy, preferably based on genetic distances. In one method, the k-mers are separated into one or more groups followed by removing k-mers common to the groups. In another method, k-mers are removed based on a selected taxonomic threshold level. A third method combines the features of the previous two methods. The methods are adaptable to machine learning.
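The first two reduction methods in the abstract can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the patented implementation: the function names `remove_common_kmers` and `prune_above_threshold`, the toy data layout (groups as sets of k-mer strings, and a dictionary from k-mer to the rank it was classified to), and the rank ordering in `RANKS` are all hypothetical.

```python
from collections import Counter

# Assumed rank ordering, most specific first (hypothetical, for illustration).
RANKS = ["strain", "sub-species", "species", "genus", "family"]


def remove_common_kmers(groups):
    """Drop every k-mer that occurs in two or more groups.

    groups: dict mapping group name -> set of k-mer strings.
    Returns (modified_groups, removed_kmers).
    """
    # Each group is a set, so this counts how many groups contain each k-mer.
    counts = Counter(kmer for kmers in groups.values() for kmer in kmers)
    common = {kmer for kmer, n in counts.items() if n >= 2}
    modified = {name: kmers - common for name, kmers in groups.items()}
    return modified, common


def prune_above_threshold(kmer_rank, threshold):
    """Drop k-mers classified to a rank above the threshold rank.

    kmer_rank: dict mapping k-mer string -> rank it was classified to.
    """
    limit = RANKS.index(threshold)
    return {k: r for k, r in kmer_rank.items() if RANKS.index(r) <= limit}


if __name__ == "__main__":
    groups = {
        "group_1": {"ACGT", "CGTA", "GTAC"},
        "group_2": {"GTAC", "TACG"},
    }
    modified, removed = remove_common_kmers(groups)
    print(modified, removed)

    ranks = {"ACGT": "species", "CGTA": "genus", "GTAC": "family"}
    print(prune_above_threshold(ranks, "genus"))
```

Returning the removed k-mers alongside the modified groups keeps them available for separate storage. The third method would, under the same assumptions, apply `prune_above_threshold` first and then group and filter the remainder.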
Opening claim text (preview).
What is claimed is:
1. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising:
providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system;
in response to the database being greater than the available computer memory, dividing, by the computer system, the database into two or more independent groups of k-mers for at least organism A and organism B, wherein each of the groups comprises a unique set of nodes of the taxonomy, wherein all k-mers of a given node of nodes reside in only one of the groups and each of the groups is an independent data file;
assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system;
providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each of the nodes in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata;
removing, by the computer system, k-mers common to two or more of the groups, thereby forming two or more modified groups comprising the organism A and the organism B, each of the modified groups containing a unique set of k-mers for the organism A and the organism B, each of the modified groups being an independent data file;
using, by the computer system, the k-mers of the modified groups as reference k-mers for comparison to computer queries and/or taxonomic classifications of k-mers of a sample in order to reduce query time and reduce computer storage on the available computer memory of the computer system, the sample comprising taxonomically unclassified sequenced nucleic acids of one or more organisms, wherein the computer queries and/or taxonomic classifications identifies at least one of the organisms of the sample;
generating a matrix M, wherein the matrix M includes genetic distances between genomes; and
performing a hash to determine the genetic distances, wherein the database is associated with pointers which point to rows in the database and wherein the k-mers are associated with the genomes.
2. The method of claim 1, wherein the taxonomy is based on calculated genetic distances.
3. The method of claim 2, wherein the genetic distances are genome-genome distances calculated using the MinHash algorithm.
4. The method of claim 1, wherein the modified groups are stored on different computer nodes when used for said computer queries and/or for taxonomic classifications.
5. The method of claim 1, wherein the removed k-mers are stored on a computer node separate from the modified groups.
6. The method of claim 1, wherein the removed k-mers are used to confirm identification of an organism found in the queries and/or the classifications.
7. The method of claim 1, wherein the one or more organisms are microorganisms selected from the group consisting of bacteria, fungi, viruses, protozoans, parasites, and combinations thereof.
8. The method of claim 1, wherein the sample is selected from the group consisting of environmental samples, medical samples, and food samples.
9. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising:
providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system;
assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system; and
in response to the database being greater than the available computer memory, removing, by the computer system, k-mers of the database that are classified to taxonomic levels above the threshold level, thereby forming a modified database having a size in bytes less than the database and suitable for the available computer memory of the computer system;
using, by the computer system, the k-mers of the modified database as reference k-mers for comparison to computer queries and/or taxonomic classifications of k-mers of a sample in order to reduce query time and reduce computer storage on the available computer memory of the computer system, the sample comprising taxonomically unclassified sequenced nucleic acids of one or more organisms, wherein the computer queries and/or taxonomic classifications identifies at least one of the organisms of the sample;
providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each node in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata;
generating a matrix M, wherein the matrix M includes genetic distances between genomes; and
performing a hash to determine the genetic distances, wherein the database is associated with pointers which point to rows in the database and wherein the k-mers are associated with the genomes.
10. The method of claim 9, wherein the taxonomic threshold level is selected from the group consisting of family, genus, species, sub-species, and strain.
11. The method of claim 9, wherein the taxonomic threshold level is selected by a machine using artificial intelligence.
12. A method for reducing computer memory requirements and increasing query speed to improve computational performance of a physical computer system configured to conduct taxonomic queries, comprising:
providing a database comprising k-mers of one or more organisms classified to a taxonomy, wherein the database is greater than available computer memory of a computer system;
assigning a taxonomic threshold level of the taxonomy, wherein the taxonomic threshold level is automatically assigned by the computer system; and
in response to the database being greater than the available computer memory, removing, by the computer system, k-mers of the database that are classified to taxonomic levels above the threshold level, thereby forming a modified database;
in response to the database being greater than the available computer memory, dividing, by the computer system, the modified database into two or more independent groups of k-mers for at least organism A and organism B, wherein each of the two or more groups comprises a unique set of nodes of the taxonomy and all k-mers of a given node of nodes reside in one of the groups, and wherein each of the groups is an independent data file;
providing the taxonomy as a self-consistent taxonomy that is independent of metadata associated with the k-mers from a standard taxonomy, wherein a map is generated that comprises associations of self-consistent identifications for each of the nodes in the self-consistent taxonomy to standard identifications in the standard taxonomy in response to the database being greater than the available computer memory, the self-consistent taxonomy being free of the metadata;
in response to the database being greater than the available computer memory, removing, by
Clustering; Classification · CPC title
using metadata automatically derived from the content · CPC title
Trees · CPC title
by searching ordered data, e.g. alpha-numerically ordered data · CPC title
Machine learning · CPC title