Identification of unknown genomes and closest known genomes

US12100486B2 · US · B2

Patent metadata
Publication number: US-12100486-B2
Application number: US-202117321371-A
Country: US
Kind code: B2
Filing date: May 14, 2021
Priority date: May 14, 2021
Publication date: Sep 24, 2024
Grant date: Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a deep learning algorithm that analyzes fragments of biological sequences. The input for the deep learning algorithm is a biological sequence fragment of unknown origin and the output is the closest known biological genome that could share phenotypic properties with the biological species of unknown origin. The workflow thus has application for genomic classification, identification of mutations within known genomes, and the identification of the closest class for unknown species.
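
As a rough illustration of how a sequence fragment might be prepared as model input: claim 1 on this page states that fragments are converted into k-mers that are one-hot encoded into a binary matrix. The sketch below shows one way that could look. The value of k, the stride of 1, the DNA alphabet, the handling of unknown symbols, and the function names are all assumptions for illustration; none of them is specified on this page.

```python
ALPHABET = "ACGT"  # assumed nucleotide alphabet; the claims also cover protein sequences

def split_into_kmers(fragment, k):
    """Return the overlapping k-mers of a fragment (stride 1 is an assumption)."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]

def one_hot_encode(kmer, alphabet=ALPHABET):
    """Encode a k-mer as a k x len(alphabet) binary matrix (a list of rows)."""
    matrix = []
    for symbol in kmer:
        # Symbols outside the alphabet (for example N) become an all-zero row.
        matrix.append([1 if symbol == letter else 0 for letter in alphabet])
    return matrix

kmers = split_into_kmers("ACGTAC", 3)
encoded = [one_hot_encode(kmer) for kmer in kmers]

print(kmers)       # ['ACG', 'CGT', 'GTA', 'TAC']
print(encoded[0])  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```

Each k-mer becomes one small binary matrix, so a fragment of length n yields n - k + 1 matrices under these assumptions.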

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising:

    training a deep learning model via inputting a training set of biological sequence fragments into the deep learning model, wherein the deep learning model converts the biological sequence fragments into k-mers one-hot encoded into a binary matrix, wherein the deep learning model extracts feature vectors from the biological sequence fragments of the training set, wherein the deep learning model is implemented on a computer, wherein the deep learning model maps the extracted feature vectors into a genome features space to identify clusters of extracted features, wherein the deep learning model computes an intra-cluster density for each extracted feature from the training set mapped in the genome features space, wherein the training is supervised training that uses labels of the training set to respectively associate the extracted features with genomes of the training set, wherein the training comprises the deep learning model determining an average training value of a training distribution among each of the extracted features, and wherein the training comprises the deep learning model heuristically determining a respective first threshold value for each of the extracted features from the training set; and

    inputting a sample set of biological sequence fragments into the trained deep learning model, wherein the sample set of the biological sequence fragments comprises a first sample and a second sample, wherein, in response to the inputting of the sample set, the trained deep learning model:

        converts the first sample and the second sample into k-mers one-hot encoded into a respective binary matrix,

        classifies each k-mer of the first sample and the second sample as a known k-mer from the training set or an unknown k-mer,

        predicts a first class for the first sample and a second class for the second sample based on the labels of the training set,

        maps the first sample and the second sample into the genome features space to determine, respectively, first values for the first samples for the features and second values for the second samples for the features,

        computes an output score for the first sample for each of the features and for the second sample for each of the features, wherein the output score is the difference between the respective first or second value and the average training value for the respective feature for the predicted first or second class, respectively,

        computes a respective degree of divergence for the first sample for each of the features and for the second sample for each of the features, the respective degree of divergence being a difference between the respective output score of the first or second sample for the feature and the intra-cluster density for the feature,

        compares the degrees of divergence of the first and the second samples for each of the features to the respective first threshold values that were heuristically determined from the training set,

        in response to the comparing of the degrees of divergence for the first sample to the first threshold values not indicating an anomaly, provides for the first sample a classification output comprising the predicted first class, and

        in response to the comparing of the degrees of divergence for the second sample to the first threshold values indicating an anomaly, provides for the second sample a classification output comprising an indication of an anomaly and the predicted second class as a closest genome to the second sample.

2. The method of claim 1, wherein the biological sequence fragments are selected from the group consisting of a genomic sequence, a gene sequence, a protein sequence, and a protein domain sequence.

3. The method of claim 2, wherein the genomic sequence is a microbe genomic sequence.

4. The method of claim 1, wherein the deep learning algorithm is a convolution neural network.

5. The method of claim 4, wherein the convolution neural network comprises a max pooling algorithm.

6. The method of claim 1, wherein the comparing of the degrees of divergence to the respective first threshold values comprises: counting a number of instances in which the degree of divergence for a feature exceeds the first threshold value for the feature, and comparing the number of instances to a second threshold value, wherein the comparing of the degrees of divergence to the first threshold value does not indicate an anomaly when the number of instances does not exceed the second threshold value, and wherein the comparing of the degrees of divergence to the first threshold value does not indicate an anomaly when the number of instances does not exceed the second threshold value, and wherein the comparing of the degrees of divergence to the first threshold value indicates the anomaly when the number of instances exceeds the second threshold value.

7. The method of claim 6, wherein the second threshold value is determined heuristically with a linear search.

8. The method of claim 7, wherein the heuristic determinations of the first and the second threshold values are determined under the constraint that a specific number of k-mers among the genomes of the training set are outliers.

9. A computer program product comprising one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising:

    program instructions for training a deep learning model via inputting a training set of biological sequence fragments into the deep learning model, wherein the deep learning model converts the biological sequence fragments into k-mers one-hot encoded into a binary matrix, wherein the deep learning model extracts feature vectors from the biological sequence fragments of the training set, wherein the deep learning model is implemented on a computer; wherein the deep learning model maps the extracted feature vectors into a genome features space to identify clusters of extracted features, wherein the deep learning model computes an intra-cluster density for each extracted feature from the training set mapped in the genome features space, wherein the training is supervised training that uses labels of the training set to respectively associate the extracted features with genomes of the training set, wherein the training comprises the deep learning model determining an average training value of a training distribution among each of the extracted features, and wherein the training comprises the deep learning model heuristically determining a respective first threshold value for each of the extracted features from the training set; and

    program instructions for inputting a sample set of biological sequence fragments into the deep learning model, wherein the sample set of the biological sequence fragments comprises a first sample and a second sample, wherein, in response to the inputting of the sample set, the trained deep learning model:

        converts the first sample and the second sample into k-mers one-hot encoded into a respective binary matrix,

        classifies each k-mer of the first sample and of the second sample as a known k-mer from the training set or an unknown k-mer,

        predicts a first class for the first sample and a second class for the second sample based on the labels of the training set,

        maps the first sample and the second sample into the genome features space to determine, respectively, first values for the first samples for the features and second values for the second samples for the features,

        computes an output score for the first sample for each of the features and for the second sample for each of the features, wherein the output score is the difference between the respecti
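
The scoring and anomaly steps recited in claims 1, 6, and 7 can be sketched roughly as follows. This is an interpretation under stated assumptions, not the patented implementation: the claims say only "difference", so the signed subtraction here is an assumption, as are the function names, the data layout, the example numbers, and the reading of the linear search as "the smallest second threshold that flags at most a chosen number of training items".

```python
def count_exceedances(values, class_means, densities, first_thresholds):
    """Count features whose degree of divergence exceeds its first threshold."""
    count = 0
    for value, mean, density, threshold in zip(
            values, class_means, densities, first_thresholds):
        output_score = value - mean          # sample value minus average training value
        divergence = output_score - density  # output score minus intra-cluster density
        if divergence > threshold:
            count += 1
    return count

def classify(values, predicted_class, means_by_class, densities,
             first_thresholds, second_threshold):
    """Return the predicted class plus an anomaly flag (claim 6 style counting)."""
    count = count_exceedances(values, means_by_class[predicted_class],
                              densities, first_thresholds)
    return {"closest_genome": predicted_class,
            "anomaly": count > second_threshold}

def search_second_threshold(training_counts, allowed_outliers):
    """Linear search (assumed reading of claim 7) over integer candidates."""
    for candidate in range(max(training_counts) + 1):
        flagged = sum(1 for c in training_counts if c > candidate)
        if flagged <= allowed_outliers:
            return candidate
    return max(training_counts)

# Hypothetical numbers for one class and three features.
means_by_class = {"genome_A": [0.5, 0.2, 0.9]}
densities = [0.1, 0.1, 0.1]
first_thresholds = [0.2, 0.2, 0.2]
second_threshold = search_second_threshold([0, 0, 1, 1, 3], allowed_outliers=1)

first = classify([0.6, 0.3, 1.0], "genome_A", means_by_class,
                 densities, first_thresholds, second_threshold)
second = classify([1.5, 1.2, 0.9], "genome_A", means_by_class,
                  densities, first_thresholds, second_threshold)
print(second_threshold, first, second)
```

With these made-up values the first sample stays within the thresholds and is reported with its predicted class only, while the second sample exceeds the first threshold on two features and is reported as an anomaly with `genome_A` as its closest genome.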

Assignees

Inventors

Classifications

  • Sequence alignment; Homology search · CPC title

  • Clustering or classification · CPC title

  • Supervised data analysis · CPC title

  • Learning methods · CPC title

  • G16B50/10 · Primary

    Ontologies; Annotations · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12100486B2 cover?
Provided is a deep learning algorithm that analyzes fragments of biological sequences. The input for the deep learning algorithm is a biological sequence fragment of unknown origin and the output is the closest known biological genome that could share phenotypic properties with the biological species of unknown origin. The workflow thus has application for genomic classification, identification…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G16B50/10. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 24, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).