What technology area does this patent fall under?

The primary CPC classification is G16B50/10 (Ontologies; Annotations). Mapped technology areas include Physics (CPC section G).

When was this patent published?

Publication date: Sep 24, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Identification of unknown genomes and closest known genomes

US12100486B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12100486-B2
Application number	US-202117321371-A
Country	US
Kind code	B2
Filing date	May 14, 2021
Priority date	May 14, 2021
Publication date	Sep 24, 2024
Grant date	Sep 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a deep learning algorithm that analyzes fragments of biological sequences. The input for the deep learning algorithm is a biological sequence fragment of unknown origin and the output is the closest known biological genome that could share phenotypic properties with the biological species of unknown origin. The workflow thus has application for genomic classification, identification of mutations within known genomes, and the identification of the closest class for unknown species.

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising: training a deep learning model via inputting a training set of biological sequence fragments into the deep learning model, wherein the deep learning model converts the biological sequence fragments into k-mers one-hot encoded into a binary matrix, wherein the deep learning model extracts feature vectors from the biological sequence fragments of the training set, wherein the deep learning model is implemented on a computer, wherein the deep learning model maps the extracted feature vectors into a genome features space to identify clusters of extracted features, wherein the deep learning model computes an intra-cluster density for each extracted feature from the training set mapped in the genome features space, wherein the training is supervised training that uses labels of the training set to respectively associate the extracted features with genomes of the training set, wherein the training comprises the deep learning model determining an average training value of a training distribution among each of the extracted features, and wherein the training comprises the deep learning model heuristically determining a respective first threshold value for each of the extracted features from the training set; and inputting a sample set of biological sequence fragments into the trained deep learning model, wherein the sample set of the biological sequence fragments comprises a first sample and a second sample, wherein, in response to the inputting of the sample set, the trained deep learning model: converts the first sample and the second sample into k-mers one-hot encoded into a respective binary matrix, classifies each k-mer of the first sample and the second sample as a known k-mer from the training set or an unknown k-mer, predicts a first class for the first sample and a second class for the second sample based on the labels of the training set, maps the first sample and the second sample into the genome features space to determine, respectively, first values for the first samples for the features and second values for the second samples for the features, computes an output score for the first sample for each of the features and for the second sample for each of the features, wherein the output score is the difference between the respective first or second value and the average training value for the respective feature for the predicted first or second class, respectively, computes a respective degree of divergence for the first sample for each of the features and for the second sample for each of the features, the respective degree of divergence being a difference between the respective output score of the first or second sample for the feature and the intra-cluster density for the feature, compares the degrees of divergence of the first and the second samples for each of the features to the respective first threshold values that were heuristically determined from the training set, in response to the comparing of the degrees of divergence for the first sample to the first threshold values not indicating an anomaly, provides for the first sample a classification output comprising the predicted first class, and in response to the comparing of the degrees of divergence for the second sample to the first threshold values indicating an anomaly, provides for the second sample a classification output comprising an indication of an anomaly and the predicted second class as a closest genome to the second sample.

2. The method of claim 1, wherein the biological sequence fragments are selected from the group consisting of a genomic sequence, a gene sequence, a protein sequence, and a protein domain sequence.

3. The method of claim 2, wherein the genomic sequence is a microbe genomic sequence.

4. The method of claim 1, wherein the deep learning algorithm is a convolution neural network.

5. The method of claim 4, wherein the convolution neural network comprises a max pooling algorithm.

6. The method of claim 1, wherein the comparing of the degrees of divergence to the respective first threshold values comprises: counting a number of instances in which the degree of divergence for a feature exceeds the first threshold value for the feature, and comparing the number of instances to a second threshold value, wherein the comparing of the degrees of divergence to the first threshold value does not indicate an anomaly when the number of instances does not exceed the second threshold value, and wherein the comparing of the degrees of divergence to the first threshold value does not indicate an anomaly when the number of instances does not exceed the second threshold value, and wherein the comparing of the degrees of divergence to the first threshold value indicates the anomaly when the number of instances exceeds the second threshold value.

7. The method of claim 6, wherein the second threshold value is determined heuristically with a linear search.

8. The method of claim 7, wherein the heuristic determinations of the first and the second threshold values are determined under the constraint that a specific number of k-mers among the genomes of the training set are outliers.

9. A computer program product comprising one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising: program instructions for training a deep learning model via inputting a training set of biological sequence fragments into the deep learning model, wherein the deep learning model converts the biological sequence fragments into k-mers one-hot encoded into a binary matrix, wherein the deep learning model extracts feature vectors from the biological sequence fragments of the training set, wherein the deep learning model is implemented on a computer; wherein the deep learning model maps the extracted feature vectors into a genome features space to identify clusters of extracted features, wherein the deep learning model computes an intra-cluster density for each extracted feature from the training set mapped in the genome features space, wherein the training is supervised training that uses labels of the training set to respectively associate the extracted features with genomes of the training set, wherein the training comprises the deep learning model determining an average training value of a training distribution among each of the extracted features, and wherein the training comprises the deep learning model heuristically determining a respective first threshold value for each of the extracted features from the training set; and program instructions for inputting a sample set of biological sequence fragments into the deep learning model, wherein the sample set of the biological sequence fragments comprises a first sample and a second sample, wherein, in response to the inputting of the sample set, the trained deep learning model: converts the first sample and the second sample into k-mers one-hot encoded into a respective binary matrix, classifies each k-mer of the first sample and of the second sample as a known k-mer from the training set or an unknown k-mer, predicts a first class for the first sample and a second class for the second sample based on the labels of the training set, maps the first sample and the second sample into the genome features space to determine, respectively, first values for the first samples for the features and second values for the second samples for the features, computes an output score for the first sample for each of the features and for the second sample for each of the features, wherein the output score is the difference between the respecti…
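The scoring and anomaly decision recited in claims 1 and 6 can be sketched in a few lines of Python. This is only an illustrative reading of the claim language, not the patented implementation: the `ClassStats` container, the function names, the signed (rather than absolute) differences, and all numeric values are assumptions. The heuristic search for the thresholds (claims 7 and 8) is not shown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ClassStats:
    """Per-feature training statistics for one predicted class (hypothetical container)."""
    mean: Sequence[float]             # average training value per feature
    density: Sequence[float]          # intra-cluster density per feature
    first_threshold: Sequence[float]  # heuristically chosen threshold per feature


def degrees_of_divergence(values, stats):
    """Output score = value - average training value; divergence = score - density.

    Using signed differences is an assumption; the claim only says "difference".
    """
    divergences = []
    for value, mean, density in zip(values, stats.mean, stats.density):
        output_score = value - mean
        divergences.append(output_score - density)
    return divergences


def classify(values, predicted_class, stats, second_threshold):
    """Flag an anomaly when too many features exceed their first threshold (claim 6)."""
    divergences = degrees_of_divergence(values, stats)
    exceed_count = sum(
        1 for d, t in zip(divergences, stats.first_threshold) if d > t
    )
    if exceed_count > second_threshold:
        # Anomalous sample: report the predicted class as the closest known genome.
        return {"anomaly": True, "closest_genome": predicted_class}
    return {"anomaly": False, "predicted_class": predicted_class}


if __name__ == "__main__":
    # Made-up statistics for a hypothetical class "genome_A" with three features.
    stats = ClassStats(
        mean=[0.5, 0.2, 0.9],
        density=[0.1, 0.1, 0.1],
        first_threshold=[0.2, 0.2, 0.2],
    )
    print(classify([0.6, 0.3, 1.0], "genome_A", stats, second_threshold=1))
    print(classify([1.5, 1.2, 1.0], "genome_A", stats, second_threshold=1))
```

In this toy run the first sample sits close to the training averages and is reported with its predicted class, while the second exceeds the first threshold on two of three features and is reported as an anomaly with the predicted class as its closest genome.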

Assignees

IBM

Inventors

Classifications

G16B30/10 · Sequence alignment; Homology search
G06F16/285 · Clustering or classification
G16B40/20 · Supervised data analysis
G06N3/08 · Learning methods
G16B50/10 (primary) · Ontologies; Annotations

Patent family

Related publications grouped by family.

Patent family 83947858

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12100486B2 cover?: Provided is a deep learning algorithm that analyzes fragments of biological sequences. The input for the deep learning algorithm is a biological sequence fragment of unknown origin and the output is the closest known biological genome that could share phenotypic properties with the biological species of unknown origin. The workflow thus has application for genomic classification, identification…
Who is the assignee on this patent?: IBM