What technology area does this patent fall under?

Primary CPC classification: G06N20/00 (Machine learning). Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 7, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Identifying gene signatures and corresponding biological pathways based on an automatically curated genomic database

US11354591B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11354591-B2
Application number	US-201816157660-A
Country	US
Kind code	B2
Filing date	Oct 11, 2018
Priority date	Oct 11, 2018
Publication date	Jun 7, 2022
Grant date	Jun 7, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Mechanisms are provided to implement a genomic database curation (GDC) system. The GDC system generates a ground truth database based on a training subset of datasets from an uncurated large scale genomic database, and label metadata for the training subset. The GDC system trains at least one classification engine of the GDC system based on the training subset and the ground truth database at least by performing a machine learning operation on the at least one classification engine. The GDC system automatically applies the at least one trained classification engine on the uncurated large scale genomic database to generate an automatically curated large scale genomic database. A meta-classifier engine generates an output specifying at least one of significant gene signatures or gene pathways for at least one of diseases or drug agents based on the automatically curated large scale genomic database.
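The curation flow in the abstract (build labeled training data, train classifiers, apply them to uncurated records) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the study descriptions, the `training_subset` and `uncurated` names, and the helper functions are all hypothetical, and bag-of-words counts stand in for the word embedding features mentioned in the claims. A plain logistic regression trained by gradient descent plays the role of the iterative regression step.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary (a stand-in for word embeddings)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(text, vocab, model):
    """Return True when the model assigns probability >= 0.5 to the label."""
    w, b = model
    x = featurize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Hypothetical training subset: free-text study metadata plus label metadata
# (together these play the role of the "ground truth database").
training_subset = [
    ("breast cancer tumor biopsy expression profiling", {"disease": "breast cancer"}),
    ("tumor samples from breast cancer patients", {"disease": "breast cancer"}),
    ("asthma airway epithelium expression study", {"disease": "asthma"}),
    ("airway samples from asthma patients", {"disease": "asthma"}),
]
vocab = sorted({tok for text, _ in training_subset for tok in text.lower().split()})
X = [featurize(text, vocab) for text, _ in training_subset]

# One binary classifier per disease label.
diseases = sorted({labels["disease"] for _, labels in training_subset})
models = {}
for disease in diseases:
    y = [1.0 if labels["disease"] == disease else 0.0 for _, labels in training_subset]
    models[disease] = train_logistic(X, y)

# Apply the trained classifiers to uncurated records and attach label metadata.
uncurated = ["expression in breast tumor tissue", "asthma airway brushing samples"]
curated = [
    {"text": text, "diseases": [d for d in diseases if predict(text, vocab, models[d])]}
    for text in uncurated
]
```

Training one binary model per label mirrors the claims' description of separate classification models producing different output types; a real system would presumably use richer text features and far more training data.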

First claim

Opening claim text (preview).

What is claimed is:

1. A method, performed by a data processing system comprising at least one processor and at least one memory, the at least one memory comprising instructions executed by the at least one processor to configure the at least one processor to implement a genomic database curation system, wherein the genomic database curation (GDC) system operates to perform the method which comprises: generating, by the GDC system, a ground truth database based on both a training subset of datasets, from an uncurated genomic database, and label metadata for the training subset; automatically training, by automatically executed training logic of the GDC system, a plurality of classification computer models of the GDC system based on the training subset and the ground truth database at least by executing machine learning on the plurality of classification computer models, to thereby generate a plurality of trained classification computer models; automatically executing, by the GDC system, the plurality of trained classification computer models on the uncurated genomic database to generate an automatically curated genomic database; and generating, by a meta-classifier computer model, an output specifying at least one of gene signatures or gene pathways for at least one of diseases or drug agents based on the automatically curated genomic database, wherein each classification computer model, in the plurality of classification computer models, during the automatic training of the classification computer model, iteratively executes on word embedding features of the training subset to perform a computer regression operation and train the classification computer model based on results of the computer regression operation and the ground truth database, wherein each classification computer model is configured and automatically trained to generate a different type of classification output from each other classification computer model in the plurality of classification computer models, and wherein the types of classification outputs comprise at least one disease type classification, at least one drug agent type classification, and at least one disease state binary class label type classification.

2. The method of claim 1, wherein the uncurated genomic database comprises a plurality of gene expression signature datasets, each gene expression signature dataset being associated with a genomic study, and wherein each gene expression signature dataset comprises one or more sample entries.

3. The method of claim 2, wherein the training subset of datasets is a subset of the gene expression signature datasets, and wherein the method further comprises pre-curating each gene expression signature dataset in the training subset to extract a subset of features from the content of the gene expression signature dataset for correlation with the label metadata.

4. The method of claim 1, wherein the uncurated genomic database comprises gene expression signature datasets obtained from a plurality of different source computing devices, and wherein a plurality of the gene expression signature datasets from different source computing devices have differently formatted free-text portions of metadata and sample information content from each other.

5. The method of claim 1, wherein the plurality of classification computer model comprises: one or more first classification computer models, each first classification computer model being associated with a different disease than other first classification computer models in the one or more first classification computer models, wherein each first classification computer model executes on word embeddings of an input dataset and automatically generates a first output specifying a first classification value indicating whether a study associated with an input dataset is directed to identifying a particular disease that the first classification computer model is machine learning trained to identify in input features associated with studies; one or more second classification computer models, each second classification computer model being associated with a different drug agent than other second classification computer models in the one or more second classification computer models, wherein each second classification computer model executes on the word embeddings of the input dataset and automatically generates a second output specifying a second classification value indicating whether a study associated with the input dataset involves the corresponding drug agent that the second classification computer model is machine learning trained to identify in the input features associated with studies; a third classification computer model that executes on word embeddings of the input dataset and automatically generates a third output specifying a third classification value indicating whether one or more particular samples referenced in the input dataset has a corresponding disease state or not, to thereby generate a disease state binary class label; and one or more fourth classification computer models, wherein each fourth classification computer model executes and automatically generates a fourth output specifying a results of evaluating samples at each time point after a drug agent administration.

6. The method of claim 5, wherein each fourth classification computer model of the one or more fourth classification computer models evaluates a half maximal inhibitory concentration (IC50) value of a drug agent at a time point after administration of the drug agent.

7. The method of claim 1, wherein automatically executing the at least one trained classification computer model on the uncurated genomic database to generate an automatically curated genomic database comprises, for each uncurated dataset in the uncurated genomic database: executing computer natural language processing on the uncurated dataset to extract that extracts features from the uncurated dataset; processing, by the at least one trained classification computer model, the extracted features from the uncurated dataset to generate classification label metadata for the uncurated dataset; and storing the classification label metadata in association with the uncurated dataset to thereby generate a curated dataset.

8. The method of claim 1, wherein generating, by the meta-classifier computer model, the output comprises: identifying a subset of curated datasets in the curated genomic database that corresponds to at least one of a particular disease or a particular drug agent; and performing a statistical analysis of the subset of curated datasets to identify gene signatures associated with the particular disease or drug agent.

9. The method of claim 8, wherein generating, by the meta-classifier computer model, the output further comprises: combining, via one or more hierarchical random effect models of the meta-classifier computer model, separate datasets in the subset of curated datasets by merging individual signals of gene signatures of the individual datasets based on statistical scores associated with each of the gene signatures of the individual datasets and weight values associated with each of the individual datasets, wherein the weight values are based on a variance within each of the individual datasets.

10. The method of claim 8, further comprising: receiving, from a client computing device, a user request specifying at least one of a disease or drug agent criteria for identifying gene signatures or gene pathways, wherein the subset of curated datasets is a subset of curated datasets corresponding to at least one of a disease or drug agent specified in the at least one of a disease or drug agent criteria of the user request, and wherein gene
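Claim 9 describes merging per-dataset gene signals with weights based on within-dataset variance, using hierarchical random effect models. The claim preview does not name a specific estimator, so the sketch below swaps in the standard DerSimonian-Laird random-effects method as one plausible illustration. The function name, the per-gene effect sizes, and the variances are hypothetical.

```python
import math

def random_effects_combine(effects, variances):
    """Pool per-dataset effect sizes for one gene with DerSimonian-Laird weights.

    effects   -- one effect size (e.g. log fold-change) per dataset
    variances -- the within-dataset variance of each effect size
    Returns (pooled_effect, standard_error, tau2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]            # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity between datasets.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    # Between-dataset variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to each within-dataset variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical effect sizes for one gene observed in two curated datasets.
pooled, se, tau2 = random_effects_combine([0.0, 2.0], [1.0, 1.0])
z_score = pooled / se  # a simple score for ranking genes across datasets
```

Datasets with lower within-dataset variance receive larger weights, and the between-dataset term `tau2` moves the weights closer to equal when the datasets disagree with each other.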

Assignees

IBM

Inventors

Classifications

G16B20/00
ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title
G16B50/30
Data warehousing; Computing architectures · CPC title
G16B40/00
ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title
G06N20/00 (Primary)
Machine learning · CPC title
G06F16/22
Indexing; Data structures therefor; Storage structures · CPC title

Patent family

Related publications grouped by family.

View patent family 70161390

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11354591B2 cover?: Mechanisms are provided to implement a genomic database curation (GDC) system. The GDC system generates a ground truth database based on a training subset of datasets from an uncurated large scale genomic database, and label metadata for the training subset. The GDC system trains at least one classification engine of the GDC system based on the training subset and the ground truth database at l…
Who is the assignee on this patent?: IBM