Phenotype/disease specific gene ranking using curated, gene library and network based data structures
US-2018095969-A1 · Apr 5, 2018 · US
US11354591B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11354591-B2 |
| Application number | US-201816157660-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 11, 2018 |
| Priority date | Oct 11, 2018 |
| Publication date | Jun 7, 2022 |
| Grant date | Jun 7, 2022 |
Official abstract text for this publication.
Mechanisms are provided to implement a genomic database curation (GDC) system. The GDC system generates a ground truth database based on a training subset of datasets from an uncurated large scale genomic database, and label metadata for the training subset. The GDC system trains at least one classification engine of the GDC system based on the training subset and the ground truth database at least by performing a machine learning operation on the at least one classification engine. The GDC system automatically applies the at least one trained classification engine on the uncurated large scale genomic database to generate an automatically curated large scale genomic database. A meta-classifier engine generates an output specifying at least one of significant gene signatures or gene pathways for at least one of diseases or drug agents based on the automatically curated large scale genomic database.
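The curation loop described in the abstract (build ground truth from a labeled training subset, train a classifier, then apply it to the uncurated datasets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it trains a single binary classifier, uses bag-of-words counts as a stand-in for the word embedding features named in the claims, and fits a plain logistic regression by gradient descent. The dataset names, sample texts, and the `is_cancer_study` label are all hypothetical.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary (stand-in for word embeddings)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression weights and bias by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(text, vocab, model):
    """Return True when the predicted probability of the positive class is >= 0.5."""
    w, b = model
    x = featurize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Hypothetical ground truth: free-text metadata paired with a curated label.
ground_truth = [
    ("breast cancer tumor biopsy expression", 1),
    ("tumor samples from cancer patients", 1),
    ("healthy control liver tissue", 0),
    ("mouse liver diet study control", 0),
]
vocab = sorted({word for text, _ in ground_truth for word in text.lower().split()})
X = [featurize(text, vocab) for text, _ in ground_truth]
y = [label for _, label in ground_truth]
model = train_logistic(X, y)

# Hypothetical uncurated datasets; curation attaches the predicted label as metadata.
uncurated = {
    "dataset_A": "cancer tumor cell line expression",
    "dataset_B": "control liver diet",
}
curated = {
    name: {"text": text, "is_cancer_study": predict(text, vocab, model)}
    for name, text in uncurated.items()
}
```

In the claimed system, several such models would each produce a different type of label (disease type, drug agent type, disease state), and the labeled datasets would then feed the meta-classifier.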
Opening claim text (preview).
What is claimed is:

1. A method, performed by a data processing system comprising at least one processor and at least one memory, the at least one memory comprising instructions executed by the at least one processor to configure the at least one processor to implement a genomic database curation system, wherein the genomic database curation (GDC) system operates to perform the method which comprises: generating, by the GDC system, a ground truth database based on both a training subset of datasets, from an uncurated genomic database, and label metadata for the training subset; automatically training, by automatically executed training logic of the GDC system, a plurality of classification computer models of the GDC system based on the training subset and the ground truth database at least by executing machine learning on the plurality of classification computer models, to thereby generate a plurality of trained classification computer models; automatically executing, by the GDC system, the plurality of trained classification computer models on the uncurated genomic database to generate an automatically curated genomic database; and generating, by a meta-classifier computer model, an output specifying at least one of gene signatures or gene pathways for at least one of diseases or drug agents based on the automatically curated genomic database, wherein each classification computer model, in the plurality of classification computer models, during the automatic training of the classification computer model, iteratively executes on word embedding features of the training subset to perform a computer regression operation and train the classification computer model based on results of the computer regression operation and the ground truth database, wherein each classification computer model is configured and automatically trained to generate a different type of classification output from each other classification computer model in the plurality of classification computer models, and wherein the types of classification outputs comprise at least one disease type classification, at least one drug agent type classification, and at least one disease state binary class label type classification.

2. The method of claim 1, wherein the uncurated genomic database comprises a plurality of gene expression signature datasets, each gene expression signature dataset being associated with a genomic study, and wherein each gene expression signature dataset comprises one or more sample entries.

3. The method of claim 2, wherein the training subset of datasets is a subset of the gene expression signature datasets, and wherein the method further comprises pre-curating each gene expression signature dataset in the training subset to extract a subset of features from the content of the gene expression signature dataset for correlation with the label metadata.

4. The method of claim 1, wherein the uncurated genomic database comprises gene expression signature datasets obtained from a plurality of different source computing devices, and wherein a plurality of the gene expression signature datasets from different source computing devices have differently formatted free-text portions of metadata and sample information content from each other.

5. The method of claim 1, wherein the plurality of classification computer model comprises: one or more first classification computer models, each first classification computer model being associated with a different disease than other first classification computer models in the one or more first classification computer models, wherein each first classification computer model executes on word embeddings of an input dataset and automatically generates a first output specifying a first classification value indicating whether a study associated with an input dataset is directed to identifying a particular disease that the first classification computer model is machine learning trained to identify in input features associated with studies; one or more second classification computer models, each second classification computer model being associated with a different drug agent than other second classification computer models in the one or more second classification computer models, wherein each second classification computer model executes on the word embeddings of the input dataset and automatically generates a second output specifying a second classification value indicating whether a study associated with the input dataset involves the corresponding drug agent that the second classification computer model is machine learning trained to identify in the input features associated with studies; a third classification computer model that executes on word embeddings of the input dataset and automatically generates a third output specifying a third classification value indicating whether one or more particular samples referenced in the input dataset has a corresponding disease state or not, to thereby generate a disease state binary class label; and one or more fourth classification computer models, wherein each fourth classification computer model executes and automatically generates a fourth output specifying a results of evaluating samples at each time point after a drug agent administration.

6. The method of claim 5, wherein each fourth classification computer model of the one or more fourth classification computer models evaluates a half maximal inhibitory concentration (IC50) value of a drug agent at a time point after administration of the drug agent.

7. The method of claim 1, wherein automatically executing the at least one trained classification computer model on the uncurated genomic database to generate an automatically curated genomic database comprises, for each uncurated dataset in the uncurated genomic database: executing computer natural language processing on the uncurated dataset to extract that extracts features from the uncurated dataset; processing, by the at least one trained classification computer model, the extracted features from the uncurated dataset to generate classification label metadata for the uncurated dataset; and storing the classification label metadata in association with the uncurated dataset to thereby generate a curated dataset.

8. The method of claim 1, wherein generating, by the meta-classifier computer model, the output comprises: identifying a subset of curated datasets in the curated genomic database that corresponds to at least one of a particular disease or a particular drug agent; and performing a statistical analysis of the subset of curated datasets to identify gene signatures associated with the particular disease or drug agent.

9. The method of claim 8, wherein generating, by the meta-classifier computer model, the output further comprises: combining, via one or more hierarchical random effect models of the meta-classifier computer model, separate datasets in the subset of curated datasets by merging individual signals of gene signatures of the individual datasets based on statistical scores associated with each of the gene signatures of the individual datasets and weight values associated with each of the individual datasets, wherein the weight values are based on a variance within each of the individual datasets.

10. The method of claim 8, further comprising: receiving, from a client computing device, a user request specifying at least one of a disease or drug agent criteria for identifying gene signatures or gene pathways, wherein the subset of curated datasets is a subset of curated datasets corresponding to at least one of a disease or drug agent specified in the at least one of a disease or drug agent criteria of the user request, and wherein gene
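Claim 9 describes merging per-dataset gene signals with weights based on the variance within each dataset. One common way to do this is a random-effects meta-analysis with inverse-variance weights. The sketch below uses the DerSimonian-Laird estimator for the between-dataset variance; that choice is an assumption made for illustration, since the claim text does not name a specific estimator. The function name and inputs are hypothetical.

```python
import math

def random_effects_combine(effects, variances):
    """Pool per-dataset effect sizes for one gene with a random-effects model.

    effects:   per-dataset effect estimates (e.g. log fold change for the gene)
    variances: within-dataset variances of those estimates
    Returns (pooled_effect, standard_error, tau2), where tau2 is the
    DerSimonian-Laird estimate of between-dataset variance.
    """
    k = len(effects)
    # Fixed-effect step: inverse-variance weights.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum_w

    # Cochran's Q measures heterogeneity across datasets.
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random-effects step: add tau2 to every within-dataset variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum_w_re
    return pooled, math.sqrt(1.0 / sum_w_re), tau2

# Two hypothetical datasets that disagree, with equal within-dataset variance.
pooled, se, tau2 = random_effects_combine([0.0, 2.0], [1.0, 1.0])
```

Datasets with smaller within-dataset variance receive larger weights, and a nonzero `tau2` flattens the weights when the datasets disagree with each other more than their own variances would explain.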
ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title
Data warehousing; Computing architectures · CPC title
ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title
Machine learning · CPC title
Indexing; Data structures therefor; Storage structures · CPC title