Feature selection for efficient epistasis modeling for phenotype prediction

US10108775B2 · US · B2

Patent metadata
Publication number: US-10108775-B2
Application number: US-201314030743-A
Country: US
Kind code: B2
Filing date: Sep 18, 2013
Priority date: Jan 21, 2013
Publication date: Oct 23, 2018
Grant date: Oct 23, 2018

Abstract

Official abstract text for this publication.

Various embodiments select markers for modeling epistasis effects. In one embodiment, a processor receives a set of genetic markers and a phenotype. A relevance score is determined with respect to the phenotype for each of the set of genetic markers. A threshold is set based on the relevance score of a genetic marker with a highest relevancy score. A relevance score is determined for at least one genetic marker in the set of genetic markers for at least one interaction between the at least one genetic marker and at least one other genetic marker in the set of genetic markers. The at least one interaction is added to a top-k feature set based on the relevance score of the at least one interaction satisfying the threshold.
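
The top-k bookkeeping described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `select_top_k`, the dictionary and pair inputs, and the example scores are all hypothetical. It also rests on two assumptions the abstract does not spell out: the threshold is taken to be the lowest score currently in the top-k set (seeded from the best single markers), and "satisfying the threshold" is read as strictly greater.

```python
import heapq
from itertools import count

def select_top_k(marker_scores, interaction_scores, k):
    """Return the k highest-scoring features and the final threshold.

    marker_scores: dict of marker name -> relevance score (hypothetical input)
    interaction_scores: iterable of ((marker_a, marker_b), score) pairs
    """
    tie = count()  # unique tie-breaker so features are never compared directly
    # Seed the set with the k best single markers, kept as a min-heap on score.
    heap = heapq.nlargest(
        k, ((score, next(tie), name) for name, score in marker_scores.items())
    )
    heapq.heapify(heap)
    # Assumption: the threshold is the lowest score currently in the set.
    threshold = heap[0][0]

    for pair, score in interaction_scores:
        if score > threshold:  # assumption: "satisfying" means strictly greater
            # Add the interaction and evict the lowest-scoring feature.
            heapq.heapreplace(heap, (score, next(tie), pair))
            threshold = heap[0][0]

    ranked = sorted(heap, reverse=True)
    return [(feature, score) for score, _, feature in ranked], threshold


top, threshold = select_top_k(
    {"m1": 0.30, "m2": 0.10, "m3": 0.20},
    [(("m1", "m2"), 0.25), (("m2", "m3"), 0.05)],
    k=2,
)
```

With these made-up scores, the interaction between `m1` and `m2` (0.25) displaces marker `m3` (0.20), so `top` holds `m1` and the `m1`/`m2` interaction, and the threshold rises to 0.25. A min-heap keeps both the eviction and the threshold lookup cheap, which matters when the number of candidate interactions grows quadratically with the number of markers.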

First claim

Opening claim text (preview).

What is claimed is:

1. An information processing system for reducing search time of features in sample data and reducing computation time when training a model for generating specialized data based on the features, the information processing system comprising:
a memory;
a processor communicatively coupled to the memory; and
a feature selection circuit coupled to the memory and the processor, wherein the feature selection circuit is configured to perform a plurality of operations comprising:
receiving a set of genetic markers and a phenotype from at least one external information processing system;
training an epistasis effect model based on the set of genetic markers and the phenotype;
reducing computation time of the information processing system during the training of the epistasis effect model and further reducing feature selection time of the epistasis effect model, wherein reducing the computation time and the feature selection time comprises:
determining, for each of the set of genetic markers, a relevance score with respect to the phenotype according to $I(x_j^{training}; c^{training})$, where $I$ is mutual information between a given genetic marker $x_j$ and a phenotype $c$, where mutual information $I$ between two variables $x$ and $y$ is defined, based on their joint marginal probabilities $p(x)$ and $p(y)$ and probabilistic distribution $p(x, y)$, as:

$$I(x, y) = \sum_{i,j} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)};$$

setting a threshold based on the relevance score of a genetic marker in the set of genetic markers with a highest relevancy score;
determining, for at least one individual genetic marker in the set of genetic markers having a relevance score satisfying the threshold, a relevance score for at least one interaction between the at least one individual genetic marker and at least one other individual genetic marker in the set of genetic markers;
adding the at least one interaction to a top-k feature set based on the relevance score of the at least one interaction satisfying the threshold, wherein the top-k feature set comprises one or more markers and one or more interactions, and wherein each of the one or more genetic markers and each of the one or more interactions comprises a top-k relevance score;
identifying a subset of the top-k feature set based on the set of genetic markers, wherein each feature in the subset of the top-k feature set maximizes a relevancy with the phenotype and minimizes a redundancy with respect to other selected features; and
training the epistasis effect model utilizing at least the subset of the top-k feature set, wherein the epistasis effect model predicts phenotypes for genetic markers;
storing the trained epistasis effect model in memory;
electronically obtaining, a new set of a set of genetic markers that is associated phenotype data; and
executing the trained epistasis effect model, wherein executing the trained epistasis effect model comprises inputting the new set of genetic markers into the trained epistasis effect model; and
outputting a phenotype for the new set of genetic markers, wherein the phenotype that was outputted was not made available to the feature selection circuit as part of the new set of genetic markers.

2. The information processing system of claim 1, wherein the method further comprises:
randomly sampling a subset of genetic markers from the set of genetic markers; and
selecting the at least one additional genetic marker from the subset of genetic markers.

3. The information processing system of claim 2, wherein determining the relevance score of the at least one interaction comprises:
determining a first set of relevance scores comprising a relevance score with respect to the phenotype for each of a first plurality of interactions between the at least one genetic marker and each of the subset of genetic markers;
determining, based on a normal distribution associated with the first set of relevance scores, a probability of the at least one genetic marker being associated with an interaction comprising a relevance score greater than the threshold;
comparing the probability to a probability threshold; and
determining, based on the probability satisfying the probability threshold, a second set of relevance scores comprising a relevance score for each of a second plurality of interactions between the at least one genetic marker and a remaining set of genetic markers in the set of genetic markers, wherein the second plurality of interactions comprises the at least one interaction, and wherein the remaining set of genetic markers comprises the at least one additional genetic marker.

4. The information processing system of claim 1, wherein the method further comprises:
generating, based on adding the at least one interaction to the top-k feature set, an updated top-k feature set by removing one of a genetic marker and an interaction associated with a lowest relevance score from the top-k feature set.

5. The information processing system of claim 1, wherein the method further comprises: updating the thres
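
As a rough illustration of the mutual-information relevance score in claim 1 and the sampling test in claim 3, the sketch below estimates I(x, y) from observed frequencies and then fits a normal distribution to the scores of a sampled set of interactions. Everything here is an assumption made for illustration: the function names, the encoding of an interaction as a joint symbol of two genotypes, the natural logarithm, and the reading of claim 3's probability as the normal tail probability of a single interaction score exceeding the threshold. The claims do not specify these details.

```python
import math
from collections import Counter
from statistics import NormalDist, mean, stdev

def mutual_information(x, y):
    """Plug-in estimate of I(x, y) for two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(x_i, y_j) * log(p(x_i, y_j) / (p(x_i) * p(y_j))) over observed pairs.
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

def interaction(a, b):
    """Hypothetical encoding: each pair of genotypes becomes one joint symbol."""
    return list(zip(a, b))

def prob_exceeds(marker, sampled_partners, phenotype, threshold):
    """Estimate P(interaction score > threshold) from a sampled subset.

    Needs at least two sampled partners so a standard deviation exists.
    """
    scores = [
        mutual_information(interaction(marker, partner), phenotype)
        for partner in sampled_partners
    ]
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        # Degenerate sample: every score is identical.
        return float(mu > threshold)
    # Tail probability under a normal fitted to the sampled scores.
    return 1.0 - NormalDist(mu, sigma).cdf(threshold)


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [0, 1, 1, 0]  # toy phenotype: the XOR of markers a and b

single = mutual_information(a, c)
pair = mutual_information(interaction(a, b), c)
p = prob_exceeds(a, [b, a], c, threshold=0.5)
```

In this toy data the phenotype is the XOR of the two markers, so marker `a` alone carries no information about `c` (`single` is 0), while the interaction of `a` and `b` determines it completely (`pair` is about log 2 ≈ 0.693 nats). That is the kind of purely epistatic signal a single-marker scan would miss. In practice the sampled partners would be drawn at random from the marker set, as claim 2 describes, and the resulting probability would be compared against a probability threshold before scoring the remaining interactions.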

Assignees

IBM

Inventors

Classifications

  • G06F19/12 (Primary)

    Physics · mapped topic

  • G16B5/20 (Primary)

    Probabilistic models · CPC title

  • ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10108775B2 cover?
Various embodiments select markers for modeling epistasis effects. In one embodiment, a processor receives a set of genetic markers and a phenotype. A relevance score is determined with respect to the phenotype for each of the set of genetic markers. A threshold is set based on the relevance score of a genetic marker with a highest relevancy score. A relevance score is determined for at least o…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F19/12. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 23, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).