Machine Learning Platform for Polygenic Models

US2025266129A1 · US · A1

Patent metadata
Publication number: US-2025266129-A1
Application number: US-202519200097-A
Country: US
Kind code: A1
Filing date: May 6, 2025
Priority date: May 27, 2020
Publication date: Aug 21, 2025
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed embodiments concern methods, apparatus, systems, and computer program products for developing polygenic risk score (PRS) models. In some implementations, a fully automated process is provided that allows for a PRS model to be defined by an initial set of parameters. In some implementations the PRS models are trained to provide a PRS for particular populations.
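
For readers new to the term, a polygenic risk score in its common additive form is a weighted sum of an individual's risk-allele counts. The sketch below illustrates only that general idea; the SNP names, effect sizes, and the function `polygenic_risk_score` are hypothetical, and the models described in this publication may compute scores differently.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: sum of (effect size x allele count) over scored SNPs.

    dosages: SNP id -> number of risk alleles carried (0, 1, or 2).
    effect_sizes: SNP id -> per-allele weight (hypothetical values here).
    SNPs without a weight are ignored.
    """
    return sum(
        effect_sizes[snp] * count
        for snp, count in dosages.items()
        if snp in effect_sizes
    )


# Hypothetical weights and one individual's genotype, for illustration only.
effect_sizes = {"snp_a": 0.12, "snp_b": -0.05, "snp_c": 0.30}
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0, "snp_d": 1}

score = polygenic_risk_score(dosages, effect_sizes)
print(round(score, 2))
```

Here `snp_d` contributes nothing because it has no weight, which mirrors how a score is restricted to the SNPs a given model includes.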

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system comprising: one or more processors; cache memory; and mass storage memory containing computer-readable instructions that, when executed by the one or more processor, cause the computing system for perform operations comprising: based on genetic and condition data of a plurality of individuals, determining a population-specific genetic and condition dataset; determining, for the population-specific genetic and condition dataset: a plurality of population-specific single-nucleotide polymorphism (SNP) training sets that are statistically associated with a predetermined condition, and an SNP validation set; loading, into the cache memory, the plurality of population-specific SNP training sets and the SNP validation set; training, in parallel by accessing the cache memory, a plurality of population-specific machine learning models to predict respective probabilities of individuals exhibiting the predetermined condition based on the genetic and condition data of the individuals, wherein the plurality of population-specific machine learning models are trained using: the plurality of population-specific SNP training sets in the cache memory, correlations between the population-specific SNP training sets and the predetermined condition, and respective sets of parameters, wherein the respective sets of parameters are different for each of the plurality of population-specific machine learning models and include model hyperparameters used in training of the plurality of population-specific machine learning models; based on the SNP validation set in the cache memory, determining performance metrics for each of the population-specific machine learning models; and based on the performance metrics, selecting a particular machine learning model from the plurality of population-specific machine learning models, wherein the particular machine learning model is selected based on having a best performance metric of the plurality of population-specific machine learning models.

2. The computing system of claim 1, wherein the operations further comprise: training a new machine learning model to predict respective probabilities of the individuals exhibiting the predetermined condition based on the genetic and condition data of the individuals, wherein the new machine learning model is trained using: a population-specific SNP training set in the cache memory that was used in the training of the particular machine learning model, the SNP validation set in the cache memory, the correlations between the population-specific SNP training sets and the predetermined condition, and a particular set of the parameters that was used in the training of the particular machine learning model.

3. The computing system of claim 1, wherein the operations further comprise: determining that the genetic and condition data of a particular individual from the plurality of population-specific SNP training sets or the SNP validation set has been stored in the cache memory for more than a threshold period of time; and deleting, from the cache memory, the genetic and condition data of the particular individual.

4. The computing system of claim 1, wherein the operations further comprise: determining that the genetic and condition data of a particular individual from the plurality of population-specific SNP training sets or the SNP validation set is subject to a deletion request; and deleting, from the cache memory, the genetic and condition data of the particular individual.

5. The computing system of claim 1, wherein determining the plurality of population-specific SNP training sets and the SNP validation set comprises: dividing the population-specific genetic and condition dataset into at least the plurality of population-specific SNP training sets and the SNP validation set.

6. The computing system of claim 1, wherein the predetermined condition is obtained from a user of the computing system.

7. The computing system of claim 1, wherein the genetic and condition data of the plurality of individuals includes indications of presence or absence of the predetermined condition.

8. The computing system of claim 1, wherein the condition data of the plurality of individuals includes one or more of: answers to survey questions, family history, medical records, biomarkers, or data from one or more wearable sensors.

9. The computing system of claim 1, wherein the plurality of individuals includes greater than 10,000,000 individuals, and wherein the plurality of population-specific SNP training sets in the cache memory represent genetic data from between 100,000 and 1,000,000 individuals.

10. The computing system of claim 9, wherein the correlations are from a genome wide association study (GWAS) on the genetic data and the predetermined condition.

11. The computing system of claim 1, wherein the plurality of population-specific machine learning models comprise a population-specific machine learning model for one or more ethnicities of: European, African American, Sub-Saharan African, North Africa, LatinX, Central America, East Asian, South Asian, Southeast Asian, West Asian, and Central Asian.

12. The computing system of claim 1, wherein the plurality of population-specific SNP training sets represent individuals of European ethnicity, and wherein the SNP validation set represents individuals of Hispanic ethnicity.

13. A computer-implemented method comprising: based on genetic and condition data of a plurality of individuals, determining a population-specific genetic and condition dataset; determining, for the population-specific genetic and condition dataset: a plurality of population-specific single-nucleotide polymorphism (SNP) training sets that are statistically associated with a predetermined condition, and an SNP validation set; loading, into a cache memory, the plurality of population-specific SNP training sets and the SNP validation set; training, in parallel by accessing the cache memory, a plurality of population-specific machine learning models to predict respective probabilities of individuals exhibiting the predetermined condition based on the genetic and condition data of the individuals, wherein the plurality of population-specific machine learning models are trained using: the plurality of population-specific SNP training sets in the cache memory, correlations between the population-specific SNP training sets and the predetermined condition, and respective sets of parameters, wherein the respective sets of parameters are different for each of the plurality of population-specific machine learning models and include model hyperparameters used in training of the plurality of population-specific machine learning models; based on the SNP validation set in the cache memory, determining performance metrics for each of the population-specific machine learning models; and based on the performance metrics, selecting a particular machine learning model from the plurality of population-specific machine learning models, wherein the particular machine learning model is selected based on having a best performance metric of the plurality of population-specific machine learning models.

14. The computer-implemented method of claim 13, further comprising: training a new machine learning model to predict respective probabilities of the individuals exhibiting the predetermined condition based on the genetic and condition data of the individuals, wherein the new machine learning model is trained using: a population-specific SNP training set in the cache memory that was used in the
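
The core loop recited in claims 1 and 13 (train several models with different parameter sets in parallel, score each on a validation set, keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the model is a small logistic regression, the hyperparameters `lr` and `l2`, the accuracy metric, the toy genotype rows, and all function names are hypothetical, a single shared training set stands in for the plurality of training sets, and the GWAS-derived correlations and cache loading are omitted.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def predict(weights, bias, dosages):
    """Predicted probability of the condition for one individual."""
    z = bias + sum(w * x for w, x in zip(weights, dosages))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp stays in range
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, lr, l2, epochs=200):
    """Fit logistic-regression weights with batch gradient descent."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            grad_w = [g + err * xj for g, xj in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(model, rows, labels):
    """Fraction of individuals classified correctly at a 0.5 cutoff."""
    weights, bias = model
    hits = sum(
        (predict(weights, bias, x) >= 0.5) == bool(y)
        for x, y in zip(rows, labels)
    )
    return hits / len(rows)


# Toy allele counts (0, 1, or 2 per SNP) and case/control labels; made up.
TRAIN_X = [[2, 0, 1], [1, 0, 2], [2, 1, 2], [0, 2, 0],
           [0, 1, 0], [1, 2, 0], [2, 0, 2], [0, 2, 1]]
TRAIN_Y = [1, 1, 1, 0, 0, 0, 1, 0]
VAL_X = [[2, 1, 1], [0, 2, 0], [1, 0, 2], [0, 1, 1]]
VAL_Y = [1, 0, 1, 0]

# One hypothetical parameter set per candidate model.
PARAM_SETS = [
    {"lr": 0.01, "l2": 0.0},
    {"lr": 0.1, "l2": 0.0},
    {"lr": 0.5, "l2": 0.1},
]


def fit_and_score(params):
    """Train one candidate and score it on the held-out validation set."""
    model = train(TRAIN_X, TRAIN_Y, **params)
    return params, accuracy(model, VAL_X, VAL_Y), model


# Train the candidates concurrently, then keep the best-scoring one.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(fit_and_score, PARAM_SETS))

best_params, best_score, best_model = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

Threads keep the sketch short, but CPU-bound training in Python would normally use processes or separate machines to run in parallel. When scores tie, `max` returns the first candidate in `PARAM_SETS` order. Claims 2 and 14 describe a further step that this sketch leaves out: retraining a new model with the winning parameter set on the training and validation data together.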

Assignees

23andMe Inc

Inventors

Classifications

  • G16B40/00 (Primary)

    ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding · CPC title

  • Machine learning · CPC title

  • for calculating health indices; for individual health risk assessment · CPC title

  • ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks · CPC title

  • for mining of medical data, e.g. analysing previous cases of other patients · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025266129A1 cover?
The disclosed embodiments concern methods, apparatus, systems, and computer program products for developing polygenic risk score (PRS) models. In some implementations, a fully automated process is provided that allows for a PRS model to be defined by an initial set of parameters. In some implementations the PRS models are trained to provide a PRS for particular populations.
Who is the assignee on this patent?
23andMe Inc
What technology area does this patent fall under?
Primary CPC classification G16B40/00. Mapped technology areas include Physics.
When was this patent published?
Published Aug 21, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).