Content item selection for goal achievement
US-12175387-B2 · Dec 24, 2024 · US
US2024055100A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024055100-A1 |
| Application number | US-202218146123-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 23, 2022 |
| Priority date | Aug 15, 2022 |
| Publication date | Feb 15, 2024 |
| Grant date | — |
Official abstract text for this publication.
This disclosure provides a machine learning technique to predict a protein characteristic. A first training set is created that includes, for multiple proteins, a target feature, protein sequences, and other information about the proteins. A first machine learning model is trained and then used to identify which of the features are relevant as determined by feature importance or causal relationships to the target feature. A second training set is created with only the relevant features. Embeddings generated from the protein sequences are also added to the second training set. The second training set is used to train a second machine learning model. The first and second machine learning models may be any type of regressors. Once trained, the second machine learning model is used to predict a value for the target feature for an uncharacterized protein. The model of this disclosure provides 91% accuracy in predicting an ideal digestibility score.
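The two-stage training flow in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The data are synthetic, and a least-squares linear regressor with permutation importance stands in for the tree-based regressors and Shapley or causal analysis named in the claims. All names (`fit_linear`, `permutation_importance`, `predict_target`, the 10% threshold) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 "proteins", 6 tabular features
# (physicochemical plus other information) and 4-dimensional embeddings.
n_proteins, n_tabular, n_embed = 200, 6, 4
tabular = rng.normal(size=(n_proteins, n_tabular))
embeddings = rng.normal(size=(n_proteins, n_embed))
# By construction the target depends only on tabular features 0 and 2
# and on embedding dimension 1.
target = (3.0 * tabular[:, 0] - 2.0 * tabular[:, 2]
          + 1.5 * embeddings[:, 1]
          + rng.normal(scale=0.1, size=n_proteins))


def fit_linear(X, y):
    """Fit a least-squares regressor with an intercept; return its weights."""
    X1 = np.column_stack([X, np.ones(len(X))])
    weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return weights


def predict_linear(weights, X):
    """Predict with weights produced by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ weights


def mse(weights, X, y):
    return float(np.mean((predict_linear(weights, X) - y) ** 2))


def permutation_importance(weights, X, y, rng):
    """Increase in mean squared error when each column is shuffled."""
    base = mse(weights, X, y)
    scores = []
    for j in range(X.shape[1]):
        shuffled = X.copy()
        shuffled[:, j] = rng.permutation(shuffled[:, j])
        scores.append(mse(weights, shuffled, y) - base)
    return np.array(scores)


# Stage 1: train on the tabular features only, then rank them.
w1 = fit_linear(tabular, target)
importance = permutation_importance(w1, tabular, target, rng)
# Hypothetical rule: keep features with at least 10% of the top importance.
relevant = np.flatnonzero(importance >= 0.1 * importance.max())

# Stage 2: retrain on the relevant features plus the sequence embeddings.
X2 = np.column_stack([tabular[:, relevant], embeddings])
w2 = fit_linear(X2, target)
mse_stage1 = mse(w1, tabular, target)
mse_stage2 = mse(w2, X2, target)


def predict_target(tabular_row, embedding_row):
    """Predict the target feature for one uncharacterized protein."""
    x = np.concatenate([tabular_row[relevant], embedding_row])[None, :]
    return float(predict_linear(w2, x)[0])
```

Calling `predict_target(tabular[0], embeddings[0])` returns a single predicted value. On this toy data the second model fits better than the first only because the synthetic target was built to depend on an embedding dimension. That says nothing about the accuracy figure reported in the abstract.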
Opening claim text (preview).
1. A method comprising: receiving an indication of a protein sequence; obtaining other information for the protein sequence; determining physiochemical features from the protein sequence; generating embeddings from the protein sequence; providing the other information, the physiochemical features, and the embeddings to a trained machine learning model that is trained on a plurality of proteins with known values for a target feature; and generating, by the trained machine learning model, a predicted value of the target feature for the protein sequence.
2. The method of claim 1, further comprising, wherein the indication of the protein sequence is received from a computing device, providing the predicted value of the target feature to the computing device.
3. The method of claim 1, wherein the other information comprises nutritional information of a food item that contains a protein with the protein sequence.
4. The method of claim 3, wherein the nutritional information comprises energy content, dietary fiber amount, fat quantity, ash quantity, total sugar, calcium content, phosphorus content, sodium content, zinc content, copper content, and iron content.
5. The method of claim 1, wherein the physiochemical features comprise at least one of amino acid composition, pK value of carboxyl group, average weighed atomic number, degree based on atomic number in a graph, linker propensity from small dataset, normalized positional residue frequency at helix termini C, or hydration number.
6. The method of claim 1, wherein the embeddings are created by a transformer model.
7. The method of claim 1, wherein the trained machine learning model is a regressor.
8. The method of claim 1, wherein the target feature is digestibility, texture, or flavor.
9. A method comprising: for each of a plurality of proteins, obtaining a protein sequence, a value for a target feature, and other information; creating a first training set from physiochemical features determined from the protein sequence, the value for the target feature, and the other information; training a first machine learning model using the first training set; identifying a subset of features used to train the first machine learning model as relevant features; generating embeddings from the protein sequence; creating a second training set from the relevant features and the embeddings; and training a second machine learning model with the second training set.
10. The method of claim 9, wherein the target feature is digestibility, texture, or flavor.
11. The method of claim 9, wherein the other information comprises nutritional information of food items that respectively contain one or more of the plurality of proteins.
12. The method of claim 9, wherein the physiochemical features comprise amino acid composition, pK value of carboxyl group, average weighed atomic number, degree based on atomic number in a graph, linker propensity from small dataset, normalized positional residue frequency at helix termini C, and hydration number.
13. The method of claim 9, wherein the first machine learning model comprises decision trees, random forest, or gradient boosting.
14. The method of claim 9, wherein identifying the subset of features that are the relevant features uses feature importance or causal relationships.
15. The method of claim 14, wherein identifying the subset of features that are the relevant features uses feature importance determined by Shapley values or causal relationships determined by Conditional Average Treatment Effect (CATE) or Individual Treatment Effect (ITE).
16. The method of claim 9, wherein the embeddings are generated by a transformer model.
17. The method of claim 9, wherein the second machine learning model is the same as the first machine learning model.
18. The method of claim 9, further comprising: receiving an indication of an uncharacterized protein; obtaining relevant other information for the uncharacterized protein; determining relevant physiochemical features from the sequence of the uncharacterized protein; generating embeddings from the uncharacterized protein; providing the relevant other information, the relevant physiochemical features, and the embeddings to the second machine learning model; and generating, by the second machine learning model, a predicted value of the target feature for the uncharacterized protein.
19. A system comprising: one or more processing units; computer-readable media storing instructions; a feature extraction engine, implemented through execution of the instructions by the one or more processing units, configured to determine physiochemical features from a protein sequence; a first training set comprising, for each of a plurality of proteins, a value for a target feature, other information, and the physiochemical features; a first machine learning model, implemented through execution of the instructions by the one or more processing units, configured to learn a first correlation between the value for the target feature, the other information, and the physiochemical features; an embeddings engine, implemented through execution of the instructions by the one or more processing units, configured to generate embeddings from the protein sequence; a feature importance engine, implemented through execution of the instructions by the one or more processing units, configured to identify features used to train the first machine learning model that have at least a threshold importance to the predictive power of the first machine learning model; a second training set comprising, for each of the plurality of proteins, the value for a target feature, a subset of the features that have at least the threshold importance and a causal relationship to the target feature; and a second machine learning model, implemented through execution of the instructions by the one or more processing units, configured to learn a second correlation between the value for the target feature, the subset of the features, and the embeddings.
20. The system of claim 19, further comprising a network interface configured to receive an indication of an uncharacterized protein from a computing device and return to the computing device a predicted value for the target feature, the predicted value determined by providing the uncharacterized protein to the second machine learning model.
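Of the physiochemical features listed in claims 5 and 12, amino acid composition is the simplest to derive directly from a sequence. The sketch below assumes composition means the fraction of each of the 20 standard amino acids in a one-letter-code sequence. The function name, the handling of non-standard letters, and the example sequence are hypothetical and do not come from the patent.

```python
from collections import Counter
from typing import Dict

# One-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence.

    Letters outside the 20 standard codes are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid letters")
    return {aa: counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


# Hypothetical 10-residue sequence used only for illustration.
composition = amino_acid_composition("MKTAYIAKQR")
```

The result is a fixed-length vector of 20 fractions, so it can sit in a training table alongside the nutritional values that the claims call "other information".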