Machine learning solution to predict protein characteristics

US2024055100A1 · US · A1

Patent metadata
Publication number: US-2024055100-A1
Application number: US-202218146123-A
Country: US
Kind code: A1
Filing date: Dec 23, 2022
Priority date: Aug 15, 2022
Publication date: Feb 15, 2024
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure provides a machine learning technique to predict a protein characteristic. A first training set is created that includes, for multiple proteins, a target feature, protein sequences, and other information about the proteins. A first machine learning model is trained and then used to identify which of the features are relevant as determined by feature importance or causal relationships to the target feature. A second training set is created with only the relevant features. Embeddings generated from the protein sequences are also added to the second training set. The second training set is used to train a second machine learning model. The first and second machine learning models may be any type of regressors. Once trained, the second machine learning model is used to predict a value for the target feature for an uncharacterized protein. The model of this disclosure provides 91% accuracy in predicting an ideal digestibility score.
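The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustrative toy, not the disclosed implementation: the function names, the data, and the importance threshold are all hypothetical. A 1-nearest-neighbor regressor stands in for both models (the abstract allows any type of regressor), a permutation-style importance using a fixed column reversal stands in for the feature-importance step (the claims mention Shapley values, which are not implemented here), and the embeddings are hand-written placeholder vectors rather than output from a transformer model.

```python
from typing import List, Sequence


def nearest_neighbor_predict(train_x: Sequence[Sequence[float]],
                             train_y: Sequence[float],
                             query: Sequence[float]) -> float:
    """Toy regressor: return the target of the closest training row."""
    def sq_dist(row: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, query))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))
    return train_y[best]


def mean_squared_error(train_x, train_y, rows) -> float:
    """MSE of the toy regressor on `rows`, scored against `train_y`."""
    errors = [(nearest_neighbor_predict(train_x, train_y, r) - y) ** 2
              for r, y in zip(rows, train_y)]
    return sum(errors) / len(errors)


def permutation_importance(train_x, train_y) -> List[float]:
    """Error increase when one feature column is permuted.

    Assumption: the column is reversed rather than randomly shuffled so
    the sketch is deterministic, and it is scored on the training rows;
    a real pipeline would shuffle and use held-out data.
    """
    baseline = mean_squared_error(train_x, train_y, train_x)
    importances = []
    for j in range(len(train_x[0])):
        reversed_col = [row[j] for row in reversed(train_x)]
        permuted = [list(row[:j]) + [v] + list(row[j + 1:])
                    for row, v in zip(train_x, reversed_col)]
        importances.append(
            mean_squared_error(train_x, train_y, permuted) - baseline)
    return importances


def build_second_training_set(train_x, relevant, embeddings):
    """Keep only the relevant feature columns, then append the embeddings."""
    return [[row[j] for j in relevant] + list(emb)
            for row, emb in zip(train_x, embeddings)]


# Hypothetical first training set: feature 0 drives the target,
# feature 1 is small noise. Targets are made-up scores.
first_x = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.0], [3.0, 0.1]]
target = [0.0, 1.0, 2.0, 3.0]
# Placeholder embeddings, one vector per protein.
embeddings = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]

importances = permutation_importance(first_x, target)
relevant = [j for j, score in enumerate(importances) if score >= 1.0]
second_x = build_second_training_set(first_x, relevant, embeddings)

# Predict for an uncharacterized protein: relevant feature + embedding.
prediction = nearest_neighbor_predict(second_x, target, [2.1, 0.2, 0.8])
print(importances, relevant, prediction)
```

In this toy data only feature 0 clears the threshold, so the second training set has one retained feature column plus the two embedding dimensions. The threshold of 1.0 is arbitrary and chosen only for the example.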

First claim

Opening claim text (preview).

1. A method comprising: receiving an indication of a protein sequence; obtaining other information for the protein sequence; determining physiochemical features from the protein sequence; generating embeddings from the protein sequence; providing the other information, the physiochemical features, and the embeddings to a trained machine learning model that is trained on a plurality of proteins with known values for a target feature; and generating, by the trained machine learning model, a predicted value of the target feature for the protein sequence.
2. The method of claim 1, further comprising, wherein the indication of the protein sequence is received from a computing device, providing the predicted value of the target feature to the computing device.
3. The method of claim 1, wherein the other information comprises nutritional information of a food item that contains a protein with the protein sequence.
4. The method of claim 3, wherein the nutritional information comprises energy content, dietary fiber amount, fat quantity, ash quantity, total sugar, calcium content, phosphorus content, sodium content, zinc content, copper content, and iron content.
5. The method of claim 1, wherein the physiochemical features comprise at least one of amino acid composition, pK value of carboxyl group, average weighed atomic number, degree based on atomic number in a graph, linker propensity from small dataset, normalized positional residue frequency at helix termini C, or hydration number.
6. The method of claim 1, wherein the embeddings are created by a transformer model.
7. The method of claim 1, wherein the trained machine learning model is a regressor.
8. The method of claim 1, wherein the target feature is digestibility, texture, or flavor.
9. A method comprising: for each of a plurality of proteins, obtaining a protein sequence, a value for a target feature, and other information; creating a first training set from physiochemical features determined from the protein sequence, the value for the target feature, and the other information; training a first machine learning model using the first training set; identifying a subset of features used to train the first machine learning model as relevant features; generating embeddings from the protein sequence; creating a second training set from the relevant features and the embeddings; and training a second machine learning model with the second training set.
10. The method of claim 9, wherein the target feature is digestibility, texture, or flavor.
11. The method of claim 9, wherein the other information comprises nutritional information of food items that respectively contain one or more of the plurality of proteins.
12. The method of claim 9, wherein the physiochemical features comprise amino acid composition, pK value of carboxyl group, average weighed atomic number, degree based on atomic number in a graph, linker propensity from small dataset, normalized positional residue frequency at helix termini C, and hydration number.
13. The method of claim 9, wherein the first machine learning model comprises decision trees, random forest, or gradient boosting.
14. The method of claim 9, wherein identifying the subset of features that are the relevant features uses feature importance or causal relationships.
15. The method of claim 14, wherein identifying the subset of features that are the relevant features uses feature importance determined by Shapley values or causal relationships determined by Conditional Average Treatment Effect (CATE) or Individual Treatment Effect (ITE).
16. The method of claim 9, wherein the embeddings are generated by a transformer model.
17. The method of claim 9, wherein the second machine learning model is the same as the first machine learning model.
18. The method of claim 9, further comprising: receiving an indication of an uncharacterized protein; obtaining relevant other information for the uncharacterized protein; determining relevant physiochemical features from the sequence of the uncharacterized protein; generating embeddings from the uncharacterized protein; providing the relevant other information, the relevant physiochemical features, and the embeddings to the second machine learning model; and generating, by the second machine learning model, a predicted value of the target feature for the uncharacterized protein.
19. A system comprising: one or more processing units; computer-readable media storing instructions; a feature extraction engine, implemented through execution of the instructions by the one or more processing units, configured to determine physiochemical features from a protein sequence; a first training set comprising, for each of a plurality of proteins, a value for a target feature, other information, and the physiochemical features; a first machine learning model, implemented through execution of the instructions by the one or more processing units, configured to learn a first correlation between the value for the target feature, the other information, and the physiochemical features; an embeddings engine, implemented through execution of the instructions by the one or more processing units, configured to generate embeddings from the protein sequence; a feature importance engine, implemented through execution of the instructions by the one or more processing units, configured to identify features used to train the first machine learning model that have at least a threshold importance to the predictive power of the first machine learning model; a second training set comprising, for each of the plurality of proteins, the value for a target feature, a subset of the features that have at least the threshold importance and a causal relationship to the target feature; and a second machine learning model, implemented through execution of the instructions by the one or more processing units, configured to learn a second correlation between the value for the target feature, the subset of the features, and the embeddings.
20. The system of claim 19, further comprising a network interface configured to receive an indication of an uncharacterized protein from a computing device and return to the computing device a predicted value for the target feature, the predicted value determined by providing the uncharacterized protein to the second machine learning model.
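As a rough illustration of the inference steps in claim 1, the sketch below computes one of the listed physiochemical features (amino acid composition) and concatenates it with the other information and the embeddings before calling a model. The function names are hypothetical, and both the embedding function and the model are passed in as plain callables because the claims do not specify them beyond "a transformer model" and "a regressor"; the toy callables in the usage lines are placeholders only.

```python
from collections import Counter
from typing import Callable, List, Sequence

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    unknown = set(counts) - set(AMINO_ACIDS)
    if not seq or unknown:
        raise ValueError(f"empty sequence or unsupported residues: {unknown}")
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def predict_target(sequence: str,
                   other_info: Sequence[float],
                   embed: Callable[[str], Sequence[float]],
                   model: Callable[[Sequence[float]], float]) -> float:
    """Assemble other information, physiochemical features, and embeddings
    into one input vector and pass it to the trained model."""
    features = (list(other_info)
                + amino_acid_composition(sequence)
                + list(embed(sequence)))
    return model(features)


# Usage with placeholder callables (assumptions, not real models):
composition = amino_acid_composition("AAC")
toy_embed = lambda seq: [0.5]                    # stands in for an embedding model
toy_model = lambda features: float(len(features))  # stands in for a regressor
result = predict_target("AAC", [1.0], toy_embed, toy_model)
print(composition[:2], result)
```

Keeping the embedding function and model as parameters means the same assembly step could be reused with whichever embedding model and regressor were actually trained.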

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G16H20/60 (Primary)

    relating to nutrition control, e.g. diets · CPC title

  • Supervised data analysis · CPC title

  • Protein or domain folding · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024055100A1 cover?
This disclosure provides a machine learning technique to predict a protein characteristic. A first training set is created that includes, for multiple proteins, a target feature, protein sequences, and other information about the proteins. A first machine learning model is trained and then used to identify which of the features are relevant as determined by feature importance or causal relation…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G16H20/60. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 15, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).