Method of establishing cancer screening module, using method and platform thereof
US-2024402147-A1 · Dec 5, 2024 · US
US2025201349A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025201349-A1 |
| Application number | US-202418975313-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 10, 2024 |
| Priority date | Dec 11, 2023 |
| Publication date | Jun 19, 2025 |
| Grant date | — |
Official abstract text for this publication.
Techniques for training and using a machine learning model to perform microprotein prediction. One computer-implemented method includes accessing a set of data describing expressed amino acid sequences, and generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data. The method then includes training a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications. The trained machine learning model is usable to receive an input that describes a structure of a particular amino acid sequence, and perform a classification of the particular amino acid sequence relative to the set of classifications.
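The decoy-generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the "properties" being matched are sequence length and pooled codon probabilities (the options named in claim 8), that the expressed sequences are available as coding DNA strings, and that the standard genetic code applies. The names `sense_codons`, `make_decoys`, and the example sequences are hypothetical.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def sense_codons(cds):
    """Split a coding sequence into in-frame codons, dropping stops and unknowns."""
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return [c for c in codons if CODON_TABLE.get(c, "*") != "*"]


def make_decoys(cds_list, seed=0):
    """Return one randomized amino acid sequence per input coding sequence.

    Assumption: each decoy keeps the length of its source sequence, and its
    codons are drawn from the codon frequencies pooled over all inputs.
    """
    rng = random.Random(seed)
    per_sequence = [sense_codons(cds) for cds in cds_list]
    counts = Counter(c for codons in per_sequence for c in codons)
    population = sorted(counts)
    weights = [counts[c] for c in population]

    decoys = []
    for codons in per_sequence:
        if not codons:
            decoys.append("")  # nothing to match against
            continue
        sampled = rng.choices(population, weights=weights, k=len(codons))
        decoys.append("".join(CODON_TABLE[c] for c in sampled))
    return decoys


if __name__ == "__main__":
    expressed = ["ATGGCCAAATAA", "ATGTTTGGGCCCTGA"]  # made-up example inputs
    for cds, decoy in zip(expressed, make_decoys(expressed)):
        print(len(sense_codons(cds)), len(decoy), decoy)
```

In this sketch the decoys would be labeled as negative examples and combined with the labeled sequences before training. Sampling codons rather than amino acids directly is one way to respect codon probabilities; other matching schemes are equally consistent with the abstract.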
Opening claim text (preview).
What is claimed is:
1. A method, comprising: accessing, by a computer system, a set of data describing expressed amino acid sequences; generating, by the computer system, decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training, by the computer system, a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications; wherein the trained machine learning model is usable to: receive an input that describes a structure of a particular amino acid sequence; and perform a classification of the particular amino acid sequence relative to the set of classifications.
2. The method of claim 1, wherein the set of data includes information describing unlabeled amino acid sequences having unknown classifications.
3. The method of claim 2, wherein the labeled training data includes proteins from Swiss-Prot, and wherein the unlabeled amino acid sequences include Ribo-Seq derived ORFs, including GENCODE small open reading frames (smORFs) having unknown classifications.
4. The method of claim 2, further comprising: performing feature extraction on amino acid sequences described in the labeled training data and the decoy training data; and labeling, based on information derived from the feature extraction, the unlabeled amino acid sequences in the set of data to create additional labeled training data; wherein the machine learning model is also trained using the additional labeled training data.
5. The method of claim 4, wherein extracted features resulting from the feature extraction include nucleotide features and amino acid features.
6. The method of claim 5, wherein the nucleotide features include one or more of 4-mers of the 5′ UTR, 3′ UTR, first 50 CDS.
7. The method of claim 5, wherein the amino acid features include one or more of CTDD, CTD, APAAC, QSO.
8. The method of claim 1, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the expressed amino acid sequences included in the set of data.
9. The method of claim 1, wherein classes in the set of classifications include intracellular, secreted, and negative.
10. The method of claim 1, wherein the trained machine learning model is usable to: perform the classification of the particular amino acid sequence relative to the set of classifications by generating probabilities that the particular amino acid sequence is in one or more of the set of classifications.
11. The method of claim 1, wherein the particular amino acid sequence includes 150 or fewer amino acids.
12. The method of claim 1, further comprising: receiving, by the trained machine learning model, a description of the particular amino acid sequence; performing, by the trained machine learning model, a classification of the particular amino acid sequence relative to the set of classifications.
13. A non-transitory computer-readable medium storing program instructions executable by a computer system to perform operations in a training mode that include: accessing labeled training data that describes amino acid sequences with known classifications; generating decoy training data that includes randomized amino acid sequences with properties that are matched to properties of actual amino acid sequences, the decoy training data constituting negative training examples; and training a machine learning model using the labeled training data, and the decoy training data, the trained machine learning model being usable to classify unknown amino acid sequences into one of a set of classifications.
14. The non-transitory, computer-readable medium of claim 13, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
15. The non-transitory, computer-readable medium of claim 14, wherein the actual amino acid sequences include GENCODE small open reading frames (smORFs) having unknown classifications.
16. The non-transitory, computer-readable medium of claim 13, wherein the program instructions are executable by the computer system to perform operations in a prediction mode that include: receiving an input that describes a structure of a particular amino acid sequence; and using the trained machine learning model to perform a classification of the particular amino acid sequence into one of a set of classifications.
17. The non-transitory, computer-readable medium of claim 16, wherein the set of classifications are customizable by a user.
18. A system, comprising: one or more processor circuits; memory storing program instructions executable by the one or more processor circuits to perform operations including, in a training mode: accessing a set of data describing expressed amino acid sequences; generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training a machine learning model using a labeled training set and the decoy training set, the labeled training set including proteins with known classes and the decoy training set constituting negative training examples, the trained machine learning model being usable to classify unknown proteins.
19. The system of claim 18, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
20. The system of claim 18, the operations further including, in a prediction mode: receiving an input that describes a particular amino acid sequence; using the trained machine learning model to perform a classification of the particular amino acid sequence into one of a set of classifications.
21. A method, comprising: receiving, by a computer system, an input that describes an amino acid sequence of unknown classification; using, by the computer system, a machine learning model to perform a classification of the amino acid sequence into one of a set of classifications; wherein the machine learning model is trained by a computer-implemented process that includes: accessing a set of data describing expressed amino acid sequences; generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications.
22. The method of claim 21, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
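The nucleotide features named in claim 6 (4-mers of the 5′ UTR, the 3′ UTR, and the first 50 CDS) can be illustrated with a short sketch. It rests on several assumptions the claims do not settle: "first 50 CDS" is read here as the first 50 nucleotides of the coding sequence, 4-mers are counted with overlapping windows and normalized to frequencies, and the three regions are concatenated into one vector. The function names and example sequences are hypothetical.

```python
from itertools import product

K = 4
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 256 possible 4-mers


def kmer_frequencies(seq, k=K):
    """Return overlapping k-mer frequencies in the fixed ALL_KMERS order.

    Windows containing characters other than A, C, G, T are skipped.
    A region with no valid window yields an all-zero vector.
    """
    counts = dict.fromkeys(ALL_KMERS, 0)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(ALL_KMERS)
    return [counts[kmer] / total for kmer in ALL_KMERS]


def nucleotide_features(utr5, utr3, cds, cds_prefix=50):
    """Concatenate 4-mer frequencies for the 5' UTR, 3' UTR, and CDS prefix.

    Assumption: cds_prefix is measured in nucleotides, not codons.
    """
    return (
        kmer_frequencies(utr5)
        + kmer_frequencies(utr3)
        + kmer_frequencies(cds[:cds_prefix])
    )


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    features = nucleotide_features("GGCACGTACGTAA", "TTTTAAACCC", "ATGGCCAAATTTGGG" * 5)
    print(len(features))
```

A vector like this would be combined with the amino acid descriptors of claim 7 (CTDD, CTD, APAAC, QSO) before being passed to a classifier that outputs per-class probabilities; those descriptors and the classifier itself are outside the scope of this sketch.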