System and method for predicting microproteins

US2025201349A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025201349-A1
Application number: US-202418975313-A
Country: US
Kind code: A1
Filing date: Dec 10, 2024
Priority date: Dec 11, 2023
Publication date: Jun 19, 2025
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for training and using a machine learning model to perform microprotein prediction. One computer-implemented method includes, accessing a set of data describing expressed amino acid sequences, and generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data. The method then includes training a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications. The trained machine learning model is usable to receive an input that describes a structure of a particular amino acid sequence, and perform a classification of the particular amino acid sequence relative to the set of classifications.
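The decoy-generation step in the abstract can be illustrated with a minimal sketch. It assumes the "properties" being matched are sequence length and codon probabilities (the criteria named in the dependent claims), that codon probabilities are estimated by pooling codon counts over the coding sequences of the expressed ORFs, and that each decoy position is sampled independently. The names `sense_codons` and `make_decoys` are hypothetical and are not taken from the patent.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def sense_codons(cds):
    """Split a coding sequence into codons, dropping stop and unrecognized codons."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return [c for c in codons if CODON_TABLE.get(c, "*") != "*"]


def make_decoys(cds_list, seed=0):
    """Return one randomized peptide per input CDS.

    Assumption: each decoy has the same number of residues as the real
    peptide, and its codons are drawn independently from the pooled
    codon frequencies of all input sequences.
    """
    rng = random.Random(seed)
    per_seq = [sense_codons(cds) for cds in cds_list]
    counts = Counter(c for codons in per_seq for c in codons)
    pool = sorted(counts)
    weights = [counts[c] for c in pool]
    decoys = []
    for codons in per_seq:
        # Guard against empty inputs, for which there is nothing to sample.
        sampled = rng.choices(pool, weights=weights, k=len(codons)) if codons else []
        decoys.append("".join(CODON_TABLE[c] for c in sampled))
    return decoys


if __name__ == "__main__":
    example_cds = ["ATGGCTGCTAAATAA", "ATGTTTGGGTGA"]  # toy inputs, not real ORFs
    for real, decoy in zip(example_cds, make_decoys(example_cds, seed=1)):
        print(len(sense_codons(real)), len(decoy), decoy)
```

Sampling from pooled codon frequencies keeps the amino acid composition of the decoys close to that of the expressed sequences while scrambling residue order, which is one plausible reading of "randomized amino acid sequences that are matched to properties" of the real data. Other schemes, such as per-sequence shuffling, would also fit that wording.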

First claim

Opening claim text (preview).

What is claimed is:
1. A method, comprising: accessing, by a computer system, a set of data describing expressed amino acid sequences; generating, by the computer system, decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training, by the computer system, a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications; wherein the trained machine learning model is usable to: receive an input that describes a structure of a particular amino acid sequence; and perform a classification of the particular amino acid sequence relative to the set of classifications.
2. The method of claim 1, wherein the set of data includes information describing unlabeled amino acid sequences having unknown classifications.
3. The method of claim 2, wherein the labeled training data includes proteins from Swiss-Prot, and wherein the unlabeled amino acid sequences include Ribo-Seq derived ORFs, including GENCODE small open reading frames (smORFs) having unknown classifications.
4. The method of claim 2, further comprising: performing feature extraction on amino acid sequences described in the labeled training data and the decoy training data; and labeling, based on information derived from the feature extraction, the unlabeled amino acid sequences in the set of data to create additional labeled training data; wherein the machine learning model is also trained using the additional labeled training data.
5. The method of claim 4, wherein extracted features resulting from the feature extraction include nucleotide features and amino acid features.
6. The method of claim 5, wherein the nucleotide features include one or more of 4-mers of the 5′ UTR, 3′ UTR, first 50 CDS.
7. The method of claim 5, wherein the amino acid features include one or more of CTDD, CTD, APAAC, QSO.
8. The method of claim 1, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the expressed amino acid sequences included in the set of data.
9. The method of claim 1, wherein classes in the set of classifications include intracellular, secreted, and negative.
10. The method of claim 1, wherein the trained machine learning model is usable to: perform the classification of the particular amino acid sequence relative to the set of classifications by generating probabilities that the particular amino acid sequence is in one or more of the set of classifications.
11. The method of claim 1, wherein the particular amino acid sequence includes 150 or fewer amino acids.
12. The method of claim 1, further comprising: receiving, by the trained machine learning model, a description of the particular amino acid sequence; performing, by the trained machine learning model, a classification of the particular amino acid sequence relative to the set of classifications.
13. A non-transitory computer-readable medium storing program instructions executable by a computer system to perform operations in a training mode that include: accessing labeled training data that describes amino acid sequences with known classifications; generating decoy training data that includes randomized amino acid sequences with properties that are matched to properties of actual amino acid sequences, the decoy training data constituting negative training examples; and training a machine learning model using the labeled training data, and the decoy training data, the trained machine learning model being usable to classify unknown amino acid sequences into one of a set of classifications.
14. The non-transitory, computer-readable medium of claim 13, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
15. The non-transitory, computer-readable medium of claim 14, wherein the actual amino acid sequences include GENCODE small open reading frames (smORFs) having unknown classifications.
16. The non-transitory, computer-readable medium of claim 13, wherein the program instructions are executable by the computer system to perform operations in a prediction mode that include: receiving an input that describes a structure of a particular amino acid sequence; and using the trained machine learning model to perform a classification of the particular amino acid sequence into one of a set of classifications.
17. The non-transitory, computer-readable medium of claim 16, wherein the set of classifications are customizable by a user.
18. A system, comprising: one or more processor circuits; memory storing program instructions executable by the one or more processor circuits to perform operations including, in a training mode: accessing a set of data describing expressed amino acid sequences; generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training a machine learning model using a labeled training set and the decoy training set, the labeled training set including proteins with known classes and the decoy training set constituting negative training examples, the trained machine learning model being usable to classify unknown proteins.
19. The system of claim 18, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
20. The system of claim 18, the operations further including, in a prediction mode: receiving an input that describes a particular amino acid sequence; using the trained machine learning model to perform a classification of the particular amino acid sequence into one of a set of classifications.
21. A method, comprising: receiving, by a computer system, an input that describes an amino acid sequence of unknown classification; using, by the computer system, a machine learning model to perform a classification of the amino acid sequence into one of a set of classifications; wherein the machine learning model is trained by a computer-implemented process that includes: accessing a set of data describing expressed amino acid sequences; generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data; and training a machine learning model using labeled training data and the decoy training data, the labeled training data including amino acid sequences within a set of classifications and the decoy training data constituting examples of amino acid sequences that are not expected to be within the set of classifications.
22. The method of claim 21, wherein the randomized amino acid sequences included in the decoy training data are matched to lengths and codon probabilities of the actual amino acid sequences.
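Two of the dependent claims lend themselves to a short illustration: the 4-mer nucleotide features over the 5′ UTR, 3′ UTR, and start of the CDS, and the per-class probability output over the intracellular, secreted, and negative classes. The sketch below makes several assumptions. It reads "first 50 CDS" as the first 50 nucleotides of the coding sequence, normalizes 4-mer counts to frequencies within each region, and uses a softmax to turn arbitrary per-class scores into probabilities. The claims do not name a model architecture, so the scores here are placeholders, and the function names are hypothetical.

```python
import math
from itertools import product

K = 4
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 256 possible 4-mers
CLASSES = ("intracellular", "secreted", "negative")


def kmer_frequencies(seq):
    """Return the 256 normalized 4-mer frequencies of a nucleotide sequence.

    Windows containing characters other than A, C, G, T are skipped.
    A sequence with no valid window yields an all-zero vector.
    """
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    total = 0
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in KMERS]


def nucleotide_features(utr5, cds, utr3):
    """Concatenate 4-mer frequencies for the 5' UTR, first 50 nt of CDS, and 3' UTR.

    Assumption: "first 50 CDS" means the first 50 nucleotides of the CDS.
    """
    return kmer_frequencies(utr5) + kmer_frequencies(cds[:50]) + kmer_frequencies(utr3)


def class_probabilities(scores):
    """Map one raw score per class to probabilities with a numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return {name: e / norm for name, e in zip(CLASSES, exps)}


if __name__ == "__main__":
    features = nucleotide_features("ACGTACGT", "ATG" + "GCT" * 20, "TTTTT")
    print(len(features))
    print(class_probabilities([2.0, 1.0, 0.0]))  # placeholder scores, not model output
```

In practice the feature vector would be combined with the amino acid descriptors listed in claim 7 (CTDD, CTD, APAAC, QSO) before being passed to whatever classifier is trained; those descriptors are omitted here for brevity.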

Assignees

Res Found Dev

Inventors

Classifications

  • G16B40/20 (Primary)

    Supervised data analysis · CPC title

  • G16B30/00 (Primary)

    ICT specially adapted for sequence analysis involving nucleotides or amino acids · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025201349A1 cover?
Techniques for training and using a machine learning model to perform microprotein prediction. One computer-implemented method includes, accessing a set of data describing expressed amino acid sequences, and generating decoy training data that includes randomized amino acid sequences that are matched to properties of the amino acid sequences included in the set of data. The method then includes…
Who is the assignee on this patent?
Res Found Dev
What technology area does this patent fall under?
Primary CPC classification G16B40/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 19, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).