Systems, methods, and apparatuses to predict protein sequence and structure

US11923044B1 · US · B1

Patent metadata
Publication number: US-11923044-B1
Application number: US-202016896907-A
Country: US
Kind code: B1
Filing date: Jun 9, 2020
Priority date: Jun 9, 2020
Publication date: Mar 5, 2024
Grant date: Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for predicting a protein sequence are described. An exemplary method includes receiving a request to predict a missing area of a protein's primary sequence and a corresponding three-dimensional position of the missing area; applying a machine learning model to backbone Cartesian coordinates of the protein's primary sequence and a protein vector of a representation of the protein's primary sequence including the missing area to predict a missing area of the protein primary sequence and a corresponding three-dimensional position for the missing area, wherein the machine learning model is selected from the group consisting of: an attention-based machine learning model, a bidirectional long short term memory-based model, and a convolutional neural network-based model; and outputting a result of the machine learning model.
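The data flow in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the patented implementation: the mask character `#`, the vector sizes, the choice of concatenation as the way to combine the coordinate and protein vectors, the single-head attention with identity projections, and every function name are assumptions made for illustration. The weights are random and untrained, so the predicted residue and position are meaningless; only the shape of the pipeline is shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard IUPAC one-letter codes
MASK = "#"                            # hypothetical token for a missing residue
VOCAB = {ch: i for i, ch in enumerate(AMINO_ACIDS + MASK)}


def rand_matrix(rows, cols, rng):
    """Random (untrained) weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Multiply a length-r vector by an r x c matrix, giving a length-c list."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, rows)) for j in range(d)])
    return out


def predict_missing(sequence, coords, seed=0):
    """Return {index: (residue, (x, y, z))} for every masked position."""
    rng = random.Random(seed)
    d_seq, d_xyz = 8, 4                                        # assumed sizes
    embed = rand_matrix(len(VOCAB), d_seq, rng)                # token embedding
    w_xyz = rand_matrix(3, d_xyz, rng)                         # coordinate projection
    w_res = rand_matrix(d_seq + d_xyz, len(AMINO_ACIDS), rng)  # residue head
    w_pos = rand_matrix(d_seq + d_xyz, 3, rng)                 # position head

    combined = []
    for ch, xyz in zip(sequence, coords):
        protein_vec = embed[VOCAB[ch]]
        # Missing coordinates are zero-filled here (an assumption).
        coord_vec = matvec(xyz if xyz is not None else (0.0, 0.0, 0.0), w_xyz)
        combined.append(protein_vec + coord_vec)               # list concatenation

    hidden = self_attention(combined)
    predictions = {}
    for i, ch in enumerate(sequence):
        if ch == MASK:
            logits = matvec(hidden[i], w_res)
            residue = AMINO_ACIDS[logits.index(max(logits))]
            predictions[i] = (residue, tuple(matvec(hidden[i], w_pos)))
    return predictions


# Five residues with the third one missing from both sequence and structure.
seq = "MK#LV"
xyz = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), None, (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
result = predict_missing(seq, xyz)
```

A trained model would replace the random matrices with learned parameters, and the abstract also allows a bidirectional LSTM-based or convolutional network in place of the attention step.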

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving, at a protein sequence predictor comprising one or more processors, a request to predict a missing area of a protein primary sequence and a corresponding three-dimensional position of the missing area, the request including a representation of the protein primary sequence, backbone Cartesian coordinates for the protein primary sequence, and an indication of ablations in the protein primary sequence; conditioning the protein primary sequence and the backbone Cartesian coordinates for the protein primary sequence by: passing the representation of the protein primary sequence as input to an attention-based machine learning model of the protein sequence predictor and applying an embedding of the attention-based machine learning model to the representation of the protein primary sequence, obtaining output of a protein vector from the attention-based machine learning model, passing the backbone Cartesian coordinates as input to the protein sequence predictor to capture features in sequence space, and obtaining output of processed backbone Cartesian coordinates from the protein sequence predictor; combining the processed backbone Cartesian coordinates and the protein vector to generate a combined coordinate vector and protein vector; passing the combined coordinate vector and protein vector as input to the attention-based machine learning model; obtaining output of a prediction of the missing area of the protein primary sequence and the corresponding three-dimensional position of the missing area from the attention-based machine learning model; and generating a three-dimensional representation of the protein based on the output of the prediction of the missing area of the protein primary sequence and the corresponding three-dimensional position of the missing area.

2. The computer-implemented method of claim 1, wherein the representation of the protein primary sequence uses an amino acid code consistent with International Union of Pure and Applied Chemistry usage.

3. The computer-implemented method of claim 1, wherein the attention-based machine learning model is a transformer-based model.

4. A computer-implemented method comprising: receiving, at a protein sequence predictor comprising one or more processors, a request to predict a missing area of a protein primary sequence and a corresponding three-dimensional position of the missing area; passing as input to a machine learning model of the protein sequence predictor backbone Cartesian coordinates of the protein primary sequence and a protein vector of a representation of the protein primary sequence including the missing area, wherein the machine learning model is selected from the group consisting of: an attention-based machine learning model, a bidirectional long short term memory-based model, and a convolutional neural network-based model; and obtaining output of a prediction of the missing area of the protein primary sequence and the corresponding three-dimensional position of the missing area from the machine learning model.

5. The computer-implemented method of claim 4, further comprising: applying an embedding of the machine learning model to the representation of the protein primary sequence to generate the protein vector.

6. The computer-implemented method of claim 5, wherein the representation of the protein primary sequence is a character-based representation.

7. The computer-implemented method of claim 6, wherein characters of the character-based representation conform to an amino acid code consistent with International Union of Pure and Applied Chemistry usage.

8. The computer-implemented method of claim 6, wherein the request includes an indication of regions of ablation using a set of mask tokens in the representation of the protein primary sequence.

9. The computer-implemented method of claim 4, wherein the request includes processed backbone Cartesian coordinates for the protein primary sequence and an embedded representation of the protein primary sequence as the protein vector.

10. The computer-implemented method of claim 4, wherein the machine learning model is a transformer-based model.

11. The computer-implemented method of claim 4, wherein the machine learning model is a convolutional neural network-based model comprising a stack of residual block layers.

12. The computer-implemented method of claim 4, wherein the machine learning model is a long short term memory-based model comprising a stack of bidirectional long short term memory-based layers.

13. The computer-implemented method of claim 4, further comprising: generating a 3-D representation from the output of the machine learning model.

14. The computer-implemented method of claim 4, further comprising: combining the backbone Cartesian coordinates of the protein primary sequence and the protein vector of the representation of the protein primary sequence prior to passing the input to the machine learning model.

15. A system comprising: a first one or more electronic devices to implement a three-dimensional generation service in a multi-tenant provider network; and a second one or more electronic devices to implement a protein sequence predictor service in the multi-tenant provider network, the protein sequence predictor service including memory storing instructions that upon execution by one or more processors of the protein sequence predictor service, cause the protein sequence predictor service to: receive a request to predict a missing area of a protein primary sequence and a corresponding three-dimensional position of the missing area, pass as input to a machine learning model of the protein sequence predictor service backbone Cartesian coordinates of the protein primary sequence and a protein vector of a representation of the protein primary sequence including the missing area, wherein the machine learning model is selected from the group consisting of: an attention-based machine learning model, a bidirectional long short term memory-based model, and a convolutional neural network-based model, and obtain output of a prediction of the missing area of the protein primary sequence and the corresponding three-dimensional position of the missing area from the machine learning model, wherein the three-dimensional generation service is to generate a three-dimensional representation of the output.

16. The system of claim 15, wherein the protein sequence predictor service is to apply an embedding of the machine learning model to the representation of the protein primary sequence to generate the protein vector.

17. The system of claim 16, wherein the representation of the protein primary sequence is a character-based representation.

18. The system of claim 17, wherein characters of the character-based representation conform to an amino acid code consistent with International Union of Pure and Applied Chemistry usage.

19. The system of claim 15, wherein the request includes an indication of regions of ablation using a set of mask tokens in the representation of the protein primary sequence.

20. The system of claim 15, wherein the request includes processed backbone Cartesian coordinates for the protein primary sequence and an embedded representation of the protein primary sequence as the protein vector.
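Claims 1, 6 to 8, and 19 describe a request that carries a character-based sequence in IUPAC one-letter codes, backbone Cartesian coordinates, and mask tokens marking the ablated regions. A minimal sketch of assembling such a request follows. The claims do not specify a wire format, so the dictionary keys, the `#` mask character, the half-open `(start, end)` range convention, and the function name `build_request` are all hypothetical.

```python
IUPAC_CODES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter amino acid codes
MASK_TOKEN = "#"                           # hypothetical mask token


def build_request(sequence, coords, ablations):
    """Mask the ablated regions of a sequence and its backbone coordinates.

    sequence:  string of IUPAC one-letter codes
    coords:    one (x, y, z) tuple per residue
    ablations: list of half-open (start, end) index ranges to ablate
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coordinates must have the same length")
    unknown = sorted(set(sequence) - IUPAC_CODES)
    if unknown:
        raise ValueError(f"non-IUPAC characters in sequence: {unknown}")

    chars = list(sequence)
    masked_coords = list(coords)
    for start, end in ablations:
        if not 0 <= start < end <= len(chars):
            raise ValueError(f"ablation range out of bounds: {(start, end)}")
        for i in range(start, end):
            chars[i] = MASK_TOKEN      # hide the residue identity
            masked_coords[i] = None    # hide its three-dimensional position

    return {
        "sequence": "".join(chars),
        "backbone_coordinates": masked_coords,
        "ablations": list(ablations),
    }


# Ablate residues 1 and 2 of a five-residue toy sequence.
request = build_request(
    "MKTLV",
    [(float(i), 0.0, 0.0) for i in range(5)],
    [(1, 3)],
)
```

In this sketch the masked positions are the ones a predictor would be asked to fill in with both a residue and a position.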

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G16B15/00 (Primary)

    ICT specially adapted for analysing two-dimensional [2D] or three-dimensional [3D] molecular structures, e.g. structural or functional relations or structure alignment

  • Sequence assembly

  • ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

  • G16B40/20 (Primary)

    Supervised data analysis

  • Protein or domain folding

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11923044B1 cover?
Techniques for predicting a protein sequence are described. An exemplary method includes receiving a request to predict a missing area of a protein's primary sequence and a corresponding three-dimensional position of the missing area; applying a machine learning model to backbone Cartesian coordinates of the protein's primary sequence and a protein vector of a representation of the protein's pr…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G16B15/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 5, 2024 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).