Methods, systems, and software for identifying bio-molecules with interacting components

US9665694B2 · US · B2

Patent metadata
Publication number: US-9665694-B2
Application number: US-201414167709-A
Country: US
Kind code: B2
Filing date: Jan 29, 2014
Priority date: Jan 31, 2013
Publication date: May 30, 2017
Grant date: May 30, 2017

Abstract

Official abstract text for this publication.

The present invention provides methods for rapidly and efficiently searching biologically-related data space. More specifically, the present invention provides methods for identifying bio-molecules with desired properties, or which are most suitable for acquiring such properties, from complex bio-molecule libraries or sets of such libraries. The present invention also provides methods for modeling sequence-activity relationships, including but not limited to stepwise addition or subtraction techniques, Bayesian regression, ensemble regression and other methods. The present invention further provides digital systems and software for performing the methods provided herein.
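As a rough illustration of the sequence-activity modeling mentioned in the abstract, the Python sketch below encodes a handful of variants as indicator variables (one per mutation at a position) and fits a purely additive linear model by least squares. The function names, the reference sequence, the variants, and the activity values are all made up for the example; the patent does not prescribe this code or NumPy.

```python
import numpy as np

def one_hot_encode(variants, reference):
    """One indicator column per non-reference residue observed at each position."""
    names = sorted({(i, aa) for v in variants for i, aa in enumerate(v) if aa != reference[i]})
    X = np.array([[1.0 if v[i] == aa else 0.0 for i, aa in names] for v in variants])
    return X, names

def fit_linear(X, y):
    """Least-squares fit of y ~ intercept + X @ coef; returns coefficients and RSS."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    return coef, rss

def aic(rss, n, k):
    """Akaike Information Criterion for a Gaussian linear model; lower is better."""
    return n * np.log(max(rss, 1e-12) / n) + 2 * k  # clamp avoids log(0) on exact fits

# Hypothetical toy data: reference "AKLV" and invented activity values.
variants = ["AKLV", "ARLV", "AKLI", "ARLI", "GKLV", "GRLV"]
activity = np.array([1.0, 3.0, 0.5, 2.5, 1.0, 3.0])

X, names = one_hot_encode(variants, "AKLV")
coef, rss = fit_linear(X, activity)
score = aic(rss, n=len(activity), k=len(coef))

for name, c in zip(["intercept"] + names, coef):
    print(name, round(float(c), 3))
```

Each fitted coefficient estimates the contribution of one residue at one position, which is the kind of linear term the claims describe. The `2 * k` penalty in `aic` is one simple way to bias model comparison against extra terms.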

First claim

Opening claim text (preview).

What is claimed is:

1. A method for activity conducting directed evolution of one or more polypeptide or polynucleotide molecules, the method comprising:

    (a) receiving sequence data of a plurality of polypeptide molecules or a plurality of polynucleotide molecules encoding the plurality of polypeptide molecules, wherein the sequence data comprises identities and positions of a plurality of amino acids for each molecule of the plurality of polypeptide molecules or a plurality of nucleotides for each molecule of the plurality of polynucleotide molecules;

    (b) receiving activity data of the plurality of polypeptide molecules;

    (c) fitting a base model to the received sequence data and the received activity data, wherein the base model receives as one or more inputs one or more amino acids of a polypeptide molecule or one or more nucleotides of a polynucleotide molecule encoding the polypeptide molecule and provides as an output an activity of the polypeptide molecule, the base model includes either (i) a plurality of linear terms and no interaction term or (ii) a plurality of linear terms and one or more interaction terms, each linear term comprises a coefficient and an independent variable representing an amino acid or a nucleotide at a sequence position, and each interaction term comprises a coefficient and two or more independent variables representing two or more interacting amino acids at two or more sequence positions or nucleotides encoding the two or more interacting amino acids;

    (d) determining a predictive ability of the base model in predicting activity from the identities and the positions of the plurality of amino acids or the plurality of nucleotides, wherein the predictive ability is determined with a bias against including additional terms;

    (e) fitting at least one new model to the received sequence data and the received activity data, wherein the at least one new model is obtained by adding at least one new interaction term to the base model of (i) or (ii) or subtracting at least one new interaction term to or from the base model of (ii);

    (f) determining a predictive ability of the at least one new model in predicting activity from the identities and the positions of the plurality of amino acids or the plurality of nucleotides, wherein the predictive ability is determined with a bias against including additional terms;

    (g) selecting a model from among the base model and the at least one new model based on the predictive ability of the base model and the predictive ability of the at least one new model;

    (h) determining one or more amino acid sequences or one or more nucleic acid sequences using the selected model;

    (i) synthesizing one or more amino acid molecules or one or more nucleic acid molecules based on the one or more amino acid sequences or one or more nucleic acid sequences; and

    (j) recombining or performing mutagenesis on the one or more amino acid molecules or one or more nucleic acid molecules to provide the one or more polypeptide or polynucleotide molecules.

2. The method of claim 1, wherein the at least one new model in (e) is produced by using prior information to determine posterior probability distributions of the new model.

3. The method of claim 2, wherein the at least one new model is produced by using Gibbs sampling to fit a model to the sequence and activity data.

4. The method of claim 1, wherein the at least one new model comprises two or more new models, each of which includes different interaction terms.

5. The method of claim 4, further comprising preparing an ensemble model based on the two or more new models, wherein the ensemble model includes interaction terms from the two or more new models, and the interaction terms are weighted by the ability of the two or more new models to predict activity as determined in (d).

6. The method of claim 1, further comprising, after (g): repeating (c)-(g) for one or more iterations using the selected model from (g) in place of the base model in (c) and adding or subtracting an interaction term that has not been added or subtracted in any selected model of any previous iteration.

7. The method of claim 1, wherein the predictive ability of the at least one new model in predicting activity in (f) is measured by Akaike Information Criterion or Bayesian Information Criterion.

8. The method of claim 1, wherein the plurality of polypeptide molecules constitutes a training set of a protein variant library.

9. The method of claim 1, wherein the at least one new interaction term in (e) consists of one interaction term.

10. The method of claim 1, wherein the one or more interaction terms of (b)(ii) comprise one or more interaction terms for a defined set of one or more combinations of two or more interacting amino acids or one or more interaction terms for a defined set of one or more combinations of nucleotides encoding the two or more interacting amino acids.

11. The method of claim 1, wherein (h) comprises: selecting one or more mutations for a round of directed evolution by evaluating the coefficients of the two or more of the plurality of terms of the selected model to identify one or more defined amino acids or nucleotides at defined sequence positions that contribute to the activity; and determining a plurality of oligonucleotides containing or encoding the one or more mutations, wherein the plurality of oligonucleotides comprise at least portions of the one or more nucleic acid sequences.

12. The method of claim 11, wherein selecting mutations for a round of directed evolution comprises identifying one or more coefficients that are determined to be larger than others of the coefficients, and selecting the defined amino acid or nucleotide at a defined position represented by the one or more coefficients so identified.

13. The method of claim 11, further comprises synthesizing the plurality of oligonucleotides using a nucleic acid synthesizer.

14. The method of claim 1, wherein (j) comprises fragmenting and recombining a polynucleotide molecule encoding a polypeptide molecule that is predicted by the selected model to have a desired level of activity.

15. The method of claim 1, wherein (j) comprises performing saturation mutagenesis on a polypeptide molecule that is predicted by the selected model to have a desired level of activity.

16. The method of claim 1, wherein (h) comprises: selecting one or more mutations by evaluating the coefficients of the selected model to identify one or more defined amino acids or nucleotides at defined sequence positions that contribute to the activity; and identifying a new protein or a new nucleic acid sequence comprising the one or more mutations.

17. The method of claim 1, wherein (h) comprises: selecting one or more positions in an amino acid sequence or nucleic acid sequence by evaluating coefficients of the selected model to identify one or more defined amino acids or nucleotides at the one or more positions that contribute to the activity; and conducting saturation mutagenesis at the one or more positions.

18. A computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for identifying biological molecules to affect a desired activity, the method comprising: (a) receiving sequence data of a plurality of polypeptide molecules or a plurality of polynucleotide molecules encoding the plurality of polypep
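Steps (c) through (g) of claim 1, iterated as in claim 6 and scored with the AIC option of claim 7, can be sketched in Python roughly as follows. This is a minimal illustration and not the patented implementation: it only adds pairwise interaction terms (never subtracts), uses ordinary least squares rather than Bayesian fitting, and runs on synthetic data. All function names, the reference sequence "AKL", the toy activity function, and the noise values are assumptions made up for the example.

```python
import itertools
import numpy as np

def design_matrix(variants, reference, pairs=()):
    """Columns: intercept, one indicator per observed mutation, one product per pair."""
    muts = sorted({(i, aa) for v in variants for i, aa in enumerate(v) if aa != reference[i]})
    def has(v, m):
        return 1.0 if v[m[0]] == m[1] else 0.0
    rows = [[1.0] + [has(v, m) for m in muts] + [has(v, a) * has(v, b) for a, b in pairs]
            for v in variants]
    return np.array(rows), muts

def fit_aic(A, y):
    """Least-squares fit; AIC = n*ln(RSS/n) + 2k penalizes each additional term."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = max(float(np.sum((y - A @ coef) ** 2)), 1e-12)  # guard against log(0)
    n, k = A.shape
    return n * np.log(rss / n) + 2 * k, coef

def stepwise_add(variants, y, reference):
    """Greedily add the interaction term that lowers AIC most; stop when none does."""
    A, muts = design_matrix(variants, reference)
    best_score, best_coef = fit_aic(A, y)
    candidates = [(a, b) for a, b in itertools.combinations(muts, 2) if a[0] != b[0]]
    selected = []
    while candidates:
        trials = []
        for pair in candidates:
            A, _ = design_matrix(variants, reference, selected + [pair])
            score, coef = fit_aic(A, y)
            trials.append((score, pair, coef))
        score, pair, coef = min(trials, key=lambda t: t[0])
        if score >= best_score:
            break  # no candidate improves on the current model
        best_score, best_coef = score, coef
        selected.append(pair)
        candidates.remove(pair)
    return selected, best_score, best_coef, muts

def toy_activity(v):
    """Synthetic ground truth (hypothetical): additive effects plus one R-I interaction."""
    g, r, i = v[0] == "G", v[1] == "R", v[2] == "I"
    return 1.0 + 0.5 * g + 2.0 * r - 0.5 * i + 3.0 * (r and i)

variants = ["AKL", "GKL", "ARL", "AKI", "GRL", "GKI", "ARI", "GRI"]
noise = np.array([0.05, -0.03, 0.02, -0.04, 0.01, 0.03, -0.02, -0.01])
y = np.array([toy_activity(v) for v in variants]) + noise

base_aic, _ = fit_aic(design_matrix(variants, "AKL")[0], y)
selected, final_aic, coef, muts = stepwise_add(variants, y, "AKL")

# In the spirit of claim 12: rank terms by coefficient size to suggest mutations.
ranked = sorted(zip(muts + selected, coef[1:]), key=lambda t: -t[1])
for term, c in ranked:
    print(term, round(float(c), 3))
```

On this toy data the interaction between R at position 1 and I at position 2 is the first term added, because it removes most of the residual error left by the additive base model. Swapping the `2 * k` penalty for `k * np.log(n)` would give a BIC-style score instead.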

Assignees

Codexis Inc

Inventors

Classifications

  • ICT specially adapted for analysing two-dimensional [2D] or three-dimensional [3D] molecular structures, e.g. structural or functional relations or structure alignment · CPC title

  • G16C10/00 (Primary)

    Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like · CPC title

  • ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations · CPC title

  • ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks · CPC title

  • In silico combinatorial chemistry · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9665694B2 cover?
The present invention provides methods for rapidly and efficiently searching biologically-related data space. More specifically, the present invention provides methods for identifying bio-molecules with desired properties, or which are most suitable for acquiring such properties, from complex bio-molecule libraries or sets of such libraries. The present invention also provides methods for model…
Who is the assignee on this patent?
Codexis Inc
What technology area does this patent fall under?
Primary CPC classification G16C10/00. Mapped technology areas include Physics.
When was this patent published?
Publication date May 30, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).