System and method for retrieval-based controllable molecule generation

US12159694B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12159694-B2
Application number  US-202318353773-A
Country             US
Kind code           B2
Filing date         Jul 17, 2023
Priority date       Jul 15, 2022
Publication date    Dec 3, 2024
Grant date          Dec 3, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A machine learning framework is described for performing generation of candidate molecules for, e.g., drug discovery or other applications. The framework utilizes a pre-trained encoder-decoder model to interface between representations of molecules and embeddings for those molecules in a latent space. A fusion module is located between the encoder and decoder and is used to fuse an embedding for an input molecule with embeddings for one or more exemplary molecules selected from a database that is constructed according to a design criteria. The fused embedding is decoded using the decoder to generate a candidate molecule. The fusion module is trained to reconstruct a nearest neighbor to the input molecule from the database based on the sample of exemplary molecules. An iterative approach may be used during inference to dynamically update the database to include newly generated candidate molecules.
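The fusion step described in the abstract can be illustrated with a minimal sketch, assuming single-head scaled dot-product cross-attention in which the input molecule's token embeddings act as queries and the exemplar embeddings act as keys and values. The function names, the projection matrices `w_q`, `w_k`, `w_v`, and the toy numbers are all hypothetical; the encoder and decoder are omitted, and nothing here is taken from the patent's actual implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def cross_attention_fuse(e_in, e_ret, w_q, w_k, w_v):
    """Fuse input embeddings (queries) with exemplar embeddings (keys/values).

    e_in:  L x D matrix, token embeddings of the input molecule.
    e_ret: M x D matrix, token embeddings of the retrieved exemplars.
    Returns an L x D fused embedding (same shape as e_in).
    """
    q = matmul(e_in, w_q)
    k = matmul(e_ret, w_k)
    v = matmul(e_ret, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # L x M similarity scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)                     # weighted mix of exemplar values

# Toy data (assumed): 3 input tokens, 2 exemplar tokens, embedding size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]               # stand-in for trained projections
e_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e_ret = [[0.5, -0.5], [2.0, 1.0]]
fused = cross_attention_fuse(e_in, e_ret, identity, identity, identity)
```

In this sketch each fused row is a convex combination of the exemplar value rows, so the fused embedding keeps the input's shape while drawing its content from the retrieved exemplars. In a trained system the projections would be learned, which is the part the abstract says is trained to reconstruct the input's nearest neighbor.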

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining, using a machine learning framework, a candidate molecule for satisfying a design criteria, the method comprising: receiving an input molecule data structure and the design criteria; selecting, based on the input molecule data structure and the design criteria, a plurality of exemplary molecule data structures from a database; providing the input molecule data structure as input to a trained encoder of the machine learning framework and processing, by the trained encoder, the input molecule data structure to generate an embedding of the input molecule data structure, wherein the embedding of the input molecule data structure is a numerical vector or tensor of a pre-defined dimension; providing the plurality of exemplary molecule data structures as input to the trained encoder and processing by the trained encoder, the plurality of exemplary molecule data structures to generate embeddings of the plurality of exemplary molecule data structures, wherein each respective embedding of a respective exemplary molecule data structure is a numerical vector or tensor of a pre-defined dimension; fusing, by a trained cross-attention mechanism, the embedding of the input molecule data structure and the embeddings of the exemplary molecule data structures to generate a fused embedding; and providing the fused embedding as input to a trained decoder of the machine learning framework and processing, by the trained decoder, the fused embedding to generate a candidate molecule data structure.

2. The method of claim 1, wherein the encoder comprises a bidirectional encoder and the decoder comprises an autoregressive decoder.

3. The method of claim 2, wherein the encoder and decoder are trained using a ZINC dataset.

4. The method of claim 1, wherein the input molecule data structure and the plurality of exemplary molecule data structures are simplified molecular-input line-entry system (SMILES) string data structures.

5. The method of claim 1, wherein the pre-trained cross-attention mechanism is trained in accordance with an objective to predict a nearest neighbor of the input molecule data structure in a training data set stored in the database, given as:

$$\mathcal{L}(\theta) = \sum_{i=1} \mathrm{CE}\left(\mathrm{DEC}\left(f_{\mathrm{CA}}\left(e_{\mathrm{in}}^{(i)}, E_{r}^{(i)}; \theta\right)\right), x_{1\mathrm{NN}}^{(i)}\right).$$

6. The method of claim 1, wherein the selecting, based on the input molecule data structure and the design criteria, the plurality of exemplary molecule data structures from the database comprises: calculating, in accordance with a score function, a score value for each of a plurality of molecule data structures stored in the database; and selecting, via a retriever of the machine learning framework, K molecule data structures from the database as the plurality of exemplary molecule data structures, wherein the K exemplary molecule data structures are the K molecule data structures in the database having the top score values.

7. The method of claim 6, wherein the design criteria specifies L properties for the candidate molecule, and wherein each molecule stored in the database has at least one predicted property value of L properties that is greater than a threshold value.

8. The method of claim 1, wherein the encoder, the decoder, and the cross-attention mechanism comprise instructions configured to be executed by one or more processors of a computer device.

9. The method of claim 1, further comprising: generating, based on the fused embedding, a plurality of perturbed embeddings by adding noise to the fused embedding; providing each perturbed embedding of the plurality of perturbed embeddings to the trained decoder and processing, by the trained decoder, the plurality of perturbed embeddings to generate a plurality of second candidate molecule data structures; calculating a score value for each second candidate molecule data structure of the plurality of second candidate molecule data structures; and selecting a respective second candidate molecule data structure with the highest score value of the score values calculated for the second candidate molecule data structures as a best candidate molecule data structure.

10. The method of claim 9, further comprising: calculating a score value for the input molecule data structure; comparing the score value for the input molecule data structure with the score value for the best candidate molecule data structure; in response to determining that the score value for the best candidate molecule data structure is greater than the score value for the input molecule data structure, updating the database by adding the best candidate molecule data structure to the database; and repeating the method for a new input molecule using the updated database.

11. The method of claim 1, wherein providing the input molecule data structure and generating the embedding of the input molecule data structure is performed by a first instance of the trained encoder, wherein the providing the plurality of exemplary molecule data structures as input to the trained encoder and generating the embeddings of the plurality of exemplary molecule data structures is performed via a plurality of second instances of the trained encoder, and wherein the first instance of the trained encoder and the plurality of second instances of the trained encoder generate the embedding of the input molecule data structure and the embeddings of the plurality of exemplary molecule data structures in parallel.

12. The method of claim 1, wherein (i) the providing the input molecule data structure as input to the trained encoder and generating the embedding of the input molecule data structure and (ii) the providing the plurality of exemplary molecule data structures as input to the trained encoder and generating the embeddings of the plurality of exemplary molecule data structures are performed sequentially using a single instance of the trained encoder.

13. A system for determining, using a machine learning framework, a candidate molecule for satisfying a design criteria, the system comprising: a memory storing a database containing a plurality of molecule data structures; and at least one processor, communicatively coupled to the memory, and the at least one processor being configured to: receive an input molecule data structure and the design criteria; select, based on the input molecule data structure and the design criteria, a plurality of exemplary molecule data structures from the database; provide the input molecule data structure as input to a trained encoder of the machine learning framework and process, via the trained encoder, the input molecule data structure to generate an embedding of the input molecule data structure, wherein the embedding of the input molecule data structure is a numerical vector or tensor of a pre-defined dimension; provide the plurality of exemplary molecule data structures as input to the trained encoder and process via the trained encoder, the plurality of exemplary molecule data structures to generate embeddings of the plurality of exemplary molecule data structures, wherein each respective embedding of a respective exemplary molecule data structure is a numerical vector or tensor of a pre-defined dimension; fuse, via a trained cross-attention mechanism, the embedding of the input molecule data structure and the embeddings of the exemplary molecule data structures to generate a fused embedding; and provide the fused embedding as input to a trained decoder of the machine learning framework and process, via the trained decoder, the fused embedding to generate a candidate molecule data structure.

14. The sy
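The retrieval and refinement steps recited in claims 6, 9 and 10 can be sketched as a single iteration: score the database and keep the top K entries, fuse, add noise to the fused embedding, decode the perturbed copies, keep the best-scoring candidate, and add it to the database only if it beats the input. This is a toy illustration under stated assumptions: "molecules" are plain 2-D float vectors rather than SMILES strings, the noise is assumed to be Gaussian, and `toy_score`, `toy_fuse`, the identity decoder and all parameter values are hypothetical stand-ins for the trained components.

```python
import math
import random

def top_k(database, score_fn, k):
    """Claim 6: score every database entry and keep the K highest-scoring ones."""
    return sorted(database, key=score_fn, reverse=True)[:k]

def perturb(embedding, n_samples, sigma, rng):
    """Claim 9: make noisy copies of a fused embedding (Gaussian noise is an assumption)."""
    return [[x + rng.gauss(0.0, sigma) for x in embedding] for _ in range(n_samples)]

def refine_step(input_mol, database, score_fn, fuse_fn, decode_fn,
                k, n_samples, sigma, rng):
    """One retrieve, fuse, perturb, decode and select iteration.

    Returns (molecule, improved). The database is extended in place only when
    the best candidate scores higher than the input (claim 10).
    """
    exemplars = top_k(database, score_fn, k)
    fused = fuse_fn(input_mol, exemplars)
    candidates = [decode_fn(z) for z in perturb(fused, n_samples, sigma, rng)]
    best = max(candidates, key=score_fn)
    if score_fn(best) > score_fn(input_mol):
        database.append(best)
        return best, True
    return input_mol, False

# --- Toy stand-ins (hypothetical, not the patent's trained models) ---
TARGET = [1.0, 1.0]                      # pretend design criteria: be close to TARGET

def toy_score(mol):
    """Higher is better: negative Euclidean distance to TARGET."""
    return -math.dist(mol, TARGET)

def toy_fuse(input_mol, exemplars):
    """Stand-in for the fusion module: element-wise mean of input and exemplars."""
    rows = [input_mol] + exemplars
    return [sum(col) / len(rows) for col in zip(*rows)]

rng = random.Random(0)
database = [[0.9, 1.2], [0.2, 0.1], [1.5, 0.8]]
start = [0.0, 0.0]
size_before = len(database)
best, improved = refine_step(start, database, toy_score, toy_fuse, list,
                             k=2, n_samples=8, sigma=0.05, rng=rng)
```

Claim 10 then repeats the method with a new input molecule against the updated database. Feeding `best` back in as the next input would be one possible choice, but that is an assumption of this sketch rather than something the claim text specifies.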

Assignees

  • Nvidia Corp

Inventors

Classifications

CPC titles:

  • Inference or reasoning models
  • Machine learning
  • Probabilistic graphical models, e.g. probabilistic networks
  • Using kernel methods, e.g. support vector machines [SVM]
  • Analysis or design of chemical reactions, syntheses or processes

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12159694B2 cover?
A machine learning framework for generating candidate molecules, e.g., for drug discovery. A pre-trained encoder-decoder model maps between molecule representations and latent-space embeddings, and a fusion module between the encoder and decoder fuses the input molecule's embedding with embeddings of exemplary molecules selected from a database constructed according to a design criteria. The decoder turns the fused embedding into a candidate molecule.
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G16C20/90. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 3, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).