What technology area does this patent fall under?

Primary CPC classification G16C20/90. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 3, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for retrieval-based controllable molecule generation

US12159694B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12159694-B2
Application number	US-202318353773-A
Country	US
Kind code	B2
Filing date	Jul 17, 2023
Priority date	Jul 15, 2022
Publication date	Dec 3, 2024
Grant date	Dec 3, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A machine learning framework is described for performing generation of candidate molecules for, e.g., drug discovery or other applications. The framework utilizes a pre-trained encoder-decoder model to interface between representations of molecules and embeddings for those molecules in a latent space. A fusion module is located between the encoder and decoder and is used to fuse an embedding for an input molecule with embeddings for one or more exemplary molecules selected from a database that is constructed according to a design criteria. The fused embedding is decoded using the decoder to generate a candidate molecule. The fusion module is trained to reconstruct a nearest neighbor to the input molecule from the database based on the sample of exemplary molecules. An iterative approach may be used during inference to dynamically update the database to include newly generated candidate molecules.
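
The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes a single-head scaled dot-product cross-attention in which the input-molecule embedding supplies the queries and the exemplar embeddings supply the keys and values. The function name `cross_attention_fuse`, the random projection matrices, and all shapes are hypothetical stand-ins chosen for illustration; the patent text shown here does not specify the layer layout, and a real system would use trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(e_in, e_retrieved, w_q, w_k, w_v):
    """Fuse an input embedding with exemplar embeddings (illustrative only).

    e_in:        (L, D)      token embeddings of the input molecule
    e_retrieved: (K, L_r, D) token embeddings of K exemplar molecules
    returns:     fused embedding of shape (L, D_v) and the attention weights
    """
    # Flatten all exemplar tokens into one key/value sequence of length K * L_r.
    kv_source = e_retrieved.reshape(-1, e_retrieved.shape[-1])
    q = e_in @ w_q        # queries come from the input molecule
    k = kv_source @ w_k   # keys come from the exemplars
    v = kv_source @ w_v   # values come from the exemplars
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

# Hypothetical sizes: 5 input tokens, 3 exemplars of 4 tokens each, width 8.
rng = np.random.default_rng(0)
D = 8
e_in = rng.normal(size=(5, D))
e_retrieved = rng.normal(size=(3, 4, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))

fused, weights = cross_attention_fuse(e_in, e_retrieved, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, so every input token receives a convex combination of the projected exemplar tokens. In the framework described above, the fused embedding would then be passed to the decoder.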

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining, using a machine learning framework, a candidate molecule for satisfying a design criteria, the method comprising: receiving an input molecule data structure and the design criteria; selecting, based on the input molecule data structure and the design criteria, a plurality of exemplary molecule data structures from a database; providing the input molecule data structure as input to a trained encoder of the machine learning framework and processing, by the trained encoder, the input molecule data structure to generate an embedding of the input molecule data structure, wherein the embedding of the input molecule data structure is a numerical vector or tensor of a pre-defined dimension; providing the plurality of exemplary molecule data structures as input to the trained encoder and processing by the trained encoder, the plurality of exemplary molecule data structures to generate embeddings of the plurality of exemplary molecule data structures, wherein each respective embedding of a respective exemplary molecule data structure is a numerical vector or tensor of a pre-defined dimension; fusing, by a trained cross-attention mechanism, the embedding of the input molecule data structure and the embeddings of the exemplary molecule data structures to generate a fused embedding; and providing the fused embedding as input to a trained decoder of the machine learning framework and processing, by the trained decoder, the fused embedding to generate a candidate molecule data structure.

2. The method of claim 1, wherein the encoder comprises a bidirectional encoder and the decoder comprises an autoregressive decoder.

3. The method of claim 2, wherein the encoder and decoder are trained using a ZINC dataset.

4. The method of claim 1, wherein the input molecule data structure and the plurality of exemplary molecule data structures are simplified molecular-input line-entry system (SMILES) string data structures.

5. The method of claim 1, wherein the pre-trained cross-attention mechanism is trained in accordance with an objective to predict a nearest neighbor of the input molecule data structure in a training data set stored in the database, given as: $\mathcal{L}(\theta) = \sum_{i=1} \mathrm{CE}\left(\mathrm{DEC}\left(f_{CA}\left(e_{in}^{(i)}, E_{r}^{(i)}; \theta\right)\right), x_{1NN}^{(i)}\right)$.

6. The method of claim 1, wherein the selecting, based on the input molecule data structure and the design criteria, the plurality of exemplary molecule data structures from the database comprises: calculating, in accordance with a score function, a score value for each of a plurality of molecule data structures stored in the database; and selecting, via a retriever of the machine learning framework, K molecule data structures from the database as the plurality of exemplary molecule data structures, wherein the K exemplary molecule data structures are the K molecule data structures in the database having the top score values.

7. The method of claim 6, wherein the design criteria specifies L properties for the candidate molecule, and wherein each molecule stored in the database has at least one predicted property value of L properties that is greater than a threshold value.

8. The method of claim 1, wherein the encoder, the decoder, and the cross-attention mechanism comprise instructions configured to be executed by one or more processors of a computer device.

9. The method of claim 1, further comprising: generating, based on the fused embedding, a plurality of perturbed embeddings by adding noise to the fused embedding; providing each perturbed embedding in of the plurality of perturbed embeddings to the trained decoder and processing, by the trained decoder, the plurality of perturbed embeddings to generate a plurality of second candidate molecule data structures; calculating a score value for each second candidate molecule data structure of the plurality of second candidate molecule data structures; and selecting a respective second candidate molecule data structure with the highest score value of the score values calculated for the second candidate molecule data structures as a best candidate molecule data structure.

10. The method of claim 9, further comprising: calculating a score value for the input molecule data structure; comparing the score value for the input molecule data structure with the score value for the best candidate molecule data structure; in response to determining that the score value for the best candidate molecule data structure is greater than the score value for the input molecule data structure, updating the database by adding the best candidate molecule data structure to the database; and repeating the method for a new input molecule using the updated database.

11. The method of claim 1, wherein providing the input molecule data structure and generating the embedding of the input molecule data structure is performed by a first instance of the trained encoder, wherein the providing the plurality of exemplary molecule data structures as input to the trained encoder and generating the embeddings of the plurality of exemplary molecule data structures is performed via a plurality of second instances of the trained encoder, and wherein the first instance of the trained encoder and the plurality of second instances of the trained encoder generate the embedding of the input molecule data structure and the embeddings of the plurality of exemplary molecule data structures in parallel.

12. The method of claim 1, wherein (i) the providing the input molecule data structure as input to the trained encoder and generating the embedding of the input molecule data structure and (ii) the providing the plurality of exemplary molecule data structures as input to the trained encoder and generating the embeddings of the plurality of exemplary molecule data structures are performed sequentially using a single instance of the trained encoder.

13. A system for determining, using a machine learning framework, a candidate molecule for satisfying a design criteria, the system comprising: a memory storing a database containing a plurality of molecule data structures; and at least one processor, communicatively coupled to the memory, and the at least one processor being configured to: receive an input molecule data structure and the design criteria; select, based on the input molecule data structure and the design criteria, a plurality of exemplary molecule data structures from the database; provide the input molecule data structure as input to a trained encoder of the machine learning framework and process, via the trained encoder, the input molecule data structure to generate an embedding of the input molecule data structure, wherein the embedding of the input molecule data structure is a numerical vector or tensor of a pre-defined dimension; provide the plurality of exemplary molecule data structures as input to the trained encoder and process via the trained encoder, the plurality of exemplary molecule data structures to generate embeddings of the plurality of exemplary molecule data structures, wherein each respective embedding of a respective exemplary molecule data structure is a numerical vector or tensor of a pre-defined dimension; fuse, via a trained cross-attention mechanism, the embedding of the input molecule data structure and the embeddings of the exemplary molecule data structures to generate a fused embedding; and provide the fused embedding as input to a trained decoder of the machine learning framework and process, via the trained decoder, the fused embedding to generate a candidate molecule data structure.

14. The sy
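
The retrieval, perturbation, and database-update steps recited in claims 6, 9, and 10 can be sketched as a small loop. This is a toy illustration under stated assumptions, not the patented implementation: "molecules" are 2-D points, `score_fn` and `decode_fn` are hypothetical stand-ins for a property scorer and the trained decoder, and a plain mean replaces the trained cross-attention fusion. All function names here are invented for the example.

```python
import numpy as np

def retrieve_top_k(database, score_fn, k):
    """Claim 6 idea: score every database entry and keep the K highest."""
    return sorted(database, key=score_fn, reverse=True)[:k]

def perturb_and_select(fused, decode_fn, score_fn, n_samples, noise_std, rng):
    """Claim 9 idea: add noise to the fused embedding, decode each noisy
    copy, score the decoded candidates, and return the best one."""
    noise = rng.normal(scale=noise_std, size=(n_samples,) + fused.shape)
    candidates = [decode_fn(fused + eps) for eps in noise]
    scores = [score_fn(c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

def update_database(database, input_mol, best_mol, best_score, score_fn):
    """Claim 10 idea: add the best candidate only if it outscores the input."""
    if best_score > score_fn(input_mol):
        database.append(best_mol)
        return True
    return False

# Hypothetical toy stand-ins: a "molecule" is a 2-D point, and the score is
# the negative distance to a target point (higher is better).
target = np.array([1.0, 2.0])

def score_fn(mol):
    return -float(np.linalg.norm(np.asarray(mol, dtype=float) - target))

def decode_fn(embedding):
    return tuple(float(x) for x in np.round(embedding, 2))

rng = np.random.default_rng(0)
database = [(0.0, 0.0), (1.0, 1.5), (3.0, 3.0), (0.9, 2.2)]
input_mol = (0.0, 1.0)

exemplars = retrieve_top_k(database, score_fn, k=2)
# Placeholder for the trained cross-attention fusion: a plain average.
fused = np.mean([input_mol] + exemplars, axis=0)
best_mol, best_score = perturb_and_select(
    fused, decode_fn, score_fn, n_samples=16, noise_std=0.3, rng=rng
)
added = update_database(database, input_mol, best_mol, best_score, score_fn)
print(exemplars, best_mol, added)
```

Repeating these steps with a new input molecule and the updated `database` corresponds to the iterative inference described in claim 10; the number of noisy samples and the noise scale are arbitrary choices made for this sketch.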

Assignees

Nvidia Corp

Inventors

Classifications

G06N5/04
Inference or reasoning models · CPC title
G06N20/00
Machine learning · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title
G06N20/10
using kernel methods, e.g. support vector machines [SVM] · CPC title
G16C20/10
Analysis or design of chemical reactions, syntheses or processes · CPC title

Patent family

Related publications grouped by family.

View patent family 89576863

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12159694B2 cover?: A machine learning framework is described for performing generation of candidate molecules for, e.g., drug discovery or other applications. The framework utilizes a pre-trained encoder-decoder model to interface between representations of molecules and embeddings for those molecules in a latent space. A fusion module is located between the encoder and decoder and is used to fuse an embedding fo…
Who is the assignee on this patent?: Nvidia Corp

Related patents

Systems and methods for materials discovery using duality transforms and predictive convex hulls

Systems and methods with machine learned dataset embedding for data fusion of material property datasets

Machine-learning method and apparatus to isolate chemical signatures

Deterministic decoder variational autoencoder

Chemical compound discovery using machine learning technologies
