Machine learning system with two encoder towers for semantic matching

US12191004B2 · US · B2

Patent metadata
Publication number: US-12191004-B2
Application number: US-202217850763-A
Country: US
Kind code: B2
Filing date: Jun 27, 2022
Priority date: Jun 27, 2022
Publication date: Jan 7, 2025
Grant date: Jan 7, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure describes a machine learning system that includes a contrastive learning based two-tower model for retrieval of relevant chemical reaction procedures given a query chemical reaction. The two-tower model uses attention-based transformers and neural networks to convert tokenized representations of chemical reactions and chemical reaction procedures to embeddings in a shared embedding space. Each tower can include a transformer network, a pooling layer, a normalization layer, and a neural network. The model is trained with labeled data pairs that include a chemical reaction and the text of a chemical reaction procedure for that chemical reaction. New queries can locate chemical reaction procedures for performing a given chemical reaction as well as procedures for similar chemical reactions. The architecture and training of the model make it possible to perform semantic matching based on chemical structures. The model is highly accurate providing an average recall at K=5 of 95.9%.
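The recall at K figure quoted in the abstract can be illustrated with a short sketch. It assumes cosine similarity as the ranking score and one labeled procedure per query; the function name `recall_at_k` and the toy data are hypothetical and are not taken from the patent.

```python
import numpy as np

def recall_at_k(query_emb, corpus_emb, true_idx, k=5):
    """Fraction of queries whose labeled corpus item is among the k most similar items."""
    # L2-normalize rows so the dot product equals cosine similarity (an assumption).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_queries, n_corpus)
    top_k = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best matches per query
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data: each query is a slightly perturbed copy of its labeled corpus embedding.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(20, 8))
true_idx = np.array([3, 7, 11])
queries = corpus[true_idx] + 0.01 * rng.normal(size=(3, 8))
score = recall_at_k(queries, corpus, true_idx, k=5)
```

In this reading, a reported average recall at K=5 of 95.9% would mean the labeled procedure appears among the top five retrieved candidates for about 96 of every 100 test queries.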

First claim

Opening claim text (preview).

The invention claimed is:

1. A machine learning system for identifying one or more candidate chemical reaction procedures from a chemical reaction sketch, the system comprising: a processor; a memory comprising computer-readable instructions executable by the processor; a datastore comprising a corpus of chemical reaction procedures; an interface configured to receive the chemical reaction sketch from a user computing device; a reaction encoder configured to create a reaction embedding of the chemical reaction sketch, the reaction encoder comprising a reaction transformer network, a reaction pooling layer, a reaction normalization layer, and a reaction neural network; a procedure encoder configured to create procedure embeddings of the chemical reaction procedures in the corpus of chemical reaction procedures, the procedure encoder comprising a procedure transformer network, a procedure pooling layer, a procedure normalization layer, and a procedure neural network; a similarity-assessing mechanism configured to determine a similarity between the reaction embedding and the procedure embeddings in a shared embedding space; and an output mechanism configured to provide to the interface a predetermined number of candidate chemical reaction procedures from the corpus of chemical reaction procedures, the candidate chemical reaction procedures corresponding to procedure embeddings identified by the similarity-assessing mechanism as having the highest similarity to the reaction embedding.

2. The machine learning system of claim 1, wherein the reaction transformer network comprises multiple layers each including a multi-head attention layer and a fully-connected feed-forward layer configured to generate a reaction transformer output and the procedure transformer network comprises multiple layers each including a multi-head attention layer and a fully-connected feed-forward layer configured to generate a procedure transformer output.

3. The machine learning system of claim 2, wherein the reaction pooling layer generates a first high-dimensional vector from the reaction transformer output and the procedure pooling layer generates a second high-dimensional vector from the procedure transformer output.

4. The machine learning system of claim 3, wherein the reaction normalization layer generates a first normalized vector from the first high-dimensional vector and the procedure normalization layer generates a second normalized vector from the second high-dimensional vector.

5. The machine learning system of claim 4, wherein the reaction neural network and the procedure neural network are both fully connected, feed-forward, multilayer neural networks and the reaction neural network is configured to generate the reaction embedding from the first normalized vector and the procedure neural network is configured to generate the procedure embedding from the second normalized vector.

6. The machine learning system of claim 1, wherein the reaction encoder and the procedure encoder are trained using contrasting learning on labeled pair-wise data of training chemical reaction procedures and training representations of chemical reactions, the training chemical reaction procedures provided to the procedure encoder and the training representations of chemical reactions provided to the reaction encoder.

7. A computer-implemented method of identifying one or more chemical reaction procedures from a chemical reaction sketch comprising: receiving from a user computing device the chemical reaction sketch; tokenizing the chemical reaction sketch to create a reaction token sequence; generating a reaction embedding from the reaction token sequence by a reaction encoder of a contrastive learning based two-tower model, the contrastive learning based two-tower model trained by contrastive loss on training data that includes training chemical reactions and training chemical reaction procedures for performing the training chemical reactions; determining similarity between the reaction embedding and procedure embeddings in a shared embedding space, the procedure embeddings generated by a procedure encoder of the contrastive learning based two-tower model from chemical reaction procedures in a corpus of chemical reaction procedures; and outputting a predetermined number of candidate chemical reaction procedures corresponding to procedure embeddings having a highest similarity to the reaction embedding.

8. The computer-implemented method of claim 7, wherein the chemical reaction sketch is a simplified molecular-input line-entry system (SMILES) representation of all or part of a chemical reaction.

9. The computer-implemented method of claim 7, wherein the similarity is a semantic similarity based on functional groups and carbon backbone structures.

10. The computer-implemented method of claim 7, wherein the reaction encoder comprises: a transformer network that generates a transformer output from the reaction token sequence; a pooling layer that generates a high-dimensional vector from the transformer output; a normalization layer that generates a normalized vector from the high-dimensional vector; and a neural network that generates the reaction embedding in the shared embedding space from the normalized vector.

11. The computer-implemented method of claim 10, wherein the transformer network has six layers, the pooling layer comprises a max pooler, the high-dimensional vector has 512 dimensions, and the neural network has two layers.

12. The computer-implemented method of claim 7, wherein the procedure encoder comprises: a transformer network that generates a transformer output from procedure token sequences that are tokenizations of the chemical reaction procedures; a pooling layer that generates a high-dimensional vector from the transformer output; a normalization layer that generates a normalized vector from the high-dimensional vector; and a neural network that generates the procedure embeddings in the shared embedding space from the normalized vector.

13. The computer-implemented method of claim 12, wherein the transformer network has six layers, the pooling layer comprises a max pooler, the high-dimensional vector has 512 dimensions, and the neural network has two layers.

14. A computer-implemented method of training a machine learning system for identifying chemical reaction procedures from chemical reaction sketches comprising: accessing training data from a training datastore, the training data comprising labeled data pairs of training chemical reactions and training chemical reaction procedures for performing the chemical reactions; tokenizing the training chemical reactions from the training data to create reaction token sequences; providing the reaction token sequences to a reaction encoder that generates reaction embeddings in a shared embedding space; tokenizing the training chemical reaction procedures from the training data to create procedure token sequences; providing the procedure token sequences to a procedure encoder that generates procedure embeddings in the shared embedding space; and training the reaction encoder and the procedure encoder with the training data by backpropagation to minimize a loss function between corresponding pairs of the reaction embeddings and the procedure embeddings.

15. The computer-implemented method of claim 14, further comprising cleaning the training data by separating the training chemical reactions into reactants and products.

16. The computer-implemented method of claim 14, further comprising cleaning the training data by removing any represent
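The retrieval step of claims 1 and 7 and the paired training objective of claim 14 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the patented implementation: the claim preview does not specify the similarity measure or the form of the contrastive loss, so cosine similarity and a symmetric in-batch softmax loss with a temperature are assumed here. The function names `top_k_procedures` and `in_batch_contrastive_loss` are hypothetical, and the embeddings are stand-ins for encoder outputs.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_procedures(reaction_emb, procedure_embs, k=5):
    """Indices and scores of the k procedure embeddings closest to one reaction embedding."""
    # Cosine similarity in the shared embedding space (assumed similarity measure).
    sims = l2_normalize(procedure_embs) @ l2_normalize(reaction_emb)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def in_batch_contrastive_loss(reaction_embs, procedure_embs, temperature=0.1):
    """Symmetric softmax cross-entropy; row i of each matrix forms a labeled pair."""
    logits = (l2_normalize(reaction_embs) @ l2_normalize(procedure_embs).T) / temperature

    def diagonal_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)    # subtract row max for numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))     # the matching pair sits on the diagonal

    # Average the reaction-to-procedure and procedure-to-reaction directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Retrieval demo: the query is a noisy copy of procedure 2, so index 2 should rank first.
rng = np.random.default_rng(0)
procedure_embs = rng.normal(size=(6, 16))
reaction_emb = procedure_embs[2] + 0.01 * rng.normal(size=16)
indices, scores = top_k_procedures(reaction_emb, procedure_embs, k=3)

# Loss demo: correctly paired rows give a lower loss than mismatched rows.
pairs = np.eye(4)
aligned_loss = in_batch_contrastive_loss(pairs, pairs)
shuffled_loss = in_batch_contrastive_loss(pairs, pairs[::-1])
```

Because the procedure embeddings depend only on the corpus, a system of this kind could compute them once and reuse them for every query, leaving only the reaction embedding and the similarity ranking to run at query time.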

Assignees

Inventors

Classifications

  • Architecture, e.g. interconnection topology

  • Analysis or design of chemical reactions, syntheses or processes

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Combinations of networks

  • Searching chemical structures or physicochemical data

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12191004B2 cover?
This disclosure describes a machine learning system that includes a contrastive learning based two-tower model for retrieval of relevant chemical reaction procedures given a query chemical reaction. The two-tower model uses attention-based transformers and neural networks to convert tokenized representations of chemical reactions and chemical reaction procedures to embeddings in a shared embedd…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G16C20/70. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 7, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).