Molecular similarity search

US2021287762A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021287762-A1
Application number  US-202117200836-A
Country             US
Kind code           A1
Filing date         Mar 14, 2021
Priority date       Mar 16, 2020
Publication date    Sep 16, 2021
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system for finding similar molecules to a query molecule includes a GCN, a PFS vector extractor, a compensated vector comparator (CVC) and a candidate vector selector. The GCN has been trained to output a molecular property vector from an input query or input candidate molecular vectors, respectively. The GCN transforms query atomic feature set (AFS) vectors and candidate AFS vectors into query property feature set (PFS) embedding vectors and candidate PFS embedding vectors. The PFS vector extractor extracts query PFS embedding vectors and candidate PFS embedding vectors from hidden layers of the trained GCN. The compensated vector comparator (CVC) calculates a compensated similarity metric (CSM) for at least one pair of query PFS embedding vector and one candidate PFS embedding vector. The candidate vector selector selects only such candidate molecular vectors which have a value of the CSM above a pre-defined threshold value.
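As a rough illustration of the pipeline the abstract describes, the following sketch runs a small graph convolutional network over one molecule and reads the PFS embedding from a hidden layer. It is not the patented implementation: the weights are random rather than trained, and the layer sizes, the symmetric adjacency normalization, the ReLU activation, the mean pooling, and the names `normalize_adjacency`, `gcn_forward` and `extract_pfs` are all assumptions made for the example. The four hidden layers and the extraction at the fourth layer follow claims 4 and 7.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (an assumed, common GCN choice)."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(adj, afs, hidden_weights, output_weights):
    """Return the property vector and the output of every hidden layer."""
    a_norm = normalize_adjacency(adj)
    h = afs                                     # one atomic feature set (row) per atom
    hidden_outputs = []
    for w in hidden_weights:
        h = np.maximum(a_norm @ h @ w, 0.0)     # graph convolution followed by ReLU
        hidden_outputs.append(h)
    # Assumed readout: mean-pool over atoms, then a linear output layer.
    property_vector = h.mean(axis=0) @ output_weights
    return property_vector, hidden_outputs

def extract_pfs(hidden_outputs, layer=4):
    """PFS embedding: one property feature set (row) per atom, from a hidden layer."""
    return hidden_outputs[layer - 1]

# Hypothetical 3-atom chain with 8 atomic features per atom.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
afs = rng.normal(size=(3, 8))
sizes = [8, 16, 16, 16, 16]                     # input width, then four hidden widths
hidden_weights = [rng.normal(size=(sizes[i], sizes[i + 1])) for i in range(4)]
output_weights = rng.normal(size=(16, 1))       # untrained; a real GCN would be trained

property_vector, hidden_outputs = gcn_forward(adj, afs, hidden_weights, output_weights)
pfs = extract_pfs(hidden_outputs)
print(pfs.shape)                                # (atoms, hidden width)
```

In this sketch the same forward pass would be applied to the query molecule and to every candidate molecule, and only the hidden-layer rows (not the final property vector) would be passed on to the comparison step.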

First claim

Opening claim text (preview).

What is claimed is:

  1. A method for finding similar molecules to a query molecule, the method comprising: transforming query atomic feature set (AFS) vectors and candidate AFS vectors into query property feature set (PFS) embedding vectors and candidate PFS embedding vectors, utilizing a GCN that has been trained to output a molecular property vector from an input query or input candidate molecular vectors, respectively; extracting query and candidate PFS embedding vectors from hidden layers of said trained GCN; calculating a compensated similarity metric (CSM) for at least one pair of said query PFS embedding vector and one said candidate PFS embedding vector; and selecting only such said candidate molecular vectors which have a value of said CSM above a pre-defined threshold value.

  2. The method according to claim 1 wherein said compensating attempts to compensate for inaccuracies caused by a varying position of said atomic feature sets at an input layer of said trained GCN.

  3. The method according to claim 1 wherein said calculating comprises: for each candidate PFS embedding vector: summing all possible combinations of dot products between property feature sets in said query PFS embedding vector and property feature sets in said candidate PFS embedding vector; and normalizing said dot product sum, by dividing said dot product sum by the number of said property feature sets in said candidate PFS embedding vector.

  4. The method according to claim 1 wherein said trained GCN comprises an input layer, four hidden layers and an output layer.

  5. The method according to claim 1 wherein each said PFS embedding vector comprises a plurality of property feature sets.

  6. The method according to claim 1 wherein said trained GCN is trained to one of the following properties: solubility, blood brain barrier and toxicity.

  7. The method according to claim 4 wherein said extracting query and candidate PFS embedding vectors is performed at the output of the fourth said hidden layer.

  8. The method according to claim 1 wherein said candidate AFS vectors are vectors used to train said GCN.

  9. The method according to claim 1 wherein adjusting said predefined threshold value changes the number of said candidate molecular vectors deemed similar to said query molecular vector.

  10. A system for finding similar molecules to a query molecule, the system comprising: a GCN that has been trained to output a molecular property vector from an input query or input candidate molecular vectors, respectively, to transform query atomic feature set (AFS) vectors and candidate AFS vectors into query property feature set (PFS) embedding vectors and candidate PFS embedding vectors; a PFS vector extractor to extract query PFS embedding vectors and candidate PFS embedding vectors from hidden layers of said trained GCN; a compensated vector comparator (CVC) to calculate a compensated similarity metric (CSM) for at least one pair of said query PFS embedding vector and one said candidate PFS embedding vector; and a candidate vector selector to select only such said candidate molecular vectors which have a value of said CSM above a pre-defined threshold value.

  11. The system according to claim 10 wherein said compensated vector comparator (CVC) attempts to compensate for inaccuracies caused by a varying position of said atomic feature sets at an input layer of said trained GCN.

  12. The system according to claim 11 wherein said CVC comprises: a dot product summer to sum all possible combinations of dot products between property feature sets in said query PFS embedding vector and property feature sets in said candidate PFS embedding vector, for each candidate PFS embedding vector; and a DPS normalizer to normalize said DPS, by dividing said DPS by the number of said property feature sets in said candidate PFS embedding vector, for each candidate PFS embedding vector.

  13. The system according to claim 10 wherein said trained GCN comprises an input layer, four hidden layers and an output layer.

  14. The system according to claim 10 wherein each said PFS embedding vector comprises a plurality of property feature sets.

  15. The system according to claim 10 wherein said trained GCN is trained to one of the following properties: solubility, blood brain barrier and toxicity.

  16. The system according to claim 13 wherein said PFS vector extractor extracts query and candidate PFS embedding vectors from the output of the fourth said hidden layer.

  17. The system according to claim 10 wherein said candidate AFS vectors are vectors used to train said GCN.

  18. The system according to claim 10 wherein said candidate vector selector to change the value of said predefined threshold value in order to change the number of said candidate molecular vectors deemed similar to said query molecular vector.
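The comparison and selection steps in claims 1, 3 and 9 can be sketched in a few lines. This is one reading of the claim language, not code from the patent: it assumes each PFS embedding vector is stored as a 2-D array with one property feature set per row, and the function names `compensated_similarity` and `select_candidates`, the toy data, and the threshold of 0.5 are hypothetical.

```python
import numpy as np

def compensated_similarity(query_pfs, candidate_pfs):
    """Sum the dot products of every (query row, candidate row) pair,
    then divide by the number of property feature sets in the candidate."""
    # Entry (i, j) of the product is the dot product of query row i and candidate row j.
    dot_product_sum = float((query_pfs @ candidate_pfs.T).sum())
    return dot_product_sum / candidate_pfs.shape[0]

def select_candidates(query_pfs, candidates, threshold):
    """Keep only the candidates whose CSM is above the threshold."""
    scores = {name: compensated_similarity(query_pfs, pfs)
              for name, pfs in candidates.items()}
    return {name: score for name, score in scores.items() if score > threshold}

# Toy data: two property feature sets of width 2 in the query.
query = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
candidates = {
    "a": np.ones((3, 2)),                        # three property feature sets
    "b": np.array([[-1.0, 0.0], [0.0, -1.0]]),   # two property feature sets
}

print(select_candidates(query, candidates, threshold=0.5))   # {'a': 2.0}
```

Because every query row is paired with every candidate row, reordering the rows of either array leaves the score unchanged in this sketch, which appears consistent with the position compensation described in claims 2 and 11. Raising or lowering `threshold` changes how many candidates are returned, as in claim 9.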

Assignees

GSI Technology Inc

Inventors

Classifications

  • Combinations of networks

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Convolutional networks [CNN, ConvNet]

  • Supervised learning

  • Learning methods

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021287762A1 cover?
A system for finding similar molecules to a query molecule includes a GCN, a PFS vector extractor, a compensated vector comparator (CVC) and a candidate vector selector. The GCN has been trained to output a molecular property vector from an input query or input candidate molecular vectors, respectively. The GCN transforms query atomic feature set (AFS) vectors and candidate AFS vectors into que…
Who is the assignee on this patent?
GSI Technology Inc
What technology area does this patent fall under?
Primary CPC classification G16C20/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 16, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).