Using machine learning to determine electronic document similarity

US2020125648A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2020125648-A1
Application number: US-201816168129-A
Country: US
Kind code: A1
Filing date: Oct 23, 2018
Priority date: Oct 23, 2018
Publication date: Apr 23, 2020
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for using machine learning to determine electronic document similarity include extracting entities and corresponding relationships from each of two electronic documents of a corpus of electronic documents based on word embedding, computing an entity distance between the extracted entities and a relationship distance between the extracted relationships based on knowledge graph embedding, combining the entity and relationship distances to generate a similarity score between the electronic documents, and implementing the similarity score to perform a task associated with the electronic documents.
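
The flow described in the abstract (extract entities and relationships, measure an entity distance and a relationship distance in an embedding space, then combine the two) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the toy two-dimensional vectors, the names `ENTITY_VECS`, `RELATION_VECS`, `centroid_distance`, and `combined_score`, and the use of a simple centroid distance as a stand-in for whatever distance an actual implementation would use. Extraction is assumed to have already happened, so each document is just a list of entity names and a list of relationship names. In this sketch a lower score means the documents are more alike.

```python
import math

# Hypothetical toy embeddings; a real system would learn these vectors.
ENTITY_VECS = {
    "IBM": (1.0, 0.0),
    "Watson": (0.9, 0.1),
    "Paris": (0.0, 1.0),
}
RELATION_VECS = {
    "develops": (1.0, 1.0),
    "builds": (0.9, 1.1),
    "located_in": (-1.0, 0.5),
}


def centroid(names, table):
    """Average the embedding vectors of the given names."""
    vecs = [table[name] for name in names]
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def centroid_distance(names_a, names_b, table):
    """Stand-in distance: Euclidean distance between the two centroids."""
    return math.dist(centroid(names_a, table), centroid(names_b, table))


def combined_score(doc_a, doc_b, entity_weight=0.5):
    """Weighted sum of the entity distance and the relationship distance."""
    d_ent = centroid_distance(doc_a["entities"], doc_b["entities"], ENTITY_VECS)
    d_rel = centroid_distance(doc_a["relations"], doc_b["relations"], RELATION_VECS)
    return entity_weight * d_ent + (1.0 - entity_weight) * d_rel


# Pre-extracted entities and relationships for three made-up documents.
doc_a = {"entities": ["IBM", "Watson"], "relations": ["develops"]}
doc_b = {"entities": ["IBM", "Watson"], "relations": ["builds"]}
doc_c = {"entities": ["Paris"], "relations": ["located_in"]}

print(combined_score(doc_a, doc_b))  # small: same entities, close relations
print(combined_score(doc_a, doc_c))  # larger: different entities and relations
```

The `entity_weight` parameter is a hypothetical knob that controls how much the entity distance counts relative to the relationship distance; the abstract itself does not specify how the two distances are combined.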

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for using machine learning to determine electronic document similarity, comprising: extracting entities and corresponding relationships from each of two electronic documents of a corpus of electronic documents based on word embedding; computing an entity distance between the extracted entities and a relationship distance between the extracted relationships based on knowledge graph embedding; combining the entity and relationship distances to generate a similarity score between the electronic documents; and implementing the similarity score to perform a task associated with the electronic documents.

2. The method of claim 1, further comprising training the word embedding and the knowledge graph embedding.

3. The method of claim 1, wherein extracting the entities and relationships further includes extracting the entities and relationships using a rule-based information extraction method.

4. The method of claim 1, wherein extracting the entities and relationships further includes extracting the entities and relationships using a deep learning method.

5. The method of claim 1, wherein the entity and relationship distances are computed based on Earth Mover's Distance.

6. The method of claim 1, wherein combining the entity and relationship distances to generate the similarity score further includes combining the entity and relationship distances to generate the similarity score as a weighted sum.

7. The method of claim 1, wherein implementing the similarity score further includes performing at least one action selected from the group consisting of: electronic document clustering to classify different types of electronic documents for quick review; a search for electronic documents based on the similarity score in response to receiving a search query; electronic document de-duplication based on the similarity score; and an electronic document answer provision based on the similarity score in response to receiving a question query associated with a question-answering system.

8. A system for using machine learning to determine electronic document similarity, comprising: a memory device for storing program code; and at least one processor device operatively coupled to the memory device and configured to execute program code stored on the memory device to: extract entities and corresponding relationships from each of two electronic documents of a corpus of electronic documents based on word embedding; compute an entity distance between the extracted entities and a relationship distance between the extracted relationships based on knowledge graph embedding; combine the entity and relationship distances to generate a similarity score between the electronic documents; and implement the similarity score to perform a task associated with the electronic documents.

9. The system of claim 8, wherein the at least one processor is further configured to execute program code stored on the memory device to train the word embedding and the knowledge graph embedding.

10. The system of claim 8, wherein the at least one processor is further configured to extract the entities and relationships further by extracting the entities and relationships using at least one of a rule-based information extraction method and a deep learning method.

11. The system of claim 8, wherein the entity and relationship distances are computed based on Earth Mover's Distance.

12. The system of claim 8, wherein the at least one processor is further configured to combine the entity and relationship distances to generate the similarity score by combining the entity and relationship distances to generate the similarity score as a weighted sum.

13. The system of claim 8, wherein the at least one processor is further configured to implement the similarity score by performing at least one action selected from the group consisting of: electronic document clustering to classify different types of electronic documents for quick review; a search for electronic documents based on the similarity score in response to receiving a search query; electronic document de-duplication based on the similarity score; and an electronic document answer provision based on the similarity score in response to receipt of a question query associated with a question-answering system.

14. A computer program product comprising a non-transitory computer readable storage medium having program code embodied therewith, the program code executable by a computer to cause the computer to perform a method for using machine learning to determine electronic document similarity, the method performed by the computer comprising: extracting entities and corresponding relationships from each of two electronic documents of a corpus of electronic documents based on word embedding; computing an entity distance between the extracted entities and a relationship distance between the extracted relationships based on knowledge graph embedding; combining the entity and relationship distances to generate a similarity score between the electronic documents; and implementing the similarity score to perform a task associated with the electronic documents.

15. The computer program product of claim 14, wherein the method further comprises training the word embedding and the knowledge graph embedding.

16. The computer program product of claim 14, wherein extracting the entities and relationships further includes extracting the entities and relationships using a rule-based information extraction method.

17. The computer program product of claim 14, wherein extracting the entities and relationships further includes extracting the entities and relationships using a deep learning method.

18. The computer program product of claim 14, wherein the entity and relationship distances are computed based on Earth Mover's Distance.

19. The computer program product of claim 14, wherein combining the entity and relationship distances to generate the similarity score further includes combining the entity and relationship distances to generate the similarity score as a weighted sum.

20. The computer program product of claim 14, wherein implementing the similarity score further includes performing at least one action selected from the group consisting of: electronic document clustering to classify different types of electronic documents for quick review; a search for electronic documents based on the similarity score in response to receiving a search query; electronic document de-duplication based on the similarity score; and an electronic document answer provision based on the similarity score in response to receiving a question query associated with a question-answering system.
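
Claims 5, 11, and 18 name Earth Mover's Distance as the basis for the entity and relationship distances. As a rough illustration of that measure only, the sketch below computes it for a deliberately restricted case: two point sets of equal size in which every point carries the same mass. Under that assumption the optimal transport plan is a one-to-one matching, so a brute-force search over permutations gives the exact value. The function name `emd_uniform` and the sample points are hypothetical, the claims do not say how the distance is computed in practice, and the factorial-time search is only workable for very small sets; a practical version would typically use a linear-programming or assignment solver.

```python
import itertools
import math


def emd_uniform(points_a, points_b):
    """Earth Mover's Distance for two equal-size point sets with uniform mass.

    Each point carries mass 1/n, so the cheapest transport plan is a
    one-to-one matching; try every matching and keep the cheapest.
    """
    if not points_a or len(points_a) != len(points_b):
        raise ValueError("this sketch needs two non-empty sets of equal size")
    n = len(points_a)
    best_total = min(
        sum(math.dist(points_a[i], points_b[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n  # average distance moved per unit of mass


# Made-up embedding vectors standing in for two documents' entities.
entities_doc_1 = [(0.0, 0.0), (2.0, 0.0)]
entities_doc_2 = [(0.0, 1.0), (2.0, 1.0)]

print(emd_uniform(entities_doc_1, entities_doc_2))
```

The same function could be applied separately to entity vectors and to relationship vectors, yielding the two distances that the claims then combine.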

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2020125648A1 cover?
Methods and systems for using machine learning to determine electronic document similarity include extracting entities and corresponding relationships from each of two electronic documents of a corpus of electronic documents based on word embedding, computing an entity distance between the extracted entities and a relationship distance between the extracted relationships based on knowledge grap…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/24578. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 23, 2020 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).