Implicit relation induction via purposeful overfitting of a word embedding model on a subset of a document corpus

US2019294695A1 · US · A1

Patent metadata

Publication number: US-2019294695-A1
Application number: US-201815928310-A
Country: US
Kind code: A1
Filing date: Mar 22, 2018
Priority date: Mar 22, 2018
Publication date: Sep 26, 2019
Grant date:

Abstract

Official abstract text for this publication.

A method overfits a word vector generating process to identify implicit relationships between two or more terms in a corpus. A server identifies instances of multiple user-generated pairs of terms in an original corpus of documents, in which the terms are labeled but a relationship between two or more of the corpus terms is not identified. The server then extracts sentences, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms, and combines the sentences into a training corpus, which is used to purposely overfit a word embedding model. This word embedding model leads to a vector that is used to identify other terms that have a same type of relationship as that found in the multiple user-generated pairs of terms, such that a search corpus of documents can be searched for terms similar to those that trained the word embedding model.
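
The corpus-building step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the helper names (`split_sentences`, `tokenize`, `build_training_corpus`), the sample documents, and the sample term pairs are all hypothetical, terms are assumed to be single words, and "contains a pair" is read here as "mentions both terms of at least one pair". The resulting sentences are what one would then hand to a word embedding trainer, which is not shown.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase alphanumeric tokens; single-word terms only (assumption)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def build_training_corpus(documents, term_pairs):
    """Keep only sentences that mention both terms of at least one pair."""
    corpus = []
    for doc in documents:
        for sentence in split_sentences(doc):
            tokens = set(tokenize(sentence))
            if any(a.lower() in tokens and b.lower() in tokens
                   for a, b in term_pairs):
                corpus.append(sentence)
    return corpus

# Hypothetical original corpus and user-generated term pairs.
documents = [
    "Aspirin treats headache. The weather was mild. Insulin treats diabetes!",
    "Headache is common. Doctors prescribe insulin for diabetes.",
]
term_pairs = [("aspirin", "headache"), ("insulin", "diabetes")]

training_corpus = build_training_corpus(documents, term_pairs)
for sentence in training_corpus:
    print(sentence)
```

Because the training corpus keeps only sentences in which the paired terms co-occur, a model trained on it sees the relationship of interest far more densely than it would in the original corpus, which is the sense in which the abstract speaks of purposeful overfitting.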

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    receiving, by a server, multiple user-generated pairs of terms, wherein each of the multiple user-generated pairs of terms comprises a first term and a second term;
    identifying, by the server, instances of the multiple user-generated pairs of terms as corpus terms in an original corpus of documents, wherein the corpus terms in the original corpus of documents are labeled with labels that describe the corpus terms without identifying a relationship between two or more of the corpus terms;
    extracting, by the server, passages, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms;
    combining, by the server, the passages into a training corpus;
    training, by the server, a word vector generating process using the training corpus, wherein the word vector generating process generates a numerical vector for each term from the training corpus;
    determining, by the server, a vector difference between a first vector for the first term and a second vector for the second term;
    establishing, by the server, a relationship between the first term and the second term based on the vector difference between the first vector and the second vector;
    receiving, by the server, a request from a client computer for a document from a search corpus having a third term that has a same relationship to a fourth term as the relationship between the first term and the second term;
    reiteratively subtracting, by the server, one word vector from another word vector in the search corpus until a vector difference between two word vectors is within a predefined distance of the vector difference between the first vector and the second vector; and
    transmitting, from the server to the client computer, the document from the search corpus that contains two word vectors whose vector difference is within the predefined distance.

2. The method of claim 1, wherein the search corpus is the original corpus of documents.

3. The method of claim 1, wherein the search corpus is the training corpus.

4. The method of claim 1, wherein the search corpus is a group of documents that includes news releases, sales brochures, and academic papers that are not part of the original corpus of documents.

5. The method of claim 1, wherein the search corpus is a group of documents that includes news releases, sales brochures, and academic papers that are not part of the training corpus.

6. The method of claim 1, wherein the server and the client computer are connected by a network, and wherein the method further comprises:
    reducing bandwidth consumed by the network by transmitting, from the server to the client computer, only documents that contain term pairs whose vector differences are within the predefined distance of the vector difference between the first vector and the second vector.

7. The method of claim 1, wherein the server is a multi-processor computer that comprises a first processor that receives the multiple user-generated pairs of terms, a second processor that extracts the passages from the original corpus of documents that contain one or more of the multiple user-generated pairs of terms, and a third processor that trains the word vector generating process, wherein the method further comprises:
    storing, in a first cache in the first processor, only the multiple user-generated pairs of terms;
    storing, in a second cache in the second processor, only the passages from the original corpus of documents that contain one or more of the multiple user-generated pairs of terms; and
    storing, in a third cache in the third processor, only a first vector for the first term and a second vector for the second term for use in training the word vector generating process, wherein specifying what is only stored in the first cache, the second cache, and the third cache increase processing speed in the server.

8. A computer program product comprising a non-transitory computer readable storage device having program instructions embodied therewith, the program instructions readable and executable by a computer to perform a method comprising:
    receiving, by a server, multiple user-generated pairs of terms, wherein each of the multiple user-generated pairs of terms comprises a first term and a second term;
    identifying, by the server, instances of the multiple user-generated pairs of terms as corpus terms in an original corpus of documents, wherein the corpus terms in the original corpus of documents are labeled with labels that describe the corpus terms without identifying a relationship between two or more of the corpus terms;
    extracting, by the server, sentences, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms;
    combining, by the server, the sentences into a training corpus;
    training, by the server, a word vector generating process using the training corpus, wherein the word vector generating process generates a numerical vector for each term from the user-generated pairs of terms such that each first term A from the user-generated pairs of terms is assigned a version of a first numerical vector $\vec{A}$ and each second term B from the user-generated pairs of terms is assigned a version of a second numerical vector $\vec{B}$;
    retrieving, by the server, a third term D from the original corpus of documents, wherein the third term D has a same label as the second term B;
    generating, by the server implementing the word vector generating process, a third numerical vector $\vec{D}$ for the third term D;
    determining, by the server, a search numerical vector $\vec{C'}$ based on $\vec{A} - \vec{B} + \vec{D} = \vec{C'}$;
    determining, by the server and based on $\vec{A} - \vec{B} = \vec{C'} - \vec{D}$, that $\vec{C'}$ and $\vec{D}$ have a same vector relationship as $\vec{A}$ and $\vec{B}$;
    in response to determining that $\vec{C'}$ and $\vec{D}$ have a same vector relationship as $\vec{A}$ and $\vec{B}$, determining, by the server, that a fourth term C and the third term D have a same relationship as a relationship between the first term A and the second term B;
    receiving, by the server, a request for documents that contain terms that identify an entity that has a relationship to another term that matches the relationship between the first term A and the second term B;
    comparing, by the server, previously generated term vectors for terms from a search corpus of documents to $\vec{C'}$;
    identifying, by the server, the previously generated term vectors that are within a predetermined vector distance of $\vec{C'}$;
    identifying, by the server, terms whose term vectors are within the predetermined vector distance of $\vec{C'}$, wherein the terms are from the search corpus;
    receiving, by the server, a request from a client computer for documents from the search corpus that contain terms that describe an entity that has the relationship to another term that matches the relationship between the first term A and the second term B;
    in response to receiving the request, retrieving, by the server, documents from the search corpus that contain at least one term whose assigned term vector is within the predetermined vector distance of $\vec{C'}$; and
    transmitting, from the server to the client computer, the documents from the search corpus that contain the at least one term whose assigned term vector is within the predetermined vector di
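
The vector arithmetic recited in claims 1 and 8 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two-dimensional vectors are hand-written rather than produced by a trained model, the function names are hypothetical, and Euclidean distance stands in for the "predefined distance", which the claims do not tie to a particular metric. Unlike the claim 1 wording, which subtracts vectors until one match is found, the first function collects every matching pair.

```python
import math
from itertools import permutations

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dist(u, v):
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term vectors; a real system would take these from the
# trained word vector generating process.
vectors = {
    "aspirin": [1.0, 3.0],
    "headache": [1.0, 1.0],
    "insulin": [4.0, 3.1],
    "diabetes": [4.0, 1.0],
    "weather": [9.0, 9.0],
}

# Relationship vector A - B for the user-supplied pair (aspirin, headache).
relation = sub(vectors["aspirin"], vectors["headache"])

def pairs_with_relation(vectors, relation, max_distance):
    """Claim 1 style: ordered term pairs whose difference is near A - B."""
    return [(x, y) for x, y in permutations(vectors, 2)
            if dist(sub(vectors[x], vectors[y]), relation) <= max_distance]

def terms_near_analogy(vectors, a, b, d, max_distance):
    """Claim 8 style: compute C' = A - B + D, then list terms near C'."""
    c_prime = add(sub(vectors[a], vectors[b]), vectors[d])
    return [t for t, v in vectors.items()
            if t not in (a, b, d) and dist(v, c_prime) <= max_distance]

print(pairs_with_relation(vectors, relation, 0.5))
print(terms_near_analogy(vectors, "aspirin", "headache", "diabetes", 0.5))
```

In this toy data the pair (insulin, diabetes) differs by nearly the same vector as (aspirin, headache), so both searches surface it while the unrelated term is excluded. A retrieval step would then return only documents containing the matched terms.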

Assignees

IBM

Inventors

Classifications

  • Learning methods

  • Recurrent networks, e.g. Hopfield networks

  • Combinations of networks

  • Handling natural language data (speech analysis or synthesis, speech recognition G10L)

  • Annotation, e.g. comment data or footnotes

Frequently asked questions

Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 26, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.