US2019294695A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2019294695-A1 |
| Application number | US-201815928310-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 22, 2018 |
| Priority date | Mar 22, 2018 |
| Publication date | Sep 26, 2019 |
| Grant date | — |
Official abstract text for this publication.
A method overfits a word vector generating process to identify implicit relationships between two or more terms in a corpus. A server identifies instances of multiple user-generated pairs of terms in an original corpus of documents, in which the terms are labeled but a relationship between two or more of the corpus terms is not identified. The server then extracts sentences, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms, and combines the sentences into a training corpus, which is used to purposely overfit a word embedding model. This word embedding model leads to a vector that is used to identify other terms that have the same type of relationship as that found in the multiple user-generated pairs of terms, such that a search corpus of documents can be searched for terms similar to those that trained the word embedding model.
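The corpus-building step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `split_sentences` and `build_training_corpus`, the example company names, the naive punctuation-based sentence splitter, and the reading that a sentence qualifies when it mentions both terms of at least one pair. The overfitting of the word embedding model itself is not shown.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_training_corpus(documents, term_pairs):
    """Keep every sentence that mentions both terms of at least one pair."""
    corpus = []
    for doc in documents:
        for sentence in split_sentences(doc):
            lowered = sentence.lower()
            # Case-insensitive substring match; a real system would tokenize.
            if any(a.lower() in lowered and b.lower() in lowered
                   for a, b in term_pairs):
                corpus.append(sentence)
    return corpus

# Hypothetical original corpus and user-generated pairs.
documents = [
    "Acme acquired Widgetco last year. The weather was mild.",
    "Globex bought Initech in March. Acme also opened a new office.",
]
term_pairs = [("Acme", "Widgetco"), ("Globex", "Initech")]

training_corpus = build_training_corpus(documents, term_pairs)
print(training_corpus)
```

Only the two sentences that contain a complete pair survive, which is what makes the resulting corpus small and narrowly focused on the relationship of interest.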
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, by a server, multiple user-generated pairs of terms, wherein each of the multiple user-generated pairs of terms comprises a first term and a second term; identifying, by the server, instances of the multiple user-generated pairs of terms as corpus terms in an original corpus of documents, wherein the corpus terms in the original corpus of documents are labeled with labels that describe the corpus terms without identifying a relationship between two or more of the corpus terms; extracting, by the server, passages, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms; combining, by the server, the passages into a training corpus; training, by the server, a word vector generating process using the training corpus, wherein the word vector generating process generates a numerical vector for each term from the training corpus; determining, by the server, a vector difference between a first vector for the first term and a second vector for the second term; establishing, by the server, a relationship between the first term and the second term based on the vector difference between the first vector and the second vector; receiving, by the server, a request from a client computer for a document from a search corpus having a third term that has a same relationship to a fourth term as the relationship between the first term and the second term; reiteratively subtracting, by the server, one word vector from another word vector in the search corpus until a vector difference between two word vectors is within a predefined distance of the vector difference between the first vector and the second vector; and transmitting, from the server to the client computer, the document from the search corpus that contains two word vectors whose vector difference is within the predefined distance.
2. The method of claim 1, wherein the search corpus is the original corpus of documents.
3. The method of claim 1, wherein the search corpus is the training corpus.
4. The method of claim 1, wherein the search corpus is a group of documents that includes news releases, sales brochures, and academic papers that are not part of the original corpus of documents.
5. The method of claim 1, wherein the search corpus is a group of documents that includes news releases, sales brochures, and academic papers that are not part of the training corpus.
6. The method of claim 1, wherein the server and the client computer are connected by a network, and wherein the method further comprises: reducing bandwidth consumed by the network by transmitting, from the server to the client computer, only documents that contain term pairs whose vector differences are within the predefined distance of the vector difference between the first vector and the second vector.
7. The method of claim 1, wherein the server is a multi-processor computer that comprises a first processor that receives the multiple user-generated pairs of terms, a second processor that extracts the passages from the original corpus of documents that contain one or more of the multiple user-generated pairs of terms, and a third processor that trains the word vector generating process, wherein the method further comprises: storing, in a first cache in the first processor, only the multiple user-generated pairs of terms; storing, in a second cache in the second processor, only the passages from the original corpus of documents that contain one or more of the multiple user-generated pairs of terms; and storing, in a third cache in the third processor, only a first vector for the first term and a second vector for the second term for use in training the word vector generating process, wherein specifying what is only stored in the first cache, the second cache, and the third cache increase processing speed in the server.
8. A computer program product comprising a non-transitory computer readable storage device having program instructions embodied therewith, the program instructions readable and executable by a computer to perform a method comprising: receiving, by a server, multiple user-generated pairs of terms, wherein each of the multiple user-generated pairs of terms comprises a first term and a second term; identifying, by the server, instances of the multiple user-generated pairs of terms as corpus terms in an original corpus of documents, wherein the corpus terms in the original corpus of documents are labeled with labels that describe the corpus terms without identifying a relationship between two or more of the corpus terms; extracting, by the server, sentences, from the original corpus of documents, that contain one or more of the multiple user-generated pairs of terms; combining, by the server, the sentences into a training corpus; training, by the server, a word vector generating process using the training corpus, wherein the word vector generating process generates a numerical vector for each term from the user-generated pairs of terms such that each first term A from the user-generated pairs of terms is assigned a version of a first numerical vector $\vec{A}$ and each second term B from the user-generated pairs of terms is assigned a version of a second numerical vector $\vec{B}$; retrieving, by the server, a third term D from the original corpus of documents, wherein the third term D has a same label as the second term B; generating, by the server implementing the word vector generating process, a third numerical vector $\vec{D}$ for the third term D; determining, by the server, a search numerical vector $\vec{C'}$ based on $\vec{A} - \vec{B} + \vec{D} = \vec{C'}$; determining, by the server and based on $\vec{A} - \vec{B} = \vec{C'} - \vec{D}$, that $\vec{C'}$ and $\vec{D}$ have a same vector relationship as $\vec{A}$ and $\vec{B}$; in response to determining that $\vec{C'}$ and $\vec{D}$ have a same vector relationship as $\vec{A}$ and $\vec{B}$, determining, by the server, that a fourth term C and the third term D have a same relationship as a relationship between the first term A and the second term B; receiving, by the server, a request for documents that contain terms that identify an entity that has a relationship to another term that matches the relationship between the first term A and the second term B; comparing, by the server, previously generated term vectors for terms from a search corpus of documents to $\vec{C'}$; identifying, by the server, the previously generated term vectors that are within a predetermined vector distance of $\vec{C'}$; identifying, by the server, terms whose term vectors are within the predetermined vector distance of $\vec{C'}$, wherein the terms are from the search corpus; receiving, by the server, a request from a client computer for documents from the search corpus that contain terms that describe an entity that has the relationship to another term that matches the relationship between the first term A and the second term B; in response to receiving the request, retrieving, by the server, documents from the search corpus that contain at least one term whose assigned term vector is within the predetermined vector distance of $\vec{C'}$; and transmitting, from the server to the client computer, the documents from the search corpus that contain the at least one term whose assigned term vector is within the predetermined vector di
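The vector arithmetic recited in claims 1 and 8 can be illustrated with a small sketch. It is not the patented implementation: the two-dimensional vectors, the company names, the helper names (`subtract`, `analogy_vector`, `terms_within`, `pairs_with_similar_offset`), and the use of Euclidean distance are all assumptions, since the claims refer only to a predefined or predetermined distance without naming a metric.

```python
import math
from itertools import permutations

def subtract(u, v):
    """Component-wise difference u - v."""
    return [ui - vi for ui, vi in zip(u, v)]

def euclidean(u, v):
    """Euclidean distance (an assumed metric; the claims do not name one)."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def analogy_vector(a, b, d):
    """Claim 8 style search vector: C' = A - B + D."""
    return [ai - bi + di for ai, bi, di in zip(a, b, d)]

def terms_within(target, vectors, max_distance, exclude=()):
    """Terms whose vectors lie within max_distance of target, nearest first."""
    hits = [(euclidean(vec, target), term)
            for term, vec in vectors.items() if term not in exclude]
    return [term for dist, term in sorted(hits) if dist <= max_distance]

def pairs_with_similar_offset(vectors, offset, max_distance):
    """Claim 1 style search: ordered pairs (x, y) whose difference x - y is near offset."""
    return [(x, y) for x, y in permutations(vectors, 2)
            if euclidean(subtract(vectors[x], vectors[y]), offset) <= max_distance]

# Hypothetical toy embeddings; real vectors would come from the trained model.
vectors = {
    "Acme": [2.0, 1.0],      # first term A
    "Widgetco": [1.0, 0.0],  # second term B
    "Initech": [4.0, 3.0],   # third term D (same label as B)
    "Globex": [5.0, 4.1],
    "Umbrella": [9.0, 0.0],
}

c_prime = analogy_vector(vectors["Acme"], vectors["Widgetco"], vectors["Initech"])
matches = terms_within(c_prime, vectors, max_distance=0.5,
                       exclude={"Acme", "Widgetco", "Initech"})

offset = subtract(vectors["Acme"], vectors["Widgetco"])
similar_pairs = pairs_with_similar_offset(vectors, offset, max_distance=0.5)

print(c_prime, matches, similar_pairs)
```

In this toy data, "Globex" is the only remaining term near the search vector, and the pair ("Globex", "Initech") is the only pair besides the seed pair whose offset is close to that of ("Acme", "Widgetco"). The exhaustive pairwise scan grows quadratically with vocabulary size, so a practical system would likely need an index or candidate filtering.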
Learning methods · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Combinations of networks · CPC title
Handling natural language data (speech analysis or synthesis, speech recognition G10L) · CPC title
Annotation, e.g. comment data or footnotes · CPC title