Semantic graphing of heterogeneous documents for automated decision making and resource allocation using reinforcement learning
US-2023244990-A1 · Aug 3, 2023 · US
US11989506B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11989506-B2 |
| Application number | US-202217874855-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 27, 2022 |
| Priority date | Jul 27, 2022 |
| Publication date | May 21, 2024 |
| Grant date | May 21, 2024 |
Official abstract text for this publication.
Systems and methods of the present disclosure enable database search. The systems and/or methods may include receiving a search query that includes an input document having text. Word embeddings are generated within the input document, where the word embeddings include vector representations of words in the text of the input document. An average input document word embedding vector is determined for the word embeddings of the input document. A set of stored documents is accessed, where each stored document includes a stored text and has a particular average stored document word embedding vector. A similarity model is used to determine a similarity metric measuring the similarity between the input document and each stored document based on the average input document word embedding vector and the particular average stored document word embedding vector of each stored document.
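The search flow in the abstract can be illustrated with a minimal sketch: average the word vectors of each document, then score the query against every stored document. The sketch assumes cosine similarity as the similarity model (one option named in claim 2). The embedding table, the function names, and the document IDs are all hypothetical and are not taken from the patent.

```python
import math

# Hypothetical toy embedding table; a real system would load a trained
# word-vector model instead.
TOY_EMBEDDINGS = {
    "access": [1.0, 0.0],
    "control": [0.9, 0.1],
    "audit": [0.8, 0.2],
    "lunch": [0.0, 1.0],
    "menu": [0.1, 0.9],
}


def average_embedding(text, embeddings):
    """Element-wise mean of the embeddings of the known words in `text`."""
    tokens = [t for t in text.lower().split() if t in embeddings]
    if not tokens:
        raise ValueError("no known words in text")
    dim = len(embeddings[tokens[0]])
    return [sum(embeddings[t][i] for t in tokens) / len(tokens)
            for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_text, stored_vectors, embeddings):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    query_vec = average_embedding(query_text, embeddings)
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in stored_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stored documents are represented only by their precomputed average vectors.
stored = {
    "control-doc": average_embedding("access control audit", TOY_EMBEDDINGS),
    "cafeteria-doc": average_embedding("lunch menu", TOY_EMBEDDINGS),
}
ranking = rank_documents("audit access", stored, TOY_EMBEDDINGS)
print(ranking)
```

Because each stored document is reduced to a single precomputed vector, a query costs one averaging pass over the input text plus one similarity computation per stored document.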
Opening claim text (preview).
What is claimed is:
1. A method comprising: accessing, by at least one processor, a training set of stored documents; wherein the training set of stored documents comprise: at least one existing pair of stored documents representing at least one pair of stored documents that are similar to each other, and at least one non-existing pair of stored documents representing at least one pair of stored documents that are not similar to each other; generating, by the at least one processor, a plurality of initial stored document word embeddings within each stored document of the set of stored documents; wherein the plurality of initial stored document word embeddings comprise a plurality of stored document vector representations of a plurality of words in text of each stored document; determining, by the at least one processor, an average stored document word embedding vector for the plurality of initial stored document word embeddings for each stored document; utilizing, by the at least one processor, a similarity model to determine a similarity metric of a similarity between a first stored document and a second stored document of each candidate pair of a plurality of candidate pairs of stored documents in the set of stored documents based at least in part on the average stored document word embedding vector of each of the first stored document and the second stored document; generating, by the at least one processor, a plurality of refined stored document word embeddings for each stored document in the set of stored documents by backpropagating an error of the similarity metric of each candidate pair, wherein the error is based at least in part on the at least one existing pair and the at least one non-existing pair; generating, by the at least one processor, a refined average stored document word embedding vector for the plurality of refined stored document word embeddings for each stored document; receiving, by the at least one processor, a search query from a computing device associated with a user; wherein the search query comprises an input document having text; generating, by the at least one processor, a plurality of input document word embeddings within the input document; wherein the plurality of input document word embeddings comprise a plurality of vector representations of a plurality of words in the text of the input document; determining, by the at least one processor, an average input document word embedding vector for the plurality of input document word embeddings for the input document; utilizing, by the at least one processor, the similarity model to determine an input document similarity metric of an input document similarity between the input document and each stored document in the set of stored documents based at least in part on the average input document word embedding vector and the refined average stored document word embedding vector of each stored document; and instructing, by the at least one processor, the computing device to display a ranked list of stored documents in response to the search query.
2. The method of claim 1, wherein the similarity model comprises a cosine similarity determination.
3. The method of claim 1, further comprising: utilizing, by the at least one processor, a word vectorization model to generate the plurality of input document word embeddings for the input document; receiving, by the at least one processor, a user selection confirming or denying the similarity metric of at least one stored document in the ranked list of stored documents; determining, by the at least one processor, a similarity error based at least in part on a difference according to an optimization function between: i) the user selection confirming or denying the similarity metric of the at least one stored document in the ranked list of stored documents, and ii) a ranked position of the at least one stored document within the ranked list of the stored documents; and training, by the at least one processor, parameters of the word vectorization model based at least in part on the similarity error.
4. The method of claim 1, further comprising: receiving, by the at least one processor, a user selection confirming or denying the similarity metric of at least one stored document in the ranked list of stored documents; determining, by the at least one processor, a similarity error based at least in part on a difference according to an optimization function between: i) the user selection confirming or denying the similarity metric of the at least one stored document in the ranked list of stored documents, and ii) a ranked position of the at least one stored document within the ranked list of the stored documents; and training, by the at least one processor, parameters of the similarity model based at least in part on the similarity error.
5. The method of claim 1, wherein the similarity model comprises an optimization objective to maximize the similarity metric between the input document and the set of stored documents.
6. The method of claim 5, wherein the similarity model comprises at least one clustering model.
7. The method of claim 1, further comprising: generating, by the at least one processor, a k-d tree of the set of stored documents; and determining, by the at least one processor, the ranked list of stored documents by using the similarity model to traverse the k-d tree.
8. The method of claim 1, further comprising: receiving, by at least one processor, a new document having new text; generating, by the at least one processor, a plurality of new word embeddings for the new document; determining, by the at least one processor, a new average word embedding vector of the plurality of new word embeddings for the new document; and storing, by the at least one processor, the new document in the set of stored documents; wherein storing the new document in the set of stored documents comprises adding the new average word embedding vector to a cache of the stored average word embedding associated with the stored text of each stored document.
9. The method of claim 1, wherein the average of the plurality of input document word embeddings comprises a weighted average based at least in part on a section of the text in which each word is located.
10. The method of claim 1, further comprising: generating, by the at least one processor, a similarity alert based at least in part on the similarity metric of the input document to at least one stored document in the set of stored documents exceeding a predetermined similarity threshold; and causing, by the at least one processor, the computing device to produce the similarity alert to the user to alert the user of the at least one stored document.
11. The method of claim 1, wherein the input document comprises a regulatory requirement document and the set of stored documents comprises a set of business controls documents.
12. The method of claim 1, further comprising instructing at least one activity execution device, by the at least one processor, to execute at least one activity associated with the input document according to a highest ranked stored document in the ranked list of stored documents.
13. A system comprising: at least one processor configured to execute software instructions that cause the at least one processor to perform steps to: access a training set of stored documents; wherein the training set of stored documents comprise: at least one existing pair of stored documents representing at least one pair of stored documents that are similar to each other, and at least one non-existing pair of stored documents represe
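The refinement step recited in claim 1 (backpropagating a similarity error into the word embeddings) might look roughly like the sketch below. It rests on several assumptions the claims do not state: the similarity model is cosine similarity, the error is a squared difference from a target of 1.0 for existing pairs and 0.0 for non-existing pairs, and the update is plain gradient descent. All names and toy values are hypothetical.

```python
import math


def mean_vector(words, emb):
    """Average word embedding vector of one document (a list of words)."""
    dim = len(emb[words[0]])
    return [sum(emb[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def cosine_grad(a, b):
    """Gradient of cosine(a, b) with respect to a."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    c = cosine(a, b)
    return [y / (norm_a * norm_b) - c * x / (norm_a * norm_a)
            for x, y in zip(a, b)]


def total_error(docs, emb, pairs):
    """Sum of squared differences between pair similarity and its target."""
    return sum((cosine(mean_vector(docs[i], emb),
                       mean_vector(docs[j], emb)) - target) ** 2
               for i, j, target in pairs)


def refine_embeddings(docs, emb, pairs, lr=0.1, epochs=200):
    """Return refined copies of the embeddings; the input dict is untouched."""
    emb = {w: list(v) for w, v in emb.items()}
    for _ in range(epochs):
        for i, j, target in pairs:
            a, b = mean_vector(docs[i], emb), mean_vector(docs[j], emb)
            err = cosine(a, b) - target
            grad_a, grad_b = cosine_grad(a, b), cosine_grad(b, a)
            # Chain rule through the mean: each word occurrence receives
            # 1/len(doc) of the gradient of its document's average vector.
            for words, grad in ((docs[i], grad_a), (docs[j], grad_b)):
                for w in words:
                    emb[w] = [x - lr * 2 * err * g / len(words)
                              for x, g in zip(emb[w], grad)]
    return emb


# Toy data: (A, B) is an existing (similar) pair, (A, C) a non-existing pair.
initial = {
    "risk": [1.0, 0.2, 0.0], "audit": [0.8, 0.4, 0.1],
    "control": [0.9, 0.1, 0.3], "lunch": [0.7, 0.5, 0.2],
    "menu": [0.6, 0.3, 0.4],
}
docs = {"A": ["risk", "audit"], "B": ["audit", "control"],
        "C": ["lunch", "menu"]}
pairs = [("A", "B", 1.0), ("A", "C", 0.0)]

refined = refine_embeddings(docs, initial, pairs)
print(total_error(docs, initial, pairs), total_error(docs, refined, pairs))
```

After refinement, the refined average vector of each stored document would be recomputed from the updated embeddings, which is what the search step of claim 1 compares against.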
Calculation of difference between files · CPC title
using vector based model · CPC title
Presentation of query results · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title