Machine learning techniques for word-based text similarity determinations

US11941357B2 · US · B2

Patent metadata
Publication number: US-11941357-B2
Application number: US-202117355731-A
Country: US
Kind code: B2
Filing date: Jun 23, 2021
Priority date: Jun 23, 2021
Publication date: Mar 26, 2024
Grant date: Mar 26, 2024

Abstract

Official abstract text for this publication.

Various embodiments of the present invention provide methods, apparatus, systems, computing devices, computing entities, and/or the like for performing text similarity determination. Certain embodiments of the present invention utilize systems, methods, and computer program products that perform text similarity determination by using at least one of Word Mover's Similarity measures, Relaxed Word Mover's Similarity measures, and Related Relaxed Word Mover's Similarity measures.
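
The abstract names the similarity measures without defining them in this excerpt. As a rough illustration only, the following Python sketch computes one common relaxation of a word-mover-style similarity: each reference word sends all of its normalized term-frequency weight to its most similar target word. The function names, the toy two-dimensional embeddings, and the choice of cosine similarity are assumptions made for this example, not details taken from the patent.

```python
import math
from collections import Counter

# Toy embeddings, invented for illustration only.
EMB = {
    "heart": [1.0, 0.0],
    "cardiac": [0.9, 0.1],
    "attack": [0.0, 1.0],
    "arrest": [0.1, 0.9],
    "invoice": [-1.0, 0.0],
}


def word_weights(words):
    """Normalized term frequency: a word's count divided by the total count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relaxed_similarity(reference, target, embeddings):
    """Relaxed score: only the reference-side weight constraint is kept.

    With that single constraint, the maximizing flow moves each reference
    word's whole weight to its most similar target word. Assumes both texts
    are non-empty and every word has an embedding.
    """
    target_vocab = set(target)
    score = 0.0
    for ref_word, weight in word_weights(reference).items():
        best = max(cosine(embeddings[ref_word], embeddings[t]) for t in target_vocab)
        score += weight * best
    return score


same = relaxed_similarity(["heart", "attack"], ["heart", "attack"], EMB)
close = relaxed_similarity(["heart", "attack"], ["cardiac", "arrest"], EMB)
far = relaxed_similarity(["heart", "attack"], ["invoice"], EMB)
```

Because one constraint is dropped, this relaxed value can only be greater than or equal to the fully constrained maximum, which is why relaxations of this kind are typically used as a cheap bound or pre-filter.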

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: generating, using one or more processors, a maximal word similarity score for a reference text data object and a target text data object, wherein: (i) the maximal word similarity score describes a maximal value of a transition cost value indicative of a measure of cost to transform a first embedded representation associated with one or more target words of the target text data object into a second embedded representation associated with one or more reference words of the reference text data object, and (ii) the transition cost value is determined based at least in part on, for each word pair comprising a reference word and a target word, a word-wise flow value and, a word-wise similarity value; generating, using the one or more processors, a predicted similarity score for the reference text data object and the target text data object based at least in part on the maximal word similarity score; and initiating, using the one or more processors, the performance of one or more prediction-based actions based at least in part on the predicted similarity score.

2. The computer-implemented method of claim 1, wherein maximizing the transition cost value is performed in accordance with a maximization constraint requiring that a sum of each word-wise flow value for a particular reference word of the one or more reference words is equal to a document-wide word weight value for the particular reference word in the reference text data object.

3. The computer-implemented method of claim 2, wherein the document-wide word weight value is determined based at least in part on: (i) a term frequency value of the particular reference word in the reference text data object, and (ii) a sum of each term frequency value for the one or more reference words in the reference text data object.

4. The computer-implemented method of claim 1, wherein maximizing the transition cost value is performed in accordance with a maximization constraint requiring that a sum of each word-wise flow value for a particular target word of the one or more target words is equal to a document-wide word weight value for the particular target word in the target text data object.

5. The computer-implemented method of claim 4, wherein the document-wide word weight value is determined based at least in part on: (i) a term frequency value of the particular target word in the target text data object, and (ii) a sum of each term frequency value for the one or more target words in the target text data object.

6. The computer-implemented method of claim 1, wherein: the target text data object is selected from a plurality of candidate target text data objects, and the computer-implemented method comprises: generating, using the one or more processors and for each candidate target text data object of the plurality of candidate target text data objects other than the target text data object, a candidate maximal word similarity score; and generating, using the one or more processors, a ranked similarity list based at least in part on the maximal word similarity score and each candidate maximal word similarity score.

7. The computer-implemented method of claim 6, wherein maximizing the transition cost value is performed in accordance with a maximization constraint requiring that a sum of each word-wise flow value for a particular target word of the one or more target words is equal to a document-wide word weight value for the particular target word in the target text data object.

8. The computer-implemented method of claim 7, wherein the document-wide word weight value is determined based at least in part on: (i) a term frequency value of the particular target word in the target text data object, and (ii) a sum of each term frequency value for the one or more target words in the target text data object.

9. The computer-implemented method of claim 1, wherein: the target text data object is selected from a plurality of candidate target text data objects, the plurality of candidate target text data objects are associated with a graph hierarchical structure, and generating the predicted similarity score comprises: generating a raw predicted similarity score for the target text data object based at least in part on the maximal word similarity score, traversing the graph hierarchical structure in accordance with a set of breadth first search iterations to identify to determine one or more sibling relationships for the target text data object, wherein each sibling relationship is associated with a second target text data object of the plurality of candidate target text data objects, and assigning a zero-valued predicted similarity score to the target text data object if at least one of the one or more sibling relationships is associated with a second target text data object that has a second raw predicted similarity score that exceeds the raw predicted similarity score of the target text data object.

10. The computer-implemented method of claim 1, wherein determining each word-wise similarity value that is associated with a particular reference word and a particular target word comprises: determining whether the particular target word is in a threshold-satisfying target word list for the particular target word; and in response to determining that the particular target word is not in the threshold-satisfying target word list, determining the word-wise similarity value based at least in part on a predefined minimal word-wise similarity value.

11. A computing system comprising one or more processors and at least one memory including program code, the at least one memory and the program code configured to, with the one or more processors, cause the computing system to at least: generate a maximal word similarity score for a reference text data object and a target text data object, wherein: (i) the maximal word similarity score describes a maximal value of a transition cost value indicative of a measure of cost to transform a first embedded representation associated with one or more target words of the target text data object into a second embedded representation associated with one or more reference words of the reference text data object, and (ii) the transition cost value is determined based at least in part on, for each word pair comprising a reference word and a target word, a word-wise flow value and a word-wise similarity value; generate a predicted similarity score for the reference text data object and the target text data object based at least in part on the maximal word similarity score; and initiate the performance of one or more prediction-based actions based at least in part on the predicted similarity score.

12. The computing system of claim 11, wherein maximizing the transition cost value is performed in accordance with a maximization constraint requiring that a sum of each word-wise flow value for a particular reference word of the one or more reference words is equal to a document-wide word weight value for the particular reference word in the reference text data object.

13. The computing system of claim 12, wherein the document-wide word weight value is determined based at least in part on: (i) a term frequency value of the particular reference word in the reference text data object, and (ii) a sum of each term frequency value for the one or more reference words in the reference text data object.

14. The computing system of claim 11, wherein maximizing the transition cost value is performed in accordance with a maximization constraint requiring that a sum of each word-wise flow value for a particular t
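
Claim 9 describes zeroing out a candidate's score when a sibling in the hierarchy scores higher. A minimal Python sketch of that idea follows, assuming the hierarchy is a tree stored as a hypothetical parent-to-children dictionary and that raw scores have already been computed. The node labels, score values, and function names are invented for illustration and do not come from the patent.

```python
from collections import deque

# Hypothetical hierarchy (parent -> children) and raw scores, for illustration.
CHILDREN = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
RAW = {"A": 0.6, "B": 0.4, "A1": 0.8, "A2": 0.7, "B1": 0.5}


def sibling_map(children, root):
    """Breadth-first traversal that records each node's siblings."""
    siblings = {root: []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = children.get(node, [])
        for kid in kids:
            # Siblings are the other children of the same parent.
            siblings[kid] = [k for k in kids if k != kid]
            queue.append(kid)
    return siblings


def suppress_by_siblings(raw_scores, children, root):
    """Assign 0.0 to any candidate that has a strictly higher-scoring sibling."""
    siblings = sibling_map(children, root)
    final = {}
    for node, score in raw_scores.items():
        beaten = any(
            raw_scores.get(s, float("-inf")) > score
            for s in siblings.get(node, [])
        )
        final[node] = 0.0 if beaten else score
    return final


def ranked(final_scores):
    """Candidates ordered from highest to lowest final score."""
    return sorted(final_scores, key=final_scores.get, reverse=True)


FINAL = suppress_by_siblings(RAW, CHILDREN, "root")
RANKING = ranked(FINAL)
```

In this toy data, "A2" and "B" are each outscored by a sibling and drop to zero, so only the strongest candidate at each branch keeps its raw score in the ranked list.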

Assignees

Optum Tech Inc

Inventors

Classifications

  • G06F40/279 (Primary): Recognition of textual entities

  • by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation

  • Matching criteria, e.g. proximity measures

  • Hierarchical processing, e.g. outlines

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Frequently asked questions

Who is the assignee on this patent?
Optum Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/279. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 26, 2024 (B2). Legal status and post-grant events are not shown on this page.