Methods and devices for quantifying text similarity

US2021174136A1 · US · A1

Patent metadata
Publication number: US-2021174136-A1
Application number: US-202117181839-A
Country: US
Kind code: A1
Filing date: Feb 22, 2021
Priority date: May 21, 2019
Publication date: Jun 10, 2021
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are computer-implemented methods; computer-implemented systems; and non-transitory, computer-readable media, for quantifying text similarity. One computer-implemented method includes obtaining a plurality of shortest operation paths including one or more edit pairs for correcting an optical correction recognition (OCR) text string with an edit text string, where each of the one or more edit pairs denotes an operation performable to a character of the OCR text string during correction by the edit text string. A plurality of similarity scores is determined, each corresponding to one of the plurality of shortest operation paths and determined by summing historical similarity scores of the one or more edit pairs of each of the plurality of shortest operation paths. A minimum one of the plurality of similarity scores is selected to quantify text similarity between the OCR text string and the edit text string.
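
The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation. It assumes that "shortest operation paths" means all minimum-cost Levenshtein edit scripts, that an edit pair is a hypothetical `(ocr_char, edit_char)` tuple with `""` standing for the missing side of an insertion or deletion, and that a pair absent from the history receives a default score of 1.0. The names `shortest_edit_paths`, `quantify_similarity`, and `history` are illustrative only.

```python
from functools import lru_cache

# (ocr_char, edit_char); "" marks the missing side of an insertion or deletion.
EditPair = tuple[str, str]


def shortest_edit_paths(ocr: str, edit: str) -> list[tuple[EditPair, ...]]:
    """Enumerate every minimum-length edit script turning `ocr` into `edit`."""
    n, m = len(ocr), len(edit)
    # dist[i][j] = Levenshtein distance between the suffixes ocr[i:] and edit[j:].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n or j == m:
                dist[i][j] = (n - i) + (m - j)
            else:
                sub = dist[i + 1][j + 1] + (ocr[i] != edit[j])
                dist[i][j] = min(sub, 1 + dist[i + 1][j], 1 + dist[i][j + 1])

    @lru_cache(maxsize=None)
    def paths(i: int, j: int) -> tuple[tuple[EditPair, ...], ...]:
        if i == n and j == m:
            return ((),)
        found = []
        # Follow only the moves that keep the total cost equal to the optimum.
        if i < n and j < m:
            cost = int(ocr[i] != edit[j])
            if dist[i][j] == cost + dist[i + 1][j + 1]:
                step = ((ocr[i], edit[j]),) if cost else ()  # a match adds no pair
                found += [step + rest for rest in paths(i + 1, j + 1)]
        if i < n and dist[i][j] == 1 + dist[i + 1][j]:  # deletion
            found += [((ocr[i], ""),) + rest for rest in paths(i + 1, j)]
        if j < m and dist[i][j] == 1 + dist[i][j + 1]:  # insertion
            found += [(("", edit[j]),) + rest for rest in paths(i, j + 1)]
        return tuple(found)

    # Different alignments can yield the same pair sequence; keep each once.
    return list(dict.fromkeys(paths(0, 0)))


def quantify_similarity(ocr, edit, history, default=1.0):
    """Sum historical scores along each shortest path and return the minimum."""
    scored = [
        (sum(history.get(pair, default) for pair in path), path)
        for path in shortest_edit_paths(ocr, edit)
    ]
    return min(scored, key=lambda item: item[0])


# Hypothetical history: "1" has often been corrected to "l" before.
history = {("1", "l"): 0.1}
print(quantify_similarity("c1ean", "clean", history))
```

Enumerating every optimal path can grow quickly for long strings; the sketch favors clarity over efficiency and is only meant for short OCR tokens.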

First claim

Opening claim text (preview).

1. A computer-implemented device for quantifying text similarity, comprising: at least one processor; and memory storing computer program code which when executed by the at least one processor, cause the at least one processor to: obtain a plurality of shortest operation paths for correcting an optical correction recognition (OCR) text string with an edit text string, wherein each of the plurality of shortest operation paths includes one or more edit pairs, each of the one or more edit pairs denoting an operation performable to a character of the OCR text string during correction by the edit text string, determine a plurality of similarity scores, each of the plurality of similarity scores corresponding to one of the plurality of shortest operation paths, wherein each of the plurality of similarity scores is determined by summing historical similarity scores of the one or more edit pairs of each of the plurality of shortest operation paths, wherein when summing the historical similarity scores of the one or more edit pairs, the computer program code further causes the at least one processor to: retrieve the historical similarity scores of the one or more edit pairs from a history data library, add the edit pairs in the shortest operation path having the minimum similarity score into the history data library, update the historical similarity scores for the edit pairs in the history data library, wherein when updating the historical similarity scores, the computer program code further causes the at least one processor to: calculate frequencies of edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score, and determine historical similarity scores for the edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score by: performing a log frequency calculation for each of the edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score and normalizing the log frequency calculations to a range of 0.0 to 1.0, and select a minimum one of the plurality of similarity scores to quantify text similarity between the OCR text string and the edit text string.

2.-4. (canceled)

5. The computer-implemented device of claim 1, wherein, when obtaining the plurality of shortest operation paths, the computer program code further causes the at least one processor to: perform an edit distance calculation for correcting the OCR text string with the edit text string, wherein the operation performed on a character of the OCR text string during correction by the edit text string is one of an insertion operation, a deletion operation, or a substitution operation.

6. The computer-implemented device of claim 1, wherein the computer program code further causes the at least one processor to: determine that the minimum one of the plurality of similarity scores is below a predetermined threshold; and in response to determining that the minimum one of the plurality of similarity scores is below a predetermined threshold, correct the OCR text string with the edit text string.

7. The computer-implemented device of claim 1, wherein the computer program code further causes the at least one processor to: determine that the minimum one of the plurality of similarity scores is above a predetermined threshold; and in response to determining that the minimum one of the plurality of similarity scores is above the predetermined threshold, maintain the OCR text string.

8. The computer-implemented device of claim 1, wherein the computer program code further causes the at least one processor to: scan a digital image to capture the OCR text string, and capture the edit text string.

9. A computer-implemented method for quantifying text similarity, comprising: obtaining a plurality of shortest operation paths for correcting an optical correction recognition (OCR) text string with an edit text string, wherein each of the plurality of shortest operation paths includes one or more edit pairs, each of the one or more edit pairs denoting an operation performable to a character of the OCR text string during correction by the edit text string; determining a plurality of similarity scores, each of the plurality of similarity scores corresponding to one of the plurality of shortest operation paths, wherein each of the plurality of similarity scores is determined by summing historical similarity scores of the one or more edit pairs of each of the plurality of shortest operation paths, wherein summing the historical similarity scores of the one or more edit pairs comprises: retrieving the historical similarity scores of the one or more edit pairs from a history data library, adding the edit pairs in the shortest operation path having the minimum similarity score into the history data library, updating the historical similarity scores for the edit pairs in the history data library, wherein updating the historical similarity scores comprises: calculating frequencies of edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score, determining historical similarity scores for the edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score by: performing a log frequency calculation for each of the edit pairs in the history data library corresponding to the edit pairs in the shortest operation path having the minimum similarity score and normalizing the log frequency calculations to a range of 0.0 to 1.0; and selecting a minimum one of the plurality of similarity scores to quantify text similarity between the OCR text string and the edit text string.

10.-12. (canceled)

13. The computer-implemented method of claim 9, wherein the step of obtaining the plurality of shortest operation paths comprises performing an edit distance calculation for correcting the OCR text string with the edit text string, and wherein the operation performable to a character of the OCR text string during correction by the edit text string is one of an insertion operation, a deletion operation, or a substitution operation.

14. The computer-implemented method of claim 9, further comprising: determining that the minimum one of the similarity scores is below a predetermined threshold; and in response to determining that the minimum one of the plurality of similarity scores is below a predetermined threshold, correcting the OCR text string with the edit text string.

15. The computer-implemented method of claim 14, further comprising: determining that the minimum one of the similarity scores is above a predetermined threshold; and in response to determining that the minimum one of the similarity scores is above the predetermined threshold, maintaining the OCR text string.

16. The computer-implemented method of claim 9, further comprising: scanning a digital image to capture the OCR text string, and capturing the edit text string.

17. A non-transitory, computer-readable medium storing one or more instructions executable by a computer system to perform one or more operations to index blockchain data for storage, comprising: obtaining a plurality of shortest operation paths for correcting an optical correction recognition (OCR) text string with an edit text string, wherein each of the plurality of shortest operation paths includes one or more edit pairs, each of the one or more edit pairs denoting an opera
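
The history-library update and the threshold decision recited in the claims can be sketched as follows. The claims name a log frequency calculation normalized to a range of 0.0 to 1.0 but do not give a formula, so everything specific here is an assumption: the use of `log(1 + count)`, normalization by the largest log value, the inversion that gives frequently seen pairs a low score (chosen so that a low score leads to correction), and the treatment of a score exactly equal to the threshold as "maintain". The names `update_history` and `decide` are hypothetical.

```python
import math
from collections import Counter


def update_history(counts: Counter, chosen_path) -> dict:
    """Add the chosen path's edit pairs to the library and rescore every pair."""
    counts.update(chosen_path)  # frequency of each edit pair in the library
    if not counts:
        return {}
    # Assumed log frequency: log(1 + count), strictly positive for count >= 1.
    logs = {pair: math.log(1 + c) for pair, c in counts.items()}
    top = max(logs.values())
    # Assumed normalization: divide by the largest value, then invert so that
    # the most frequent pair scores 0.0 and rare pairs approach 1.0.
    return {pair: 1.0 - value / top for pair, value in logs.items()}


def decide(ocr: str, edit: str, min_score: float, threshold: float) -> str:
    """Return the edit text if the score is below the threshold, else the OCR text."""
    return edit if min_score < threshold else ocr


# Hypothetical library: "1" -> "l" seen 9 times, "0" -> "O" seen once.
counts = Counter({("1", "l"): 9, ("0", "O"): 1})
scores = update_history(counts, [("1", "l")])
print(scores)
print(decide("c1ean", "clean", scores[("1", "l")], threshold=0.5))
```

Min-max scaling or a different logarithm base would satisfy the same 0.0 to 1.0 range; the choice above is only one way to keep every score inside it.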

Assignees

Advanced New Technologies Co Ltd

Inventors

Classifications

  • Syntactic or structural pattern recognition, e.g. symbolic string recognition · CPC title

  • Detection or correction of errors, e.g. by rescanning the pattern · CPC title

  • Proximity, similarity or dissimilarity measures · CPC title

  • G06F18/22 (Primary)

    Matching criteria, e.g. proximity measures · CPC title

  • Character recognition · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021174136A1 cover?
Disclosed herein are computer-implemented methods; computer-implemented systems; and non-transitory, computer-readable media, for quantifying text similarity. One computer-implemented method includes obtaining a plurality of shortest operation paths including one or more edit pairs for correcting an optical correction recognition (OCR) text string with an edit text string, where each of the one…
Who is the assignee on this patent?
Advanced New Technologies Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V30/1983. Mapped technology areas include Physics.
When was this patent published?
Publication date: June 10, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).