System and method for transcribing historical records into digitized text
US-2016063321-A1 · Mar 3, 2016 · US
US9740928B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9740928-B2 |
| Application number | US-201514841519-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 31, 2015 |
| Priority date | Aug 29, 2014 |
| Publication date | Aug 22, 2017 |
| Grant date | Aug 22, 2017 |
Official abstract text for this publication.
A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance between feature vectors), similar words are grouped together in clusters, and a centroid that has features most representative of words in the cluster is selected. A digitized text word is selected for each cluster based on review of a centroid in the cluster, and is assigned to all words in that cluster and is used as computer searchable text for those word images where they appear in documents. An analyst may review clusters to permit refinement of the parameters used for grouping words in clusters, including the adjustment of weights and other factors used for determining the distance between feature vectors.
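The centroid and labeling steps in the abstract can be illustrated with a minimal sketch. It assumes each word image has already been reduced to a fixed-length list of numeric feature values, and it uses a weighted Euclidean distance, which is one possible choice; the abstract only says that distances between feature vectors are weighted. The names `weighted_distance`, `pick_centroid`, and `label_cluster` are hypothetical and do not come from the patent.

```python
from math import sqrt

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length feature vectors.
    The metric is an assumption; the abstract does not name one."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def pick_centroid(cluster, weights):
    """Return the index of the member closest to the cluster's per-feature mean."""
    n = len(cluster)
    mean = [sum(column) / n for column in zip(*cluster)]
    return min(range(n), key=lambda i: weighted_distance(cluster[i], mean, weights))

def label_cluster(word_ids, text):
    """Assign one reviewed transcription to every word image in the cluster."""
    return {word_id: text for word_id in word_ids}

# Hypothetical data: three word images, two numeric features each.
cluster = [[1.0, 2.0], [1.2, 2.1], [3.0, 5.0]]
weights = [1.0, 1.0]
centroid_index = pick_centroid(cluster, weights)
labels = label_cluster(["w1", "w2", "w3"], "Smith")
```

Raising or lowering an entry in `weights` changes how much that feature contributes to the distance, which is the kind of parameter an analyst could adjust after reviewing clusters.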
Opening claim text (preview).
What is claimed is:

1. A method for creating digitized text for a record from an image of the record, comprising: receiving multiple word images from one or more records; for each received word image, identifying multiple word features of that word image; assigning one or more values to each of the multiple word features for each word image in order to create a feature vector associated with that word image; and assigning each word image to a word cluster based on its feature vector, comprising: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated with those word images; selecting, from among the multiple word images, two of the word images that are closest in distance to each other; and assigning the two of the word images to the word cluster.

2. The method of claim 1, further comprising: selecting a representative word image in the word cluster; selecting digitized text for the representative word image; and assigning the selected digitized text to each of the word images in the word cluster.

3. The method of claim 2, wherein the step of selecting a representative word image comprises: determining a mean of the values assigned to the word features in order to create feature vectors for the two of the word images that are assigned to the word cluster; and selecting, as the representative word image, the one of the two word images having values in its associated feature vector closest to the mean.

4. The method of claim 2, wherein the step of selecting a representative word image comprises: determining a mean of the values assigned to each of the multiple word features for each of the two word images that are assigned to the word cluster; creating a phantom word image that has exactly the same mean of values; and selecting, as the representative word image, the phantom word.

5. The method of claim 2, further comprising: selecting, from among the multiple word images other than the assigned word images, an additional one of the multiple word images that is closest to the representative word image; assigning the additional one of the word images to the word cluster; and repeating the foregoing steps until a predetermined number of the multiple word images have been assigned to the word cluster.

6. The method of claim 2, wherein the created feature vector comprises an array of elements, each element corresponding to one of the multiple word features.

7. The method of claim 6, wherein the values assigned to each of the multiple words features comprise one of: a single value parameter; a string of numerical values; and multi-dimensional values comprising either (a) two coordinate values or (b) two coordinate values and a third value.

8. The method of claim 1, wherein the step of calculating a distance comprises: calculating a plurality of distances, each being the difference between the value of one feature of the one of the word images and a corresponding feature of the other one of the multiple word images; and summing together the plurality of distances.

9. The method of claim 8, wherein each of the plurality of distances is calculated using dynamic time warping.

10. The method of claim 8, wherein the created feature vector further comprises a weight assigned to each of the multiple word features, and wherein the method further comprises using the weight when summing together the plurality of distances in order to calculate a distance between each one of the multiple word images and every other one of the multiple word images.

11. The method of claim 1, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.

12. The method of claim 1, wherein a feature vector ID is assigned to the feature vector, identifying both the feature vector and its associated word image.

13. The method of claim 1, wherein the multiple word features comprise at least two word features from a group comprising top line profile, bottom line profile, left line profile, right line profile, vertical projection profile, horizontal projection profile, peaks, valleys, watershed cup areas, watershed cap areas, loops and holes, intersections and crossings, stroke orientation, word aspect ratio, and convex hull.

14. A system for creating digitized text for a record from an image of the record, comprising: one or more processors; and a memory, the memory storing instructions that are executable by the one or more processors and configure the system to: receive multiple word images from one or more records; for each received word image, identify multiple word features of that word image; assign one or more values to each of the multiple word features for each word image in order to create a feature vector associated with that word image; and assign each word image to a word cluster based on its feature vector, wherein each word image is assigned to a word cluster based on its feature vector by: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated with those word images; selecting, from among the multiple word images, two of the word images that are closest in distance to each other; and assigning the two of the word images to the word cluster.

15. The system of claim 14, wherein the stored instructions further configure the system to: select a representative word image in the word cluster; select digitized text for the representative word image; and assign the selected digitized text to each of the word images in the word cluster.

16. The system of claim 15, wherein a representative word image is selected by: determining a mean of the values assigned to the word features in order to create feature vectors for the two of the word images that are assigned to the word cluster; and selecting, as the representative word image, the one of the two word images having values in its associated feature vector closest to the mean.

17. The system of claim 15, wherein a representative word image is selected by: determining a mean of the values assigned to each of the multiple word features for each of the two word images that are assigned to the word cluster; creating a phantom word image that has exactly the same mean of values; and selecting, as the representative word image, the phantom word.

18. The system of claim 15, wherein the stored instructions further configure the system to: select, from among the multiple word images other than the assigned word images, an additional one of the multiple word images that is closest to the representative word image; assign the additional one of the word images to the word cluster; and repeat the foregoing steps until a predetermined number of the multiple word images have been assigned to the word cluster.

19. The system of claim 15, wherein the created feature vector comprises an array of elements, each element corresponding to one of the multiple word features.

20. The system of claim 19, wherein the values assigned to each of the multiple words features comprise one of: a single value parameter; a string of numerical values; and multi-dimensional values comprising either (a) two coordinate values or (b) two coordinate values and a third value.

21. The system of claim 14, wherein a distance is calculated by: calculat
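The distance and cluster-growing steps recited in claims 1, 5, and 8 to 10 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each word image is assumed to be a list of features, each feature a sequence of numbers (a scalar feature is a one-element sequence), and the per-feature distance is a textbook dynamic time warping cost. Because sequences may differ in length, the representative here is a medoid (the member with the smallest total distance to the other members), which is swapped in for the mean-based selection of claims 3 and 4. All function names are hypothetical.

```python
def dtw(a, b):
    """Textbook dynamic-time-warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def word_distance(u, v, weights):
    """Weighted sum of per-feature DTW distances between two word images."""
    return sum(w * dtw(fu, fv) for fu, fv, w in zip(u, v, weights))

def closest_pair(words, weights):
    """Indices of the two word images with the smallest distance (needs >= 2 words)."""
    pairs = [(i, j) for i in range(len(words)) for j in range(i + 1, len(words))]
    return min(pairs, key=lambda p: word_distance(words[p[0]], words[p[1]], weights))

def representative(words, members, weights):
    """Medoid: the member with the smallest total distance to the other members.
    This stands in for the mean-based representative and is an assumption."""
    return min(
        members,
        key=lambda i: sum(word_distance(words[i], words[j], weights) for j in members),
    )

def grow_cluster(words, weights, target_size):
    """Seed with the closest pair, then add the word nearest the representative
    until the cluster holds target_size members (or no words remain)."""
    members = list(closest_pair(words, weights))
    limit = min(target_size, len(words))
    while len(members) < limit:
        rep = representative(words, members, weights)
        rest = [i for i in range(len(words)) if i not in members]
        nearest = min(rest, key=lambda i: word_distance(words[i], words[rep], weights))
        members.append(nearest)
    return members

# Hypothetical data: each word has a profile sequence and a one-element scalar feature.
sample_words = [
    [[0, 1, 2, 1, 0], [3]],
    [[0, 1, 1, 2, 1, 0], [3]],
    [[5, 5, 5], [9]],
    [[0, 2, 2, 1, 0], [4]],
]
sample_weights = [1.0, 0.5]
cluster_members = grow_cluster(sample_words, sample_weights, 3)
```

Computing every pairwise distance is quadratic in the number of word images, so a large collection would likely need caching of distances or a pre-filtering step; neither is shown here.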
using clustering, e.g. of similar faces in social networks · CPC title
using word shape · CPC title
using recognition of characters or words · CPC title
Clustering techniques · CPC title
Matching criteria, e.g. proximity measures · CPC title