System and method for transcribing handwritten records using word groupings based on feature vectors

US9740928B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9740928-B2
Application number  US-201514841519-A
Country             US
Kind code           B2
Filing date         Aug 31, 2015
Priority date       Aug 29, 2014
Publication date    Aug 22, 2017
Grant date          Aug 22, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance between feature vectors), similar words are grouped together in clusters, and a centroid that has features most representative of words in the cluster is selected. A digitized text word is selected for each cluster based on review of a centroid in the cluster, and is assigned to all words in that cluster and is used as computer searchable text for those word images where they appear in documents. An analyst may review clusters to permit refinement of the parameters used for grouping words in clusters, including the adjustment of weights and other factors used for determining the distance between feature vectors.
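The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes every word feature has already been reduced to a single number, uses a weighted Euclidean distance as one possible distance measure, and treats `weighted_distance`, `pick_centroid`, `label_cluster`, the sample ids, and the `transcribe` callback (standing in for a human reviewer) as hypothetical names.

```python
from math import sqrt

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def pick_centroid(cluster, weights):
    """Return the id of the member whose vector lies closest to the cluster mean."""
    vectors = list(cluster.values())
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return min(cluster, key=lambda wid: weighted_distance(cluster[wid], mean, weights))

def label_cluster(cluster, weights, transcribe):
    """Transcribe only the centroid and copy that text to every member."""
    centroid_id = pick_centroid(cluster, weights)
    text = transcribe(centroid_id)
    return {wid: text for wid in cluster}

# Hypothetical data: three word snippets described by
# (aspect ratio, loop count, peak count). Values are made up.
cluster = {
    "img_001": [2.0, 1.0, 3.0],
    "img_002": [2.2, 1.0, 3.0],
    "img_003": [2.9, 1.0, 4.0],
}
weights = [1.0, 2.0, 0.5]  # assumed per-feature weights an analyst could tune

# The lambda stands in for a person reading the centroid image.
labels = label_cluster(cluster, weights, transcribe=lambda wid: "Smith")
print(pick_centroid(cluster, weights), labels)
```

Only the centroid is read by a person in this sketch; every other snippet in the cluster inherits its text, which is where the labor saving described in the abstract would come from. Changing `weights` changes which snippets count as similar, mirroring the analyst-driven refinement the abstract mentions.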

First claim

Opening claim text (preview).

What is claimed is:
1. A method for creating digitized text for a record from an image of the record, comprising: receiving multiple word images from one or more records; for each received word image, identifying multiple word features of that word image; assigning one or more values to each of the multiple word features for each word image in order to create a feature vector associated with that word image; and assigning each word image to a word cluster based on its feature vector, comprising: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated with those word images; selecting, from among the multiple word images, two of the word images that are closest in distance to each other; and assigning the two of the word images to the word cluster.
2. The method of claim 1, further comprising: selecting a representative word image in the word cluster; selecting digitized text for the representative word image; and assigning the selected digitized text to each of the word images in the word cluster.
3. The method of claim 2, wherein the step of selecting a representative word image comprises: determining a mean of the values assigned to the word features in order to create feature vectors for the two of the word images that are assigned to the word cluster; and selecting, as the representative word image, the one of the two word images having values in its associated feature vector closest to the mean.
4. The method of claim 2, wherein the step of selecting a representative word image comprises: determining a mean of the values assigned to each of the multiple word features for each of the two word images that are assigned to the word cluster; creating a phantom word image that has exactly the same mean of values; and selecting, as the representative word image, the phantom word.
5. The method of claim 2, further comprising: selecting, from among the multiple word images other than the assigned word images, an additional one of the multiple word images that is closest to the representative word image; assigning the additional one of the word images to the word cluster; and repeating the foregoing steps until a predetermined number of the multiple word images have been assigned to the word cluster.
6. The method of claim 2, wherein the created feature vector comprises an array of elements, each element corresponding to one of the multiple word features.
7. The method of claim 6, wherein the values assigned to each of the multiple words features comprise one of: a single value parameter; a string of numerical values; and multi-dimensional values comprising either (a) two coordinate values or (b) two coordinate values and a third value.
8. The method of claim 1, wherein the step of calculating a distance comprises: calculating a plurality of distances, each being the difference between the value of one feature of the one of the word images and a corresponding feature of the other one of the multiple word images; and summing together the plurality of distances.
9. The method of claim 8, wherein each of the plurality of distances is calculated using dynamic time warping.
10. The method of claim 8, wherein the created feature vector further comprises a weight assigned to each of the multiple word features, and wherein the method further comprises using the weight when summing together the plurality of distances in order to calculate a distance between each one of the multiple word images and every other one of the multiple word images.
11. The method of claim 1, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.
12. The method of claim 1, wherein a feature vector ID is assigned to the feature vector, identifying both the feature vector and its associated word image.
13. The method of claim 1, wherein the multiple word features comprise at least two word features from a group comprising top line profile, bottom line profile, left line profile, right line profile, vertical projection profile, horizontal projection profile, peaks, valleys, watershed cup areas, watershed cap areas, loops and holes, intersections and crossings, stroke orientation, word aspect ratio, and convex hull.
14. A system for creating digitized text for a record from an image of the record, comprising: one or more processors; and a memory, the memory storing instructions that are executable by the one or more processors and configure the system to: receive multiple word images from one or more records; for each received word image, identify multiple word features of that word image; assign one or more values to each of the multiple word features for each word image in order to create a feature vector associated with that word image; and assign each word image to a word cluster based on its feature vector, wherein each word image is assigned to a word cluster based on its feature vector by: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated with those word images; selecting, from among the multiple word images, two of the word images that are closest in distance to each other; and assigning the two of the word images to the word cluster.
15. The system of claim 14, wherein the stored instructions further configure the system to: select a representative word image in the word cluster; select digitized text for the representative word image; and assign the selected digitized text to each of the word images in the word cluster.
16. The system of claim 15, wherein a representative word image is selected by: determining a mean of the values assigned to the word features in order to create feature vectors for the two of the word images that are assigned to the word cluster; and selecting, as the representative word image, the one of the two word images having values in its associated feature vector closest to the mean.
17. The system of claim 15, wherein a representative word image is selected by: determining a mean of the values assigned to each of the multiple word features for each of the two word images that are assigned to the word cluster; creating a phantom word image that has exactly the same mean of values; and selecting, as the representative word image, the phantom word.
18. The system of claim 15, wherein the stored instructions further configure the system to: select, from among the multiple word images other than the assigned word images, an additional one of the multiple word images that is closest to the representative word image; assign the additional one of the word images to the word cluster; and repeat the foregoing steps until a predetermined number of the multiple word images have been assigned to the word cluster.
19. The system of claim 15, wherein the created feature vector comprises an array of elements, each element corresponding to one of the multiple word features.
20. The system of claim 19, wherein the values assigned to each of the multiple words features comprise one of: a single value parameter; a string of numerical values; and multi-dimensional values comprising either (a) two coordinate values or (b) two coordinate values and a third value.
21. The system of claim 14, wherein a distance is calculated by: calculat
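As a rough illustration of claims 1, 5, and 8 to 10 above, the sketch below seeds a cluster with the closest pair of word images and then grows it to a predetermined size, using a weighted sum of per-feature dynamic time warping distances. It is a sketch under stated assumptions, not the patentee's code: each feature is assumed to be a sequence of numbers (for example a top line profile), all function names, feature names, ids, and values are hypothetical, and the representative is chosen as the medoid (the member with the smallest total distance to the others) rather than by the mean-based selection of claims 3 and 4, because sequences of different lengths have no straightforward mean.

```python
from itertools import combinations

def dtw(s, t):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(s), len(t)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(s[i - 1] - t[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def word_distance(a, b, weights):
    """Weighted sum of per-feature DTW distances between two word images."""
    return sum(weights[name] * dtw(a[name], b[name]) for name in weights)

def closest_pair(words, weights):
    """Compare every word image with every other one; return the nearest two ids."""
    return min(combinations(sorted(words), 2),
               key=lambda p: word_distance(words[p[0]], words[p[1]], weights))

def medoid(ids, words, weights):
    """Assumed stand-in for the representative: smallest total distance to members."""
    return min(ids, key=lambda i: sum(word_distance(words[i], words[j], weights) for j in ids))

def grow_cluster(words, weights, target_size):
    """Seed with the closest pair, then add the word nearest the representative."""
    cluster = list(closest_pair(words, weights))
    while len(cluster) < min(target_size, len(words)):
        rep = medoid(cluster, words, weights)
        rest = [w for w in sorted(words) if w not in cluster]
        cluster.append(min(rest, key=lambda w: word_distance(words[w], words[rep], weights)))
    return cluster

# Hypothetical profiles for four word images; values are made up.
words = {
    "a": {"top": [1, 2, 3, 2], "bottom": [0, 0, 1]},
    "b": {"top": [1, 2, 2, 3, 2], "bottom": [0, 1]},
    "c": {"top": [1, 3, 3, 2], "bottom": [0, 0, 2]},
    "d": {"top": [5, 5, 5], "bottom": [4, 4]},
}
weights = {"top": 1.0, "bottom": 0.5}  # assumed per-feature weights
print(grow_cluster(words, weights, target_size=3))
```

Dynamic time warping lets profiles of different lengths be compared, which suits handwriting where the same word can be written wider or narrower. The all-pairs comparison in `closest_pair` grows quadratically with the number of word images, so a production system would likely need indexing or sampling that this sketch leaves out.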

Assignees

Inventors

Classifications

  • using clustering, e.g. of similar faces in social networks · CPC title

  • using word shape · CPC title

  • using recognition of characters or words · CPC title

  • Clustering techniques · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9740928B2 cover?
A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance bet…
Who is the assignee on this patent?
Ancestry.com Operations Inc.
What technology area does this patent fall under?
Primary CPC classification G06V30/2264. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 22, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).