System and method for transcribing historical records into digitized text

US9767353B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9767353-B2
Application number: US-201514841502-A
Country: US
Kind code: B2
Filing date: Aug 31, 2015
Priority date: Aug 29, 2014
Publication date: Sep 19, 2017
Grant date: Sep 19, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance between feature vectors), similar words are grouped together in clusters, and a centroid that has features most representative of words in the cluster is selected. A digitized text word is selected for each cluster based on review of a centroid in the cluster, and is assigned to all words in that cluster and is used as computer searchable text for those word images where they appear in documents. An analyst may review clusters to permit refinement of the parameters used for grouping words in clusters, including the adjustment of weights and other factors used for determining the distance between feature vectors.
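The flow in the abstract (feature vectors, distance-based grouping, a representative centroid, one text label per cluster) can be illustrated with a minimal sketch. The abstract does not name a clustering algorithm or a distance formula, so the greedy threshold grouping and the weighted Euclidean distance below are assumptions made for illustration. The names `distance`, `cluster`, `centroid_index`, `transcribe`, and `label_for` are hypothetical; `label_for` stands in for the review step that supplies the digitized word for a centroid.

```python
import math

def distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def cluster(vectors, weights, threshold):
    """Greedy grouping (assumed): a vector joins the first cluster whose seed is within threshold."""
    clusters = []  # each cluster is a list of indices into vectors
    for i, v in enumerate(vectors):
        for members in clusters:
            if distance(v, vectors[members[0]], weights) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster was close enough, so start a new one
    return clusters

def centroid_index(members, vectors, weights):
    """Return the index of the member whose vector is closest to the cluster mean."""
    dims = len(vectors[members[0]])
    mean = [sum(vectors[i][d] for i in members) / len(members) for d in range(dims)]
    return min(members, key=lambda i: distance(vectors[i], mean, weights))

def transcribe(vectors, weights, threshold, label_for):
    """Give every word image the text chosen for its cluster's centroid."""
    text = [None] * len(vectors)
    for members in cluster(vectors, weights, threshold):
        word = label_for(centroid_index(members, vectors, weights))
        for i in members:
            text[i] = word
    return text
```

In this sketch only one word image per cluster needs a label, and that label is propagated to every other member. Changing `weights` or `threshold` changes which word images land together, which corresponds to the parameter refinement the abstract attributes to the analyst.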

First claim

Opening claim text (preview).

What is claimed is:

1. A method for creating digitized text for a record from an image of the record, comprising: obtaining one or more digital images of a record; evaluating the digital images in order to locate each of multiple word images; for each located word image, identifying multiple word features of that word image; assigning, to one group of word images, designated ones of the multiple word images based on the distance between the multiple word images, the distance representing the similarity of word features between at least two of the word images; selecting a representative word image in the one group of word images, by calculating, at a word clustering system, word feature values for the word features of each of the multiple word images assigned to the one group of word images, and using the word feature values to determine, at the word clustering system, a word image that is representative of the word images in the one group of word images; selecting digitized text for the representative word image; and assigning the selected digitized text to each of the word images in the one group of word images.

2. The method of claim 1, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.

3. The method of claim 1, wherein the word images each have corresponding word features, and wherein the method further comprises: assigning a value to each of the word features; and calculating a distance between the at least two of the word images, based on the difference between values of corresponding word features of those two word images.

4. The method of claim 3, wherein the word images each have corresponding word features, and wherein the method further comprises: assigning a value to each word feature in a first of the at least two word images and to its corresponding word feature in a second of the at least two word images; determining the distance between each word feature in the first of the at least two word images and its corresponding word feature in the second of the at least two word images, based on the values assigned to those word features; establishing, in advance, a feature weight for each of the word features; and summing together the determined distance for all of the corresponding word features, with each determined distance weighted according to the feature weight established for that word feature.

5. The method of claim 4, wherein the determined distance is calculated based on the dimensional nature of the value, and wherein for word features having a value represented by single value, the determined distance is calculated by determining the difference between: (1) the value of each word feature of the first of the at least two word images and (2) the value of its corresponding word feature in the second of the at least two word images.

6. The method of claim 4, wherein the determined distance is calculated based on the dimensional nature of the value, and wherein for word features having a value represented by sequence of values and for word features having a value represented by multi-dimensional values, the determined distance is calculated using dynamic time warping.

7. The method of claim 1, further comprising: calculating a mean of values for word features of the multiple word images assigned to the one group of word images, wherein the step of selecting a representative word image comprises selecting a word image in the one group of word images that is closest to the mean.

8. The method of claim 1, wherein the step of selecting digitized text for the representative word image comprises: displaying the representative word image to a handwriting analyst; and receiving the selected digitized text from the handwriting analyst based on the displayed representative word.

9. The method of claim 1, wherein the multiple word features are selected from a group comprising: top line profile, bottom-line profile, left line profile, right line profile, vertical projection profile, horizontal projection profile, peaks, valleys, watershed cup areas, watershed cap areas, loops and holes, intersections and crossings, stroke orientation, word aspect ratio, and convex whole.

10. The method of claim 1, further comprising providing a digitized record with searchable computer-readable text based on the selected digitized text assigned to each of the word images.

11. The method of claim 1, further comprising: calculating a mean of the word feature values of the multiple word images assigned to the one group of word images, wherein the step of selecting a representative word image comprises selecting, as the representative word, a phantom word image that has values equal to the mean.

12. The system of claim 1, wherein the stored instructions further configure the system to: calculate a mean of values for word features of the multiple word images assigned to the one group of word images, wherein the step of selecting a representative word image comprises selecting, as the representative word, a phantom word image that has a value equal to the mean.

13. A system for creating digitized text for a record from an image of the record, comprising: one or more processors; and a memory, the memory storing instructions that are executable by the one or more processors and configure the system to: obtain one or more digital images of a record; evaluate the digital images in order to locate each of multiple word images; for each located word image, identify multiple word features of that word image; assign, to one group of word images, designated ones of the multiple word images based on the distance between the multiple word images, the distance representing the similarity of word features between at least two of the word images; select a representative word image in the one group of word images, by calculating, at the system, word feature values for the word features of each of the multiple word images assigned to the one group of word images, and using the word feature values to determine, at the system, a word image that is representative of the word images in the one group of word images; select digitized text for the representative word image; and assign the selected digitized text to each of the word images in the one group of word images.

14. The system of claim 13, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.

15. The system of claim 14, wherein the word images each have corresponding word features, and wherein the stored instructions further configure the system to: assign a value to each of the word features; and calculate a distance between the at least two of the word images, based on the difference between values of corresponding word features of those two words.

16. The system of claim 15, wherein the word images each have corresponding word features, and wherein the stored instructions further configure the system to: assign a value to each word feature in a first of the at least two word images and to its corresponding word feature in a second of the at least two word images; determine the distance between each word feature in the first of the at least two word images and its corresponding word feature in the second of the at least two word images, based on the values assigned to those word features; establish, in advance, a feature weight for each of the word features; and sum together the determined distance for all of the corre
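Claims 4 to 6 describe a word-to-word distance built as a weighted sum of per-feature distances, with a plain difference for single-valued features and dynamic time warping for sequence-valued ones. The sketch below is one possible reading of that wording, not the patented implementation. The function names, the dictionary layout, the sample feature names, and the use of the absolute difference as the local cost are all assumptions. It handles only one-dimensional numeric sequences; the multi-dimensional values that claim 6 also mentions would need a different local cost.

```python
def dtw(a, b):
    """Textbook dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def feature_distance(x, y):
    """Absolute difference for single values, DTW for sequences of values."""
    if isinstance(x, (int, float)):
        return abs(x - y)
    return dtw(x, y)

def word_distance(word_a, word_b, weights):
    """Sum the per-feature distances, each scaled by its pre-set feature weight."""
    return sum(
        weights[name] * feature_distance(word_a[name], word_b[name])
        for name in weights
    )
```

As a usage example under the same assumptions, two word images described by a hypothetical `aspect_ratio` value and a `top_profile` sequence compare as follows. The profiles differ only by a repeated sample, which DTW aligns at zero cost, so the distance comes entirely from the weighted aspect-ratio difference.

```python
word_a = {"aspect_ratio": 2.0, "top_profile": [1, 2, 3]}
word_b = {"aspect_ratio": 2.5, "top_profile": [1, 2, 2, 3]}
weights = {"aspect_ratio": 2.0, "top_profile": 1.0}
print(word_distance(word_a, word_b, weights))  # 1.0
```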

Assignees

Ancestry Com Operations Inc

Inventors

Classifications

  • using clustering, e.g. of similar faces in social networks

  • using word shape

  • using recognition of characters or words

  • Matching criteria, e.g. proximity measures

  • Clustering techniques

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9767353B2 cover?
A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance bet…
Who is the assignee on this patent?
Ancestry Com Operations Inc
What technology area does this patent fall under?
Primary CPC classification G06V30/2264. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 19, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).