System and method for transcribing handwritten records using word grouping with assigned centroids

US9619702B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9619702-B2
Application number	US-201514841542-A
Country	US
Kind code	B2
Filing date	Aug 31, 2015
Priority date	Aug 29, 2014
Publication date	Apr 11, 2017
Grant date	Apr 11, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance between feature vectors), similar words are grouped together in clusters, and a centroid that has features most representative of words in the cluster is selected. A digitized text word is selected for each cluster based on review of a centroid in the cluster, and is assigned to all words in that cluster and is used as computer searchable text for those word images where they appear in documents. An analyst may review clusters to permit refinement of the parameters used for grouping words in clusters, including the adjustment of weights and other factors used for determining the distance between feature vectors.
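
The flow described in the abstract (feature vectors, a weighted distance, a representative centroid per cluster, and one transcription copied to every member) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name a distance metric, so a weighted Euclidean distance is assumed, and the function names, the two-feature vectors, the weights, and the analyst input are all hypothetical.

```python
from math import sqrt

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors (assumed metric)."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def pick_centroid(cluster, vectors, weights):
    """Return the id of the member whose vector lies closest to the cluster mean."""
    dims = len(weights)
    mean = [sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dims)]
    return min(cluster, key=lambda i: weighted_distance(vectors[i], mean, weights))

def propagate_labels(clusters, vectors, weights, transcribe):
    """Ask for text once per cluster (for its centroid) and copy it to every member."""
    text = {}
    for cluster in clusters:
        centroid = pick_centroid(cluster, vectors, weights)
        label = transcribe(centroid)  # one analyst entry per cluster
        for image_id in cluster:
            text[image_id] = label
    return text

# Hypothetical two-feature vectors (say, aspect ratio and loop count) for six word images.
vectors = {
    "w1": [2.0, 1.0], "w2": [2.2, 1.0], "w3": [2.1, 1.0],
    "w4": [5.0, 3.0], "w5": [5.4, 3.0], "w6": [5.1, 3.0],
}
weights = [1.0, 0.5]  # adjustable per-feature weights, as the abstract allows
clusters = [["w1", "w2", "w3"], ["w4", "w5", "w6"]]
analyst_input = {"w3": "John", "w6": "Elizabeth"}  # stand-in for the analyst's typing

labels = propagate_labels(clusters, vectors, weights, lambda image_id: analyst_input[image_id])
print(labels)
```

In this sketch the analyst types two words and all six word images receive searchable text. Changing `weights` changes which images count as similar, which is the kind of refinement the abstract attributes to the analyst.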

First claim

Opening claim text (preview).

What is claimed is:
1. A method for creating digitized text for a record from an image of the record, comprising: obtaining a digital image of a record; evaluating the record image in order to locate each of multiple word images; for each located word image, identifying multiple word features of that word image; assigning each of the multiple word images that have similar word features to one of a plurality of word clusters; selecting a representative word image in each of the word clusters as a centroid; reviewing, by an analyst, the centroid in each of the word clusters, and entering digitized text for the centroid; and assigning the digitized text for the centroid to all other word images in the same word cluster as the centroid.
2. The method of claim 1, further comprising: reviewing, by the analyst, at least one sampling of word images in at least one word cluster; determining, based on judgment of the analyst, whether the sampled word images are the same word as the centroid for the word cluster and whether the sampled words have been correctly included in the word cluster; determining that a threshold number of the sampled word images have not been correctly included in the word cluster; and in response to determining that a threshold number of words have not been correctly included, marking the cluster as suspicious.
3. The method of claim 2, further comprising: determining that a threshold number of the sample word images have been correctly included in the cluster; and in response to determining that a threshold number of words have been correctly included in the cluster, maintaining the cluster.
4. The method of claim 2, wherein each of the word images have corresponding word features, and wherein the method further comprises: assigning a value to each of the word features; assigning a weight to each of the word features; assigning each of the multiple word images that have similar word features to one of a plurality of word clusters, based at least partially on the weight; and in response to determining that a threshold number of words have not been correctly included, adjusting the assigned weight by the analyst.
5. The method of claim 1, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.
6. The method of claim 1, wherein assigning each of the multiple word images to one of a plurality of clusters comprises: assigning one or more values to each of the multiple word features for each word image in order to create a feature vector for that word image; and assigning each word image to a word cluster based on its feature vector.
7. The method of claim 1, wherein the step of assigning each word image to a word cluster based on its feature vector, comprises: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated with those word images; selecting, from among the multiple word images, two of the word images that are closest in distance to each other; and assigning the two of the word images to the word cluster.
8. The method of claim 7, further comprising: selecting, from among the multiple word images other than the assigned word images, an additional one of the multiple word images that is closest to the representative word image; assigning the additional one of the word images to the word cluster; and repeating the foregoing steps until a predetermined number of the multiple word images have been assigned to the word cluster.
9. The method of claim 8, wherein the step of selecting a representative word image as a centroid comprises: determining a mean of the values in feature vectors for the word images that are assigned to the word cluster; and selecting, as the representative word image, one of the word images having values in its associated feature vector closest to the mean.
10. The method of claim 1, wherein the multiple word features are selected from a group comprising: top line profile, bottom-line profile, left line profile, right line profile, vertical projection profile, horizontal projection profile, peaks, valleys, watershed cup areas, watershed cap areas, loops and holes, intersections and crossings, stroke orientation, word aspect ratio, and convex whole.
11. A system for creating digitized text for a record from an image of the record, comprising: one or more processors; and a memory, the memory storing instructions that are executable by the one or more processors and configure the system to: obtain a digital image of a record; evaluate the record image in order to locate each of multiple word images; for each located word image, identify multiple word features of that word image; assign each of the multiple word images that have similar word features to one of a plurality of word clusters; select a representative word image in each of the word clusters as a centroid; receive, from an analyst, the centroid in each of the word clusters, and entering digitized text for the centroid; and assign the digitized text for the centroid to all other word images in the same word cluster as the centroid.
12. The system of claim 11, wherein the stored instructions further configure the system to: receive, from the analyst, at least one sampling of word images in at least one word cluster; determine, based on judgment of the analyst, whether the sampled word images are the same word as the centroid for the word cluster and whether the sampled words have been correctly included in the word cluster; determine that a threshold number of the sampled word images have not been correctly included in the word cluster; and in response to determining that a threshold number of words have not been correctly included, mark the cluster as suspicious.
13. The system of claim 12, wherein the stored instructions further configure the system to: determine that a threshold number of the sample word images have been correctly included in the cluster; and in response to determining that a threshold number of words have been correctly included in the cluster, maintain the cluster.
14. The system of claim 12, wherein each of the word images have corresponding word features, and wherein the stored instructions further configure the system to: assign a value to each of the word features; assign a weight to each of the word features; assign each of the multiple word images that have similar word features to one of a plurality of word clusters, based at least partially on the weight; and in response to determining that a threshold number of words have not been correctly included, adjust the assigned weight by the analyst.
15. The system of claim 11, wherein the record is a historical record having handwritten words, and wherein the multiple word images are each an image of one of the handwritten words.
16. The system of claim 11, wherein each of the multiple word images is assigned to one of a plurality of clusters by: assigning one or more values to each of the multiple word features for each word image in order to create a feature vector for that word image; and assigning each word image to a word cluster based on its feature vector.
17. The system of claim 11, wherein each word image is assigned to a word cluster based on its feature vector, by: calculating a distance between each one of the multiple word images and every other one of the multiple word images, based on feature vectors associated
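
As a reading aid for claims 7 to 9 and claims 2 and 3, the sketch below grows one cluster from the closest pair of word images, repeatedly adds the remaining image nearest to the current representative, and then applies a review threshold to an analyst's sample. It is a minimal illustration and not the patented implementation: plain Euclidean distance, the function names, the sample vectors, and the single `max_wrong` threshold are assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

def representative(cluster, vectors):
    """Claim 9: the member whose feature vector is closest to the cluster mean."""
    dims = len(vectors[cluster[0]])
    mean = [sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dims)]
    # On a tie, min() keeps the earliest member in the list.
    return min(cluster, key=lambda i: dist(vectors[i], mean))

def grow_cluster(vectors, size):
    """Claims 7 and 8: seed with the closest pair, then add nearest images up to `size`."""
    a, b = min(combinations(vectors, 2),
               key=lambda pair: dist(vectors[pair[0]], vectors[pair[1]]))
    cluster = [a, b]
    remaining = set(vectors) - {a, b}
    while len(cluster) < size and remaining:
        rep = representative(cluster, vectors)
        # sorted() only makes tie-breaking deterministic.
        nearest = min(sorted(remaining), key=lambda i: dist(vectors[i], vectors[rep]))
        cluster.append(nearest)
        remaining.remove(nearest)
    return cluster

def review(sample_verdicts, max_wrong):
    """Claims 2 and 3, simplified to one assumed threshold on wrongly included words."""
    wrong = sum(1 for ok in sample_verdicts if not ok)
    return "suspicious" if wrong >= max_wrong else "maintained"

# Hypothetical feature vectors for five word images.
vectors = {
    "a": [0.0, 0.0], "b": [0.1, 0.0], "c": [0.3, 0.0],
    "d": [4.0, 4.0], "e": [4.5, 4.0],
}
cluster = grow_cluster(vectors, size=3)
print(cluster)
print(review([True, True, False, False], max_wrong=2))
```

The `size` argument plays the role of the "predetermined number" in claim 8. The claims state separate determinations for correctly and incorrectly included words; collapsing them into one threshold is a simplification for this sketch.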

Assignees

Ancestry Com Operations Inc

Inventors

Classifications

Code	CPC title
G06V10/762	using clustering, e.g. of similar faces in social networks
G06V30/2264 (primary)	using word shape
G06V30/153	using recognition of characters or words
G06F18/23	Clustering techniques
G06F18/22	Matching criteria, e.g. proximity measures

Patent family

Related publications grouped by family.

Patent family ID: 55402855

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9619702B2 cover?: A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance bet…
Who is the assignee on this patent?: Ancestry Com Operations Inc
What technology area does this patent fall under?: Primary CPC classification G06V30/2264. Mapped technology areas include Physics.
When was this patent published?: Publication date Apr 11, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).