Text recognition and localization with deep learning

US10032072B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10032072-B1
Application number: US-201615188792-A
Country: US
Kind code: B1
Filing date: Jun 21, 2016
Priority date: Jun 21, 2016
Publication date: Jul 24, 2018
Grant date: Jul 24, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Approaches provide for identifying text represented in image data as well as determining a location or region of the image data that includes the text represented in the image data. For example, a camera of a computing device can be used to capture a live camera view of one or more items. The live camera view can be presented to the user on a display screen of the computing device. An application executing on the computing device or at least in communication with the computing device can analyze the image data of the live camera view to identify text represented in the image data as well as determine locations or regions of the image that include the representations. For example, one such recognition approach includes a region proposal process to generate a plurality of candidate bounding boxes, a region filtering process to determine a subset of the plurality of candidate bounding boxes, a region refining process to refine the bounding box coordinates to more accurately fit the identified text, a text recognizer process to recognize words in the refined bounding boxes, and a post-processing process to suppress overlapping bounding boxes to generate a final set of bounding boxes.
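The abstract's final post-processing stage suppresses overlapping bounding boxes but does not say how. One common way to do this is greedy non-maximum suppression based on intersection-over-union (IoU). The sketch below is an illustration under that assumption, not the patent's method; the names `iou` and `suppress_overlaps`, the `(x_min, y_min, x_max, y_max)` box layout, the per-box confidence scores, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_overlaps(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # hypothetical value, not from the patent
) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Toy data: the first two boxes overlap heavily, the third stands alone.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
final = suppress_overlaps(boxes, scores)
print(final)  # prints [0, 2]
```

The first two toy boxes have an IoU of 81/119 (about 0.68), so the lower-scoring one is dropped. A real system would likely tune the threshold and might also compare the recognized words before discarding a box.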

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: obtaining image data that includes representations of text; determining a plurality of regions of interest, a first set of the plurality of regions of interest including the representations of text and a second set of the plurality of regions of interest including potential representations of text; using a first trained neural network to identify the first set of the plurality of regions of interest; using a second trained neural network to update a position of each region of the first set of the plurality of regions of interest to include respective text representations within a predetermined deviation; using a third trained neural network on the first set of the plurality of regions of interest to recognize words associated with respective regions based at least in part on the respective text representations; receiving a selection of one of a recognized word from the list of recognized words; causing a query to be executed against a data source, the query including the recognized word; receiving, in response to the query, result information for a set of items, the set of items determined by comparing the word to a library of words, each word in the library of words associated with at least one item; and generating a list of recognized words.

2. The computer-implemented method of claim 1, further comprising: displaying, on a display screen of a computing device, the result information for the set of items.

3. The computer-implemented method of claim 2, wherein displaying the result information for at least a an item of the set of items includes switching to a result view, the results view including one of a price of the item, a rating of the item, images of the item, or additional information about the item.

4. The computer-implemented method of claim 1, further comprising: displaying a graphical outline for each recognized word, on a display screen overlying an image generated using the image data; and displaying, for a subset of graphical outlines, a label that indicates a word included in an associated graphical outline.

5. The computer-implemented method of claim 4, further comprising: receiving a selection of one of the labels, the selected label associated with a product offered through an electronic marketplace; receiving, in response to the selection, result information for the product; and displaying, on the display screen, the result information for the product.

6. The computer-implemented method of claim 1, further comprising: iteratively applying the second trained neural network and the third trained neural network to recognize the words to a threshold confidence level; identifying overlapping words; and removing the overlapping words to generate a final set of recognized words.

7. The computer-implemented method of claim 1, further comprising: training a neural network to detect regions of interest that do not contain text, wherein the trained neural network corresponds to the first trained neural network.

8. The computer-implemented method of claim 1, wherein the image data is analyzed using a region proposal component operable to determine the plurality of regions of interest, a region filtering component operable to determine a subset of the plurality of regions of interest, a region refining component operable to refine position coordinates associated with each of the subset of the plurality of regions of interest, a text recognizer component operable to recognize words in the subset of the plurality of regions of interest, and a post-processing component operable to suppress overlapping regions of interest to determine a final set of regions of interest.

9. The computer-implemented method of claim 1, wherein determining the plurality of regions of interest further includes: using object region proposal techniques to determine a first predetermined range of regions of interest; using text-specific region proposal techniques to determine a second predetermined range of regions of interest; and combining the first predetermined range of regions of interest and the second predetermined range of regions of interest to determine the plurality of regions of interest.

10. The computer-implemented method of claim 1, wherein the second trained neural network is operable to iteratively adjust a size of each region of the first set of the plurality of regions of interest to accommodate one or more words in their entirety, iteratively adjust a size of a region of the first set of the plurality of regions of interest to accommodate one or more word in their entirety, iteratively reposition a region of the first set of the plurality of regions of interest to accommodate one or more words in their entirety, or iteratively change a shape of a region of the first set of the plurality of regions of interest to accommodate one or more words in their entirety.

11. The computer-implemented method of claim 10, further comprising: generating a background layer using portions of images from a database of images; generating foreground text; merging the background layer and the foreground text to generate a blended image; and adding one or random noise or artifacts to the blended image to generate synthetic text data.

12. The computer-implemented method of claim 11, wherein the first trained neural network, the second trained neural network, and the third trained neural network is trained using synthetic text data.

13. A computing device, comprising: at least one processor; a camera configured to capture image data over a field of view; a display screen; and memory including instructions that, when executed by the at least one processor, cause the computing device to: obtain the image data that includes representations of text; determine a plurality of regions of interest, a first set of the plurality of regions of interest including the representations of text and a second set of the plurality of regions of interest including potential representations of text; use a first trained neural network to identify the first set of the plurality of regions of interest; use a second trained neural network to update a position of each region of the first set of the plurality of regions of interest to include respective text representations within a predetermined deviation; use a third trained neural network on the first set of the plurality of regions of interest to recognize words associated with respective regions based at least in part on the respective text representations; generate a list of recognized words; receive a selection of one of a recognized word from the list of recognized words; submit a query to a search engine using the recognized word as a query term; and receive, in response to the query, result information for a set of items.

14. The computing device of claim 13, wherein the instructions when executed further cause the computing device to: display, on the display screen, the result information for the set of items.

15. The computing device of claim 14, wherein the instructions when executed further cause the computing device to: displaying a graphical outline for each recognized word, on a display screen overlying an image generated using the image data; enable a user to adjust a shape of the graphical outline to generate an updated graphical outline; and submit a new query to the search engine in response to a change in shape of the graphical outline, the new query including words included in the updated graphical outline.

16. The computing device
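The data flow recited in claim 1 (filter candidate regions, refine their positions, recognize words, then look a selected word up in a library of words tied to items) can be sketched as plain function composition. This is only an illustration of the flow: the three callables stand in for the three trained neural networks, and every name here (`Recognition`, `recognize_text`, `lookup_items`, the toy `fake_text` data, the item identifier) is hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max)


@dataclass
class Recognition:
    box: Box
    word: str


def recognize_text(
    image: Any,
    proposals: Sequence[Box],
    is_text: Callable[[Any, Box], bool],  # stand-in for the first network
    refine: Callable[[Any, Box], Box],    # stand-in for the second network
    read_word: Callable[[Any, Box], str], # stand-in for the third network
) -> List[Recognition]:
    """Filter proposals, refine the survivors, then recognize a word in each."""
    text_regions = [box for box in proposals if is_text(image, box)]
    refined = [refine(image, box) for box in text_regions]
    return [Recognition(box, read_word(image, box)) for box in refined]


def lookup_items(word: str, library: Dict[str, List[str]]) -> List[str]:
    """Compare a recognized word to a library mapping words to item ids."""
    return library.get(word.lower(), [])


# Toy stand-ins so the sketch runs without any trained model.
fake_text = {(0, 0, 50, 20): "coffee", (60, 0, 90, 20): None}
proposals = list(fake_text)
results = recognize_text(
    image=None,
    proposals=proposals,
    is_text=lambda image, box: fake_text.get(box) is not None,
    refine=lambda image, box: box,  # a real refiner would adjust coordinates
    read_word=lambda image, box: fake_text[box],
)
library = {"coffee": ["item-123"]}  # hypothetical word-to-item library
items = lookup_items(results[0].word, library)
```

Passing the stages in as callables keeps the sketch independent of any particular model framework. In the claimed method each stage would be a separately trained network, and the lookup would run as a query against a data source rather than an in-memory dictionary.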

Assignees

A9 Com Inc

Inventors

Classifications

  • Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries

  • G06V10/82 (Primary): using neural networks

  • Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries

  • by using electronic viewfinders

  • Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10032072B1 cover?
It covers approaches for identifying text represented in image data, such as a live camera view on a computing device, and for determining the regions of the image that contain that text. The described recognition approach proposes candidate bounding boxes, filters them to a subset, refines the box coordinates to fit the text more accurately, recognizes words in the refined boxes, and suppresses overlapping boxes to produce a final set.
Who is the assignee on this patent?
A9 Com Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 24, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).