Object recognition using multi-modal matching scheme

US9495591B2 · US · B2

Patent metadata
Publication number: US-9495591-B2
Application number: US-201213664295-A
Country: US
Kind code: B2
Filing date: Oct 30, 2012
Priority date: Apr 13, 2012
Publication date: Nov 15, 2016
Grant date: Nov 15, 2016

Abstract

Official abstract text for this publication.

Methods, systems and articles of manufacture for recognizing and locating one or more objects in a scene are disclosed. An image and/or video of the scene are captured. Using audio recorded at the scene, an object search of the captured scene is narrowed down. For example, the direction of arrival (DOA) of a sound can be determined and used to limit the search area in a captured image/video. In another example, keypoint signatures may be selected based on types of sounds identified in the recorded audio. A keypoint signature corresponds to a particular object that the system is configured to recognize. Objects in the scene may then be recognized using a shift invariant feature transform (SIFT) analysis comparing keypoints identified in the captured scene to the selected keypoint signatures.
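The DOA-based narrowing mentioned in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation. It assumes two microphones, a far-field source, a pinhole camera with a known horizontal field of view, and a microphone axis parallel to the image's horizontal axis, with microphone A on the right-hand side of the view. All function names, the microphone spacing, and the margin are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) of sig_b relative to sig_a that
    maximizes their cross-correlation. A positive lag means the sound
    reached microphone B later than microphone A."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_angle_deg(lag, sample_rate, mic_spacing):
    """Convert an inter-microphone delay to an angle from broadside.
    Positive angles point toward microphone A (assumed sign convention)."""
    ratio = SPEED_OF_SOUND * lag / (sample_rate * mic_spacing)
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise pushing |ratio| > 1
    return math.degrees(math.asin(ratio))


def doa_to_column_window(angle_deg, image_width, hfov_deg, margin_deg):
    """Map an angle (plus or minus a margin) to a range of pixel columns
    using a pinhole model; only this window would be searched."""
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg / 2))
    center = image_width / 2

    def column(deg):
        deg = max(-89.0, min(89.0, deg))  # keep tan() finite
        return center + focal_px * math.tan(math.radians(deg))

    lo = int(math.floor(column(angle_deg - margin_deg)))
    hi = int(math.ceil(column(angle_deg + margin_deg)))
    return max(0, min(image_width, lo)), max(0, min(image_width, hi))


# Usage: a pulse that reaches microphone B two samples after microphone A.
mic_a = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
lag = estimate_delay_samples(mic_a, mic_b, max_lag=4)
angle = doa_angle_deg(lag, sample_rate=48000, mic_spacing=0.1)
window = doa_to_column_window(angle, image_width=640, hfov_deg=60.0, margin_deg=10.0)
```

Keypoint detection would then run only on the columns inside `window`, which is the sense in which the audio limits the image search area.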

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by a device, the method comprising: computing a plurality of acoustic-recognition features from audio recorded at a scene; comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine a sound source type of an object; selecting keypoints corresponding to the object based on the sound source type; and identifying the object based on the selected keypoints and the sound source type.

2. The method of claim 1, further comprising: selecting one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; identifying a plurality of keypoints in an image of the scene; and comparing the keypoints to the keypoint signatures to identify the object.

3. The method of claim 1, further comprising: selecting a portion of a scene image based on the audio recorded at the scene; and selecting the keypoints only from within the portion of the image.

4. The method of claim 3, wherein selecting a portion of the image based on the audio recorded at the scene includes: determining an audio direction of arrival (DOA) from the audio; and selecting the portion of the image based on the audio DOA.

5. The method of claim 4, wherein determining the audio DOA includes: receiving the audio at a plurality of microphones located at the scene, whereby producing a plurality of microphone signals; and determining the audio DOA based on the microphone signals.

6. The method of claim 1, further comprising: computing a plurality of local motion vectors from a video recording of the scene; and identifying the object by comparing the local motion vectors to a database of predetermined local motion vectors corresponding to one or more objects and by comparing the keypoints to one or more keypoint signatures.

7. The method of claim 1, wherein identifying the object is based on comparing the keypoints to one or more keypoint signatures.

8. The method of claim 7, wherein the acoustic-recognition features include mel-frequency cepstral coefficients.

9. The method of claim 1, further comprising: determining range information for one or more objects appearing in an image; and analyzing the keypoints based on the range information.

10. The method of claim 9, wherein determining range information is selected from the group consisting of determining range information using an auto-focus camera, determining range information using a multi-camera image disparity estimation and any combination of the foregoing.

11. An apparatus, comprising: an audio processor configured to compute a plurality of acoustic-recognition features from audio recorded at a scene; a keypoint selector configured to select keypoints corresponding to an object based on a sound source type; and a matching device configured to identify the object based on the selected keypoints and comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine the sound source type of the object.

12. The apparatus of claim 11, further comprising: a keypoint detector configured to identify a plurality of keypoints in an image of a scene; wherein the keypoint selector is configured to select one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; and wherein the matching device is configured to compare the keypoints to the keypoint signatures to identify an object in the scene.

13. The apparatus of claim 11, further comprising: a first selector configured to select a portion of an image of the scene based on the audio recorded at the scene; and a second selector configured to select the keypoints only from within the portion of the image.

14. The apparatus of claim 13, wherein the first selector includes: a detector configured to determine an audio direction of arrival (DOA) from the audio; and a third selector configured to select the portion of the image based on the audio DOA.

15. The apparatus of claim 14, wherein the detector includes: a plurality of microphones located at the scene for receiving the audio, producing a plurality of microphone signals; and an audio processor configured to determine the audio DOA based on the microphone signals.

16. The apparatus of claim 11, further comprising: a video processor configured to compute a plurality of local motion vectors from a video recording of the scene; wherein the matching device is configured to identify the object by comparing the local motion vectors to a database of predetermined local motion vectors corresponding to one or more objects and by comparing the keypoints to one or more keypoint signatures.

17. The apparatus of claim 11, wherein the matching device is configured to identify the object by comparing the keypoints to one or more keypoint signatures.

18. The apparatus of claim 17, wherein the acoustic-recognition features include mel-frequency cepstral coefficients.

19. The apparatus of claim 11, further comprising: a range detector configured to determine range information for one or more objects appearing in an image; and a keypoint detector configured to analyze the keypoints based on the range information.

20. The apparatus of claim 19, wherein the range detector includes a detector selected from the group consisting of an auto-focus camera, a multi-camera array and any combination of the foregoing.

21. An apparatus, comprising: means for computing a plurality of acoustic-recognition features from audio recorded at a scene; means for comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine a sound source type of an object; means for selecting keypoints corresponding to the object based on the sound source type; and means for identifying the object based on the selected keypoints and the sound source type.

22. The apparatus of claim 21, further comprising: means for selecting one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; means for identifying a plurality of keypoints in an image of the scene; and means for comparing the keypoints to the keypoint signatures to identify the object in the scene.

23. The apparatus of claim 21, further comprising: means for selecting a portion of an image of the scene based on the audio recorded at the scene; and means for selecting the keypoints only from within the portion of the image.

24. The apparatus of claim 23, wherein the means for selecting a portion of the image based on the audio recorded at the scene includes: means for determining an audio direction of arrival (DOA) from the audio; and means for selecting the portion of the image based on the audio DOA.

25. The apparatus of claim 24, wherein means for determining the audio DOA includes: means for receiving the audio at a plurality of microphones located at the scene, whereby producing a plurality of microphone signals; and means for determining the audio DOA based on the microphone signals.

26. The apparatus of claim 21, further comprising: means for computing a plurality of local motion vectors from a video recording of the scene; and means for identifying the object by comparing t…
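The flow of claim 1 (acoustic features, then sound source type, then keypoint selection, then identification) can be sketched as follows. This is an illustrative toy under stated assumptions, not the claimed implementation: the reference feature vectors, object names, descriptor values, and distance threshold are all invented for the example. Real acoustic features might be mel-frequency cepstral coefficients (as in claims 8 and 18), and real SIFT descriptors are typically 128-dimensional.

```python
import math

# Hypothetical predetermined acoustic-recognition features, one per sound source type.
ACOUSTIC_REFERENCES = {
    "piano": [0.9, 0.1, 0.3],
    "dog": [0.2, 0.8, 0.5],
}

# Hypothetical keypoint signatures, grouped by sound source type and object.
KEYPOINT_SIGNATURES = {
    "piano": {
        "grand piano": [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
        "upright piano": [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]],
    },
    "dog": {
        "terrier": [[0.5, 0.5, 0.5, 0.5], [0.2, 0.7, 0.1, 0.9]],
    },
}


def classify_sound(audio_features):
    """Pick the sound source type whose reference vector is nearest."""
    return min(
        ACOUSTIC_REFERENCES,
        key=lambda kind: math.dist(audio_features, ACOUSTIC_REFERENCES[kind]),
    )


def identify_object(audio_features, image_descriptors, max_distance=0.5):
    """Return (sound_type, object_name); object_name is None if nothing matches.

    Only signatures selected by the sound source type are compared against
    the image descriptors, which is how the audio narrows the visual search.
    """
    sound_type = classify_sound(audio_features)
    best_name, best_matches = None, 0
    for name, signature in KEYPOINT_SIGNATURES[sound_type].items():
        # Count signature keypoints that have a close descriptor in the image.
        matches = sum(
            1
            for sig_desc in signature
            if any(math.dist(sig_desc, d) <= max_distance for d in image_descriptors)
        )
        if matches > best_matches:
            best_name, best_matches = name, matches
    return sound_type, best_name


# Usage: audio resembling the "piano" reference, image descriptors near "grand piano".
result = identify_object(
    [0.85, 0.15, 0.25],
    [[0.9, 0.1, 0.0, 1.0], [0.1, 0.9, 1.0, 0.1], [5.0, 5.0, 5.0, 5.0]],
)
```

Restricting the comparison to the signatures of one sound source type keeps the matching cost proportional to that subset rather than to the whole signature database.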

Assignees

Qualcomm Inc

Inventors

Classifications

  • H04S7/30 (Primary)

    Control circuits for electronic adaptation of the sound field · CPC title

  • the classifiers operating on different input data, e.g. multi-modal recognition · CPC title

  • Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title

  • of results relating to different input data, e.g. multimodal recognition · CPC title

  • for combining the signals of two or more microphones (specially adapted for hearing aids H04R25/407) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Qualcomm Inc
What technology area does this patent fall under?
Primary CPC classification H04S7/30. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Nov 15, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).