US9495591B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9495591-B2 |
| Application number | US-201213664295-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 30, 2012 |
| Priority date | Apr 13, 2012 |
| Publication date | Nov 15, 2016 |
| Grant date | Nov 15, 2016 |
Official abstract text for this publication.
Methods, systems and articles of manufacture for recognizing and locating one or more objects in a scene are disclosed. An image and/or video of the scene are captured. Using audio recorded at the scene, an object search of the captured scene is narrowed down. For example, the direction of arrival (DOA) of a sound can be determined and used to limit the search area in a captured image/video. In another example, keypoint signatures may be selected based on types of sounds identified in the recorded audio. A keypoint signature corresponds to a particular object that the system is configured to recognize. Objects in the scene may then be recognized using a shift invariant feature transform (SIFT) analysis comparing keypoints identified in the captured scene to the selected keypoint signatures.
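The DOA example in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: it assumes a two-microphone far-field model, a pinhole camera whose optical axis is aligned with the array broadside, and hypothetical names and defaults (`estimate_delay_samples`, `doa_from_delay`, `search_columns`, a 60 degree field of view, a 10 degree margin). The delay between the two microphone signals is found by brute-force cross-correlation, converted to an angle, and then mapped to the band of pixel columns in which keypoints would be searched.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees Celsius


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) that maximises sum(a[n] * b[n + lag]).

    A positive result means sig_b is a delayed copy of sig_a,
    i.e. the sound reached microphone A first.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, a in enumerate(sig_a):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += a * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_from_delay(delay_s, mic_spacing_m):
    """Far-field angle in degrees, 0 = broadside, positive toward microphone A."""
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against noise and rounding
    return math.degrees(math.asin(ratio))


def search_columns(doa_deg, image_width, hfov_deg=60.0, margin_deg=10.0):
    """Map DOA +/- margin to a (first, last) pixel-column range.

    Assumes a pinhole camera with horizontal field of view hfov_deg.
    Which image side corresponds to positive angles depends on how the
    microphones are mounted; here positive angles map to higher columns.
    """
    half = hfov_deg / 2.0
    focal_px = (image_width / 2.0) / math.tan(math.radians(half))

    def column(angle_deg):
        angle_deg = max(-half, min(half, angle_deg))  # stay inside the view
        return image_width / 2.0 + focal_px * math.tan(math.radians(angle_deg))

    first = int(math.floor(column(doa_deg - margin_deg)))
    last = int(math.ceil(column(doa_deg + margin_deg)))
    return max(0, first), min(image_width, last)


# Toy signals: mic_b hears the same pulse three samples after mic_a.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_delay_samples(mic_a, mic_b, max_lag=5)
angle = doa_from_delay(lag / 48000.0, mic_spacing_m=0.1)
columns = search_columns(angle, image_width=640)
print(lag, round(angle, 2), columns)
```

Restricting keypoint detection to `columns` is one way to realise "limit the search area"; a practical system would likely use more microphones and a more robust delay estimator.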
Opening claim text (preview).
What is claimed is:

1. A method performed by a device, the method comprising: computing a plurality of acoustic-recognition features from audio recorded at a scene; comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine a sound source type of an object; selecting keypoints corresponding to the object based on the sound source type; and identifying the object based on the selected keypoints and the sound source type.

2. The method of claim 1, further comprising: selecting one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; identifying a plurality of keypoints in an image of the scene; and comparing the keypoints to the keypoint signatures to identify the object.

3. The method of claim 1, further comprising: selecting a portion of a scene image based on the audio recorded at the scene; and selecting the keypoints only from within the portion of the image.

4. The method of claim 3, wherein selecting a portion of the image based on the audio recorded at the scene includes: determining an audio direction of arrival (DOA) from the audio; and selecting the portion of the image based on the audio DOA.

5. The method of claim 4, wherein determining the audio DOA includes: receiving the audio at a plurality of microphones located at the scene, whereby producing a plurality of microphone signals; and determining the audio DOA based on the microphone signals.

6. The method of claim 1, further comprising: computing a plurality of local motion vectors from a video recording of the scene; and identifying the object by comparing the local motion vectors to a database of predetermined local motion vectors corresponding to one or more objects and by comparing the keypoints to one or more keypoint signatures.

7. The method of claim 1, wherein identifying the object is based on comparing the keypoints to one or more keypoint signatures.

8. The method of claim 7, wherein the acoustic-recognition features include mel-frequency cepstral coefficients.

9. The method of claim 1, further comprising: determining range information for one or more objects appearing in an image; and analyzing the keypoints based on the range information.

10. The method of claim 9, wherein determining range information is selected from the group consisting of determining range information using an auto-focus camera, determining range information using a multi-camera image disparity estimation and any combination of the foregoing.

11. An apparatus, comprising: an audio processor configured to compute a plurality of acoustic-recognition features from audio recorded at a scene; a keypoint selector configured to select keypoints corresponding to an object based on a sound source type; and a matching device configured to identify the object based on the selected keypoints and comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine the sound source type of the object.

12. The apparatus of claim 11, further comprising: a keypoint detector configured to identify a plurality of keypoints in an image of a scene; wherein the keypoint selector is configured to select one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; and wherein the matching device is configured to compare the keypoints to the keypoint signatures to identify an object in the scene.

13. The apparatus of claim 11, further comprising: a first selector configured to select a portion of an image of the scene based on the audio recorded at the scene; and a second selector configured to select the keypoints only from within the portion of the image.

14. The apparatus of claim 13, wherein the first selector includes: a detector configured to determine an audio direction of arrival (DOA) from the audio; and a third selector configured to select the portion of the image based on the audio DOA.

15. The apparatus of claim 14, wherein the detector includes: a plurality of microphones located at the scene for receiving the audio, producing a plurality of microphone signals; and an audio processor configured to determine the audio DOA based on the microphone signals.

16. The apparatus of claim 11, further comprising: a video processor configured to compute a plurality of local motion vectors from a video recording of the scene; wherein the matching device is configured to identify the object by comparing the local motion vectors to a database of predetermined local motion vectors corresponding to one or more objects and by comparing the keypoints to one or more keypoint signatures.

17. The apparatus of claim 11, wherein the matching device is configured to identify the object by comparing the keypoints to one or more keypoint signatures.

18. The apparatus of claim 17, wherein the acoustic-recognition features include mel-frequency cepstral coefficients.

19. The apparatus of claim 11, further comprising: a range detector configured to determine range information for one or more objects appearing in an image; and a keypoint detector configured to analyze the keypoints based on the range information.

20. The apparatus of claim 19, wherein the range detector includes a detector selected from the group consisting of an auto-focus camera, a multi-camera array and any combination of the foregoing.

21. An apparatus, comprising: means for computing a plurality of acoustic-recognition features from audio recorded at a scene; means for comparing the acoustic-recognition features to predetermined acoustic-recognition features corresponding to one or more objects to determine a sound source type of an object; means for selecting keypoints corresponding to the object based on the sound source type; and means for identifying the object based on the selected keypoints and the sound source type.

22. The apparatus of claim 21, further comprising: means for selecting one or more keypoint signatures corresponding to one or more objects, based on audio recorded at the scene; means for identifying a plurality of keypoints in an image of the scene; and means for comparing the keypoints to the keypoint signatures to identify the object in the scene.

23. The apparatus of claim 21, further comprising: means for selecting a portion of an image of the scene based on the audio recorded at the scene; and means for selecting the keypoints only from within the portion of the image.

24. The apparatus of claim 23, wherein the means for selecting a portion of the image based on the audio recorded at the scene includes: means for determining an audio direction of arrival (DOA) from the audio; and means for selecting the portion of the image based on the audio DOA.

25. The apparatus of claim 24, wherein means for determining the audio DOA includes: means for receiving the audio at a plurality of microphones located at the scene, whereby producing a plurality of microphone signals; and means for determining the audio DOA based on the microphone signals.

26. The apparatus of claim 21, further comprising: means for computing a plurality of local motion vectors from a video recording of the scene; and means for identifying the object by comparing t
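
The steps of claim 1 (compute acoustic features, compare them to stored features to get a sound source type, select keypoints for that type, identify the object) can be sketched as a toy pipeline. Everything here is an illustrative assumption, not the patented method: the short feature vectors stand in for real acoustic features such as MFCCs, the two-dimensional "descriptors" stand in for real keypoint descriptors, and the function names, thresholds, and data are hypothetical. Matching uses a nearest-neighbour distance check with a ratio test, a common choice for keypoint descriptors.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_sound(features, templates):
    """Return the label of the stored acoustic template nearest to features."""
    return min(templates, key=lambda label: distance(features, templates[label]))


def count_matches(scene_descriptors, signature, ratio=0.8, max_distance=1.0):
    """Count scene descriptors that match one keypoint signature.

    A descriptor matches when its nearest signature entry is within
    max_distance and clearly closer than the second nearest (ratio test).
    """
    matches = 0
    for descriptor in scene_descriptors:
        dists = sorted(distance(descriptor, s) for s in signature)
        if not dists:
            continue
        second = dists[1] if len(dists) > 1 else float("inf")
        if dists[0] <= max_distance and dists[0] < ratio * second:
            matches += 1
    return matches


def identify(audio_features, scene_descriptors, templates, signatures,
             min_matches=2):
    """Return (sound_type, object_name or None)."""
    sound_type = classify_sound(audio_features, templates)
    # Only signatures filed under the detected sound type are compared,
    # which is how the audio narrows the visual search in this sketch.
    candidates = signatures.get(sound_type, {})
    best_name, best_count = None, 0
    for name, signature in candidates.items():
        n = count_matches(scene_descriptors, signature)
        if n > best_count:
            best_name, best_count = name, n
    return sound_type, (best_name if best_count >= min_matches else None)


# Hypothetical stored data.
templates = {"piano": [1.0, 0.0, 0.2], "dog": [0.0, 1.0, 0.8]}
signatures = {
    "piano": {
        "grand piano": [[0, 0], [10, 10], [20, 0]],
        "upright piano": [[5, 5], [15, 15], [25, 5]],
    },
    "dog": {"terrier": [[100, 100], [110, 120]]},
}

scene = [[0.5, 0.2], [10.1, 9.8], [19.7, 0.4], [50, 50]]
result = identify([0.9, 0.1, 0.3], scene, templates, signatures)
print(result)
```

Keying the signature database by sound type keeps the visual comparison small; a dependent-claim variant could additionally restrict `scene` to keypoints inside an image region chosen from the audio DOA.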
Control circuits for electronic adaptation of the sound field · CPC title
the classifiers operating on different input data, e.g. multi-modal recognition · CPC title
Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title
of results relating to different input data, e.g. multimodal recognition · CPC title
for combining the signals of two or more microphones (specially adapted for hearing aids H04R25/407) · CPC title