US2016005394A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016005394-A1 |
| Application number | US-201314766246-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 20, 2013 |
| Priority date | Feb 14, 2013 |
| Publication date | Jan 7, 2016 |
| Grant date | — |
Official abstract text for this publication.
Provided are an apparatus and a method for rapidly extracting a target sound from a sound signal in which sounds generated by a plurality of sound sources are mixed. A voice recognition apparatus includes a tracking unit that detects a sound source direction and a voice segment and executes a sound source extraction process, and a voice recognition unit that receives the sound source extraction result and executes a voice recognition process. In the tracking unit, a segment being created management unit, which creates and manages a voice segment for each sound source, sequentially detects the sound source direction, sequentially updates the voice segment estimated by connecting the detection results in the time direction, creates an extraction filter for sound source extraction after a predetermined time has elapsed, and sequentially creates a sound source extraction result by applying the extraction filter to the input voice signal. The voice recognition unit sequentially executes the voice recognition process on the partial sound source extraction results and outputs a voice recognition result.
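The tracking behavior described in the abstract (connect per-frame direction detections over time into one segment per source, and allow an extraction filter only after a set delay from the segment beginning) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `Segment` and `Tracker` classes, the `tolerance`, `max_gap`, and `filter_delay` parameters, the use of degrees, and the simple averaging of directions. The sketch does not compute an actual extraction filter and ignores angle wraparound at 0/360 degrees.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One voice segment under construction for a single sound source."""
    direction: float           # running direction estimate, in degrees
    begin: int                 # frame index where the segment started
    last_seen: int             # most recent frame with a matching detection
    end: Optional[int] = None  # set once the segment is closed

    def filter_ready(self, frame: int, filter_delay: int) -> bool:
        # Hypothetical rule: a filter may be built once `filter_delay`
        # frames have elapsed since the segment beginning.
        return frame - self.begin >= filter_delay


class Tracker:
    """Connects per-frame direction detections into segments (illustrative)."""

    def __init__(self, tolerance: float = 10.0, max_gap: int = 3,
                 filter_delay: int = 5) -> None:
        self.tolerance = tolerance        # max direction difference to connect
        self.max_gap = max_gap            # silent frames tolerated before closing
        self.filter_delay = filter_delay  # frames to wait before filtering
        self.active: List[Segment] = []
        self.closed: List[Segment] = []

    def step(self, frame: int, directions: List[float]) -> List[Segment]:
        """Process one frame; return active segments old enough for a filter."""
        for d in directions:
            match = next((s for s in self.active
                          if abs(s.direction - d) <= self.tolerance), None)
            if match is None:
                # No nearby segment: start a new one for this source.
                self.active.append(Segment(direction=d, begin=frame,
                                           last_seen=frame))
            else:
                # Extend the existing segment in the time direction.
                match.direction = 0.5 * (match.direction + d)
                match.last_seen = frame

        still_active: List[Segment] = []
        for s in self.active:
            if frame - s.last_seen > self.max_gap:
                s.end = s.last_seen       # close at the last matching frame
                self.closed.append(s)
            else:
                still_active.append(s)
        self.active = still_active

        return [s for s in self.active
                if s.filter_ready(frame, self.filter_delay)]
```

Calling `step` once per frame with the directions detected in that frame keeps one `Segment` per source. In a fuller system, the segments returned by `step` would be the ones for which partial extraction results could start flowing to the recognizer before the segment end is known.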
Opening claim text (preview).
1. A voice recognition apparatus, comprising: a tracking unit for detecting a sound source direction and a voice segment to execute a sound source extraction process; and a voice recognition unit for inputting a sound source extraction result from the tracking unit to execute a voice recognition process, the tracking unit creating a segment being created management unit that creates and manages a voice segment per unit of sound source, each segment being created management unit created sequentially detecting a sound source direction to execute a voice segment creation process that sequentially updates a voice segment estimated by connecting a detection result to a time direction, creating an extraction filter for a sound source extraction after a predetermined time is elapsed from a voice segment beginning, and sequentially applying the extraction filter created to an input voice signal to sequentially create a partial sound source extraction result of a voice segment, the tracking unit sequentially outputting the partial sound source extraction result created by the segment being created management unit to the voice recognition unit, the voice recognition unit sequentially executing the voice recognition process to the partial sound source extraction result inputted from the tracking unit to output a voice recognition result.
2. The voice recognition apparatus according to claim 1, wherein the tracking unit executes a voice segment creation process to connect collectively a plurality of sound source direction information detected in accordance with a plurality of different methods to a time direction in each segment being created management unit.
3. The voice recognition apparatus according to claim 1, wherein the tracking unit immediately executes beginning or end determination process if it detects that a user's sign detected from an input image from an image input unit represents beginning or end of a voice segment.
4. The voice recognition apparatus according to claim 1, wherein the segment being created management unit of the tracking unit creates an extraction filter for preferentially extracting a voice of a specific sound source from an observation signal by utilizing an observation signal inputted from a time before beginning of a voice segment to a time when a filter is created.
5. The voice recognition apparatus according to claim 1, wherein the segment being created management unit of the tracking unit applies an extraction filter for preferentially extracting a voice of a specific sound source from an observation signal, estimates a whole dead corner space filter that attenuates a voice of all sound sources included in the observation signal used in the estimation of the extraction filter, and subtracts a result of applying the whole dead corner space filter from a result of applying the extraction filter to remove a disturbing sound not included in the observation signal and to create a sound source extraction result.
6. The voice recognition apparatus according to claim 1, wherein the segment being created management unit of the tracking unit changes a mask that decreases a transmittance of the observation signal for each frequency and each time as a proportion of a sound other than a target sound is higher than a target sound in the observation signal corresponding to the segment being created, executes time frequency masking process that sequentially applies the mask to the observation signal, and extracts a sound source of the target sound.
7. The voice recognition apparatus according to claim 1, further comprising: an extraction result buffering unit for temporary storing the sound source extraction result generated by the tracking unit; and a ranking unit for determining a priority to output a plurality of the sound source extraction results corresponding to the respective sound sources stored in the extraction result buffering unit, the ranking unit setting a priority of the sound source extraction result corresponding to the voice segment having the beginning or the end determined based on a user's explicit sign.
8. The voice recognition apparatus according to claim 7, wherein the tracking unit sets a “registered attribute” in order to identify a voice segment set based on a speaker's explicit sign provided based on an image analysis, and the ranking unit executes a process that sets a priority of the voice segment to which the registered attribute is set to high.
9. The voice recognition apparatus according to claim 8, wherein the ranking unit determines a priority to output to the voice recognition unit by applying the following scales: (Scale 1) the voice segment having the attribute of “registered” has a priority, if there are a plurality of the voice segments having the attribute of “registered”, the voice segment having the earliest beginning has a priority; (Scale 2) as to the voice segment not having the attribute of “registered”, the voice segment having the end already determined has a priority, if there are a plurality of the voice segments having the ends already determined, the voice segment having the earliest end has a priority; (Scale 3) the voice segment having the end not determined, the voice segment having the earliest beginning has a priority.
10. The voice recognition apparatus according to claim 7, wherein the voice recognition unit has a plurality of decoders for executing a voice recognition process, requests an output of a sound source extraction result generated by the tracking unit in accordance with availability of the decoders, inputs a sound source extraction result in accordance with the priority, and preferentially executes a voice recognition on a sound source extraction result having a high priority.
11. The voice recognition apparatus according to claim 1, wherein the tracking unit creates a feature amount adapted to a form used in a voice recognition of the voice recognition unit in each segment being created management unit, and outputs the feature amount created to the voice recognition unit.
12. The voice recognition apparatus according to claim 11, wherein the feature amount is a Mel-Frequency Cepstral Coefficient.
13. The voice recognition apparatus according to claim 1, further comprising: a sound input unit including a microphone array; an image input unit having a camera; a sound source direction estimation unit for estimating a sound source direction based on an inputted sound from the sound input unit; and an image process unit for analyzing a sound source direction based on an analysis of an inputted image from the image input unit, the tracking unit creating one integrated sound source direction information by applying sound source direction information created by the sound source direction estimation unit and sound source direction information created by the image process unit.
14. The voice recognition apparatus according to claim 13, wherein the image process unit includes a lip image process unit for detecting a movement of a speaker's lip area based on an analysis of an input image from the image input unit; and a hand image process unit for detecting a movement of a speaker's hand area.
15. The voice recognition apparatus according to claim 13, wherein the tracking unit sets an “registered attribute” in order to identify a voice segment set based on a speaker's explicit sign inputted from the image process unit, and performs a merge process between a voice segment having a registered attribute and a voice segment not having a registere
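The three ranking scales recited in claim 9 can be read as a sort key, and the following minimal sketch shows that reading. It assumes the scales form strict tiers (every Scale 1 segment ahead of every Scale 2 segment, and so on), which is an interpretation of the claim wording and not something the claim states outright. The `VoiceSegment` class, its field names, and the `rank` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VoiceSegment:
    """Hypothetical record for a buffered sound source extraction result."""
    name: str
    begin: float                  # segment beginning time, in seconds
    end: Optional[float] = None   # None while the end is not yet determined
    registered: bool = False      # True if set by a speaker's explicit sign


def priority_key(seg: VoiceSegment) -> Tuple[int, float]:
    """Map a segment to (tier, tie-breaker); smaller sorts first."""
    if seg.registered:
        return (0, seg.begin)     # Scale 1: registered, earliest beginning first
    if seg.end is not None:
        return (1, seg.end)       # Scale 2: end determined, earliest end first
    return (2, seg.begin)         # Scale 3: end undetermined, earliest beginning first


def rank(segments: List[VoiceSegment]) -> List[VoiceSegment]:
    """Return segments in the order they would be handed to the recognizer."""
    return sorted(segments, key=priority_key)
```

Under this reading, a decoder that becomes free (claim 10) would take the first element of `rank(buffered_segments)`. Because `sorted` is stable, segments with identical keys keep their buffer order.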
Voice signal separating · CPC title
Segmentation; Word boundary detection · CPC title
Constructional details of speech recognition systems · CPC title