Computer system employing speech recognition for detection of non-speech audio
US-2015002611-A1 · Jan 1, 2015 · US
US9595271B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9595271-B2 |
| Application number | US-201313929375-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 27, 2013 |
| Priority date | Jun 27, 2013 |
| Publication date | Mar 14, 2017 |
| Grant date | Mar 14, 2017 |
Official abstract text for this publication.
A computer system executing a computer audio application such as video conferencing applies audio detection and speech recognition to an input audio stream to generate respective audio detection and speech recognition signals. A function is applied to the audio detection and speech recognition signals to generate a non-speech audio detection signal identifying presence of non-speech audio in the input audio stream when the audio detection signal is asserted and the speech recognition signal is not asserted. A control or indication action is performed in the computer system based on assertion of the non-speech audio detection signal.
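The combining function described in the abstract can be illustrated with a minimal sketch. It assumes each detector yields one boolean per audio frame. The names `FrameSignals`, `non_speech_audio`, and `choose_action`, and the action labels, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSignals:
    """Per-frame detector outputs (hypothetical container, not from the patent)."""
    audio_detected: bool      # audio detection signal is asserted
    speech_recognized: bool   # speech recognition signal is asserted


def non_speech_audio(signals: FrameSignals) -> bool:
    """Asserted when audio is present but no speech is recognized."""
    return signals.audio_detected and not signals.speech_recognized


def choose_action(signals: FrameSignals) -> str:
    """Pick an illustrative control/indication action from the combined signal.

    The action labels are assumptions; the abstract only says that some
    control or indication action is performed.
    """
    return "flag_non_speech" if non_speech_audio(signals) else "no_action"
```

In this sketch, a frame with audio but no recognized speech (for example keyboard noise) produces the non-speech signal, while silence and recognized speech both leave it deasserted.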
Opening claim text (preview).
What is claimed is:
1. A method of operating a video conferencing system having a conference server connected to conference clients by a network, the conference clients receiving audio streams from respective participants and having respective conference graphical user interfaces (GUIs), comprising: by the conference server, (1) receiving audio streams from the conference clients, mixing the audio streams together to generate a conference audio feed, and sending the conference audio feed to the conference clients, and (2) generating conference graphical content and sending the conference graphical content to the conference clients for local rendering in the respective conference GUIs; applying audio detection processing and speech recognition processing to an input audio stream of a participant to generate distinct audio detection and speech recognition signals, the audio detection signal being generated upon the audio detection processing indicating presence of audio in the audio stream in both a speech audio condition and a non-speech audio condition of the audio stream of the participant, the speech recognition signal being generated upon the speech recognition processing indicating presence of speech audio in the audio stream in the speech audio condition; processing the audio detection and speech recognition signals to identify distinct conditions of (1) a silence condition, (2) the speech audio condition, and (3) the non-speech audio condition of the audio stream of the participant, the silence condition being identified by the non-generation of the audio detection signal, the speech audio condition being identified by the generation of the speech recognition signal, and the non-speech audio condition being identified by the generation of the audio detection signal while the speech recognition signal is not generated; and operating the conference GUIs to reflect the silence, speech audio, and non-speech audio conditions of the audio stream of the participant, including (a) in the speech audio condition, providing a first graphical identification of the participant as a speaking participant, (b) in both the silence condition and the non-speech audio condition, providing a second graphical identification of the participant as a non-speaking participant, and (c) in the non-speech audio condition, providing a third graphical identification of the participant as generating non-speech audio in the audio stream.
2. A method according to claim 1, further including muting or reducing amplitude of the audio stream in the non-speech audio condition.
3. A method according to claim 1, wherein providing the graphical indications includes applying corresponding different treatments to respective camera viewing windows of the speakers and non-speakers, the treatments being selected from relative sizes, relative positions, and relative highlighting.
4. A method according to claim 1, wherein the applying and processing steps are performed at the conference server for the audio stream as received from the respective conference client, and wherein the conference server generates the first and second graphical indications and sends them to the respective conference client.
5. A method according to claim 1, wherein the applying and processing steps are performed at the conference client for the audio stream of the participant.
6. A method according to claim 5, wherein the discriminating between speech and non-speech audio at each of the conference clients generates respective discrimination results, and the discrimination results from all the conference clients are provided to the conference server to enable the conference server to identify the speaker and non-speakers and to return information regarding the speaker and non-speakers to the conference clients for controlling respective graphical user interfaces accordingly.
7. A method according to claim 1, wherein the speech recognition processing provides a speech output and a separate confidence output indicating a level of confidence in accuracy of the speech output, and wherein a condition of no speech being recognized is based on the confidence output indicating a level of confidence below a predetermined threshold.
8. A method according to claim 1, wherein the speech recognition processing is a secondary use of the speech recognition processing in the video conferencing system, and wherein the video conferencing system makes a distinct primary use of the speech recognition processing for obtaining speech content.
9. A method according to claim 8, wherein the primary use includes making a transcription of a speech-carrying session.
10. A method according to claim 1, wherein the audio detection processing is done using level detection by measuring an amplitude of an audio signal in the audio stream and comparing the measured amplitude against an amplitude threshold.
11. A method according to claim 1, wherein there is a split of audio processing for different ones of the conference clients, the conference server performing the audio detection processing for lower-performance conference clients, and higher-performance conference clients performing their own audio detection processing.
12. A method according to claim 11, wherein conference clients having poor network performance perform their own audio detection processing to avoid reduced speech recognition accuracy affected by sending audio samples to the conference server via the poor-performance network.
13. A non-transitory computer-readable medium storing computer program instructions, the instructions being executable by a video conferencing system having a conference server connected to conference clients by a network, the conference clients receiving audio streams from respective participants and having respective conference graphical user interfaces (GUIs), the execution of the instructions causing the video conferencing system to perform a method including: by the conference server, (1) receiving audio streams from the conference clients, mixing the audio streams together to generate a conference audio feed, and sending the conference audio feed to the conference clients, and (2) generating conference graphical content and sending the conference graphical content to the conference clients for local rendering in the respective conference GUIs; applying separate audio detection processing and speech recognition processing to an input audio stream of a participant to generate distinct audio detection and speech recognition signals, the audio detection signal being generated upon the audio detection processing indicating presence of audio in the audio stream in both a speech audio condition and a non-speech audio condition of the audio stream of the participant, the speech recognition signal being generated upon the speech recognition processing indicating presence of speech audio in the audio stream in the speech audio condition; processing the audio detection and speech recognition signals to identify distinct conditions of (1) a silence condition, (2) the speech audio condition, and (3) the non-speech audio condition of the audio stream of the participant, the silence condition being identified by the non-generation of the audio detection signal, the speech audio condition being identified by the generation of the speech recognition signal, and the non-speech audio condition being identified by the generation of the audio detection signal while the speech recognition signal is not generated; and operating the conference GUIs to reflect the silence, speech audio, and non-speech audio conditions of the audio stream of the participant, including (a) in the speech audio
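As a reading aid for claims 1, 7, and 10, the three-condition logic can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: amplitude is measured as RMS over a frame of float samples, the threshold values are placeholders, and every function name and GUI label is hypothetical. The claims do not say what happens if speech is reported without audio being detected; the sketch treats that case as silence.

```python
import math
from enum import Enum
from typing import Optional, Sequence, Set


class Condition(Enum):
    SILENCE = "silence"
    SPEECH = "speech"
    NON_SPEECH = "non_speech"


def rms_amplitude(samples: Sequence[float]) -> float:
    """Root-mean-square level of a frame (RMS is an assumed amplitude measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def audio_detected(samples: Sequence[float], amplitude_threshold: float = 0.02) -> bool:
    """Level detection as in claim 10: compare measured amplitude to a threshold.

    The default threshold is a placeholder, not a value from the patent.
    """
    return rms_amplitude(samples) >= amplitude_threshold


def speech_recognized(confidence: Optional[float], confidence_threshold: float = 0.5) -> bool:
    """Confidence gating as in claim 7: low confidence counts as no speech.

    `confidence` is None when the recognizer produced no output (assumption).
    """
    return confidence is not None and confidence >= confidence_threshold


def classify(audio: bool, speech: bool) -> Condition:
    """Map the two signals to the three conditions of claim 1."""
    if not audio:
        return Condition.SILENCE      # audio detection signal not generated
    if speech:
        return Condition.SPEECH       # speech recognition signal generated
    return Condition.NON_SPEECH       # audio present, no speech recognized


def gui_identifications(condition: Condition) -> Set[str]:
    """Hypothetical labels for the first, second, and third identifications."""
    if condition is Condition.SPEECH:
        return {"speaking"}
    if condition is Condition.NON_SPEECH:
        # Claim 1 applies both the non-speaking and the non-speech-audio marks.
        return {"non_speaking", "non_speech_audio"}
    return {"non_speaking"}
```

For example, a loud frame whose recognizer confidence is 0.1 gives `classify(True, False)`, which is `Condition.NON_SPEECH`, so the participant would be marked both as non-speaking and as a source of non-speech audio.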
Speech classification or search · CPC title
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
for discriminating voice from noise · CPC title
Conference systems · CPC title
Network arrangements for conference optimisation or adaptation · CPC title