What technology area does this patent fall under?

The primary CPC classification is H04N7/15 (Conference systems). Mapped technology areas include Electricity.

When was this patent published?

Publication date Mar 14, 2017 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Computer system employing speech recognition for detection of non-speech audio

US9595271B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9595271-B2
Application number	US-201313929375-A
Country	US
Kind code	B2
Filing date	Jun 27, 2013
Priority date	Jun 27, 2013
Publication date	Mar 14, 2017
Grant date	Mar 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer system executing a computer audio application such as video conferencing applies audio detection and speech recognition to an input audio stream to generate respective audio detection and speech recognition signals. A function is applied to the audio detection and speech recognition signals to generate a non-speech audio detection signal identifying presence of non-speech audio in the input audio stream when the audio detection signal is asserted and the speech recognition signal is not asserted. A control or indication action is performed in the computer system based on assertion of the non-speech audio detection signal.
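The combining function described in the abstract can be sketched as a simple Boolean expression. This is a minimal illustration, not the patented implementation: the names `non_speech_signal` and `attenuate`, and the `gain` parameter, are hypothetical and do not appear in the patent. The attenuation helper stands in for one possible "control action" taken when the signal is asserted.

```python
from typing import List, Sequence


def non_speech_signal(audio_detected: bool, speech_recognized: bool) -> bool:
    """Asserted when audio is present but no speech is recognized."""
    return audio_detected and not speech_recognized


def attenuate(samples: Sequence[float], non_speech: bool, gain: float = 0.0) -> List[float]:
    """Hypothetical control action: scale the samples while non-speech audio is flagged.

    A gain of 0.0 mutes the stream; a value between 0 and 1 reduces its amplitude.
    """
    if non_speech:
        return [s * gain for s in samples]
    return list(samples)


if __name__ == "__main__":
    # Print the full truth table of the combining function.
    for audio in (False, True):
        for speech in (False, True):
            print(audio, speech, non_speech_signal(audio, speech))
```

Only the combination "audio detected, speech not recognized" asserts the non-speech signal; the other three input combinations leave it deasserted.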

First claim

Opening claim text (preview).

What is claimed is:

1. A method of operating a video conferencing system having a conference server connected to conference clients by a network, the conference clients receiving audio streams from respective participants and having respective conference graphical user interfaces (GUIs), comprising: by the conference server, (1) receiving audio streams from the conference clients, mixing the audio streams together to generate a conference audio feed, and sending the conference audio feed to the conference clients, and (2) generating conference graphical content and sending the conference graphical content to the conference clients for local rendering in the respective conference GUIs; applying audio detection processing and speech recognition processing to an input audio stream of a participant to generate distinct audio detection and speech recognition signals, the audio detection signal being generated upon the audio detection processing indicating presence of audio in the audio stream in both a speech audio condition and a non-speech audio condition of the audio stream of the participant, the speech recognition signal being generated upon the speech recognition processing indicating presence of speech audio in the audio stream in the speech audio condition; processing the audio detection and speech recognition signals to identify distinct conditions of (1) a silence condition, (2) the speech audio condition, and (3) the non-speech audio condition of the audio stream of the participant, the silence condition being identified by the non-generation of the audio detection signal, the speech audio condition being identified by the generation of the speech recognition signal, and the non-speech audio condition being identified by the generation of the audio detection signal while the speech recognition signal is not generated; and operating the conference GUIs to reflect the silence, speech audio, and non-speech audio conditions of the audio stream of the participant, including (a) in the speech audio condition, providing a first graphical identification of the participant as a speaking participant, (b) in both the silence condition and the non-speech audio condition, providing a second graphical identification of the participant as a non-speaking participant, and (c) in the non-speech audio condition, providing a third graphical identification of the participant as generating non-speech audio in the audio stream.

2. A method according to claim 1, further including muting or reducing amplitude of the audio stream in the non-speech audio condition.

3. A method according to claim 1, wherein providing the graphical indications includes applying corresponding different treatments to respective camera viewing windows of the speakers and non-speakers, the treatments being selected from relative sizes, relative positions, and relative highlighting.

4. A method according to claim 1, wherein the applying and processing steps are performed at the conference server for the audio stream as received from the respective conference client, and wherein the conference server generates the first and second graphical indications and sends them to the respective conference client.

5. A method according to claim 1, wherein the applying and processing steps are performed at the conference client for the audio stream of the participant.

6. A method according to claim 5, wherein the discriminating between speech and non-speech audio at each of the conference clients generates respective discrimination results, and the discrimination results from all the conference clients are provided to the conference server to enable the conference server to identify the speaker and non-speakers and to return information regarding the speaker and non-speakers to the conference clients for controlling respective graphical user interfaces accordingly.

7. A method according to claim 1, wherein the speech recognition processing provides a speech output and a separate confidence output indicating a level of confidence in accuracy of the speech output, and wherein a condition of no speech being recognized is based on the confidence output indicating a level of confidence below a predetermined threshold.

8. A method according to claim 1, wherein the speech recognition processing is a secondary use of the speech recognition processing in the video conferencing system, and wherein the video conferencing system makes a distinct primary use of the speech recognition processing for obtaining speech content.

9. A method according to claim 8, wherein the primary use includes making a transcription of a speech-carrying session.

10. A method according to claim 1, wherein the audio detection processing is done using level detection by measuring an amplitude of an audio signal in the audio stream and comparing the measured amplitude against an amplitude threshold.

11. A method according to claim 1, wherein there is a split of audio processing for different ones of the conference clients, the conference server performing the audio detection processing for lower-performance conference clients, and higher-performance conference clients performing their own audio detection processing.

12. A method according to claim 11, wherein conference clients having poor network performance perform their own audio detection processing to avoid reduced speech recognition accuracy affected by sending audio samples to the conference server via the poor-performance network.

13. A non-transitory computer-readable medium storing computer program instructions, the instructions being executable by a video conferencing system having a conference server connected to conference clients by a network, the conference clients receiving audio streams from respective participants and having respective conference graphical user interfaces (GUIs), the execution of the instructions causing the video conferencing system to perform a method including: by the conference server, (1) receiving audio streams from the conference clients, mixing the audio streams together to generate a conference audio feed, and sending the conference audio feed to the conference clients, and (2) generating conference graphical content and sending the conference graphical content to the conference clients for local rendering in the respective conference GUIs; applying separate audio detection processing and speech recognition processing to an input audio stream of a participant to generate distinct audio detection and speech recognition signals, the audio detection signal being generated upon the audio detection processing indicating presence of audio in the audio stream in both a speech audio condition and a non-speech audio condition of the audio stream of the participant, the speech recognition signal being generated upon the speech recognition processing indicating presence of speech audio in the audio stream in the speech audio condition; processing the audio detection and speech recognition signals to identify distinct conditions of (1) a silence condition, (2) the speech audio condition, and (3) the non-speech audio condition of the audio stream of the participant, the silence condition being identified by the non-generation of the audio detection signal, the speech audio condition being identified by the generation of the speech recognition signal, and the non-speech audio condition being identified by the generation of the audio detection signal while the speech recognition signal is not generated; and operating the conference GUIs to reflect the silence, speech audio, and non-speech audio conditions of the audio stream of the participant, including (a) in the speech audio
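The three-way classification in claim 1, together with the level detection of claim 10 and the confidence threshold of claim 7, might be sketched as follows. This is an illustrative reading of the claim language rather than the patentee's implementation. The threshold values, the use of peak amplitude as the measured level, the order of the checks, and all identifiers (`Condition`, `classify`, `gui_indications`, and the indication labels) are assumptions made for the example.

```python
from enum import Enum
from typing import Sequence, Set


class Condition(Enum):
    SILENCE = "silence"
    SPEECH = "speech"
    NON_SPEECH = "non-speech"


# Hypothetical values; the claims only refer to "a threshold".
AMPLITUDE_THRESHOLD = 0.05
CONFIDENCE_THRESHOLD = 0.6


def audio_detected(samples: Sequence[float]) -> bool:
    """Level detection: compare a measured amplitude (here, the peak) to a threshold."""
    peak = max((abs(s) for s in samples), default=0.0)
    return peak > AMPLITUDE_THRESHOLD


def speech_recognized(confidence: float) -> bool:
    """Treat recognizer confidence below the threshold as 'no speech recognized'."""
    return confidence >= CONFIDENCE_THRESHOLD


def classify(samples: Sequence[float], confidence: float) -> Condition:
    """Map the two signals onto the silence, speech, and non-speech conditions."""
    if not audio_detected(samples):
        return Condition.SILENCE      # audio detection signal not generated
    if speech_recognized(confidence):
        return Condition.SPEECH       # speech recognition signal generated
    return Condition.NON_SPEECH       # audio present, speech not recognized


def gui_indications(condition: Condition) -> Set[str]:
    """Return the graphical identifications for a participant (labels are made up)."""
    if condition is Condition.SPEECH:
        return {"speaking"}                    # first identification
    indications = {"non-speaking"}             # second identification
    if condition is Condition.NON_SPEECH:
        indications.add("non-speech-audio")    # third identification
    return indications


if __name__ == "__main__":
    frame = [0.30, -0.42, 0.10]  # made-up samples, e.g. keyboard noise
    condition = classify(frame, confidence=0.1)
    print(condition, gui_indications(condition))
```

In the non-speech condition the participant receives both the non-speaking identification and the additional non-speech-audio identification, matching parts (b) and (c) of claim 1.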

Assignees

Citrix Systems Inc, Getgo Inc

Inventors

Classifications

G10L15/08
Speech classification or search · CPC title
G10L25/78
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
G10L25/84
for discriminating voice from noise · CPC title
H04N7/15 · Primary
Conference systems · CPC title
H04L12/1827
Network arrangements for conference optimisation or adaptation · CPC title

Patent family

Related publications grouped by family.

View patent family 51210847

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9595271B2 cover?: A computer system executing a computer audio application such as video conferencing applies audio detection and speech recognition to an input audio stream to generate respective audio detection and speech recognition signals. A function is applied to the audio detection and speech recognition signals to generate a non-speech audio detection signal identifying presence of non-speech audio in th…
Who is the assignee on this patent?: Citrix Systems Inc, Getgo Inc