Distributed endpointing for speech recognition

US9818407B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9818407-B1
Application number  US-201313761812-A
Country             US
Kind code           B1
Filing date         Feb 7, 2013
Priority date       Feb 7, 2013
Publication date    Nov 14, 2017
Grant date          Nov 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An efficient audio streaming method and apparatus includes a client process implemented on a client or local device and a server process implemented on a remote server or server(s). The client process and server process each have speech recognition components and communicate over a network, and together efficiently manage the detection of speech in an audio signal streamed by the local device to the server for speech recognition and potentially further processing at the server. The client process monitors audio input and in a first detection stage, implements endpointing on the local device to determine when speech is detected. The client process may further determine if a “wakeword” is detected, and then the client process opens a connection and begins streaming audio to the server process via the network. The server process receives the speech audio stream and monitors the audio, implementing endpointing in the server process, to determine when to tell the client process to close the connection and stop streaming audio. The client process continues streaming audio to the server until the server process determines disconnect criteria have been met and tells the client process to stop streaming audio.
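
The two-stage handshake described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented implementation: the class names `Client` and `Server`, the symbolic frames (`"WAKE"`, `"sil"`, ordinary words), and the disconnect criterion of three consecutive silent frames are all hypothetical choices made for illustration. For brevity, the server's wakeword confirmation is modelled as the return value of opening the connection.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ClientState(Enum):
    LISTENING = auto()   # monitoring the microphone locally, nothing is sent
    STREAMING = auto()   # connection open, frames are forwarded to the server


@dataclass
class Server:
    """Hypothetical server process: confirms the wakeword, then endpoints."""
    max_trailing_silence: int = 3   # assumed disconnect criterion, in frames
    silence_run: int = 0
    received: list = field(default_factory=list)

    def on_open(self, wakeword_frame: str) -> bool:
        # Second-stage check of the wakeword the client believes it heard.
        self.silence_run = 0
        self.received = [wakeword_frame]
        return wakeword_frame == "WAKE"

    def on_frame(self, frame: str) -> bool:
        """Store the frame; return True when the client should stop streaming."""
        self.received.append(frame)
        self.silence_run = self.silence_run + 1 if frame == "sil" else 0
        return self.silence_run >= self.max_trailing_silence


@dataclass
class Client:
    """Hypothetical client process running on the local device."""
    server: Server
    state: ClientState = ClientState.LISTENING
    sent: int = 0

    def on_frame(self, frame: str) -> None:
        if self.state is ClientState.LISTENING:
            # First stage: local detection decides whether to open a connection.
            if frame == "WAKE" and self.server.on_open(frame):
                self.state = ClientState.STREAMING
                self.sent += 1
            return
        # Streaming: forward every frame until the server says to stop.
        self.sent += 1
        if self.server.on_frame(frame):
            # Stop transmitting, but keep monitoring the microphone.
            self.state = ClientState.LISTENING


frames = ["sil", "noise", "WAKE", "play", "music", "sil", "sil", "sil",
          "sil", "WAKE", "stop", "sil", "sil", "sil"]
server = Server()
client = Client(server)
for f in frames:
    client.on_frame(f)
```

In this sketch the decision to stop belongs to the server alone: the client keeps forwarding frames, including silent ones, until the server reports that its disconnect criterion is met, and then returns to local listening so that a later wakeword can start a new stream.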

First claim

Opening claim text (preview).

What is claimed is:

1. A system for performing distributed speech recognition, the system comprising: a local device comprising at least one processor coupled to a memory, the memory including instructions operable to be executed by the processor to perform a set of actions, configuring the processor: to receive audio using at least one microphone; to monitor audio data corresponding to the audio to detect voice activity in the audio data, to determine that the audio data comprises a wakeword, to begin transmission of the audio data to a server device in response to determining the audio data comprises the wakeword, to receive, from the server device, a confirmation that the audio data includes the wakeword, to receive, from the server device, an indication to stop the transmission of the audio data; to stop the transmission of the audio data in response to receiving the indication; and to continue to receive further audio using the at least one microphone following receipt of the indication; the server device comprising at least one processor coupled to a memory, the memory including instructions operable to be executed by the processor to perform a set of actions, configuring the processor: to begin receiving the audio data, to confirm the wakeword in the audio data, to transmit the confirmation to the local device, to determine an end of the voice activity in the audio data, and to transmit the indication to the local device in response to determining the end of the voice activity.

2. The system of claim 1, wherein the processor of the local device is further configured to determine that the audio data includes the wakeword using Hidden Markov Model (HMM) techniques.

3. The system of claim 1, wherein the local device processor configured to monitor the audio data to detect the voice activity in the audio data comprises the local device processor further configured to detect the voice activity by evaluating quantitative aspects of the audio data selected from a group consisting of: spectral slope between one or more frames of the audio data, energy levels of the audio data in one or more spectral bands, and signal-to-noise ratios of the audio data in one or more spectral bands.

4. A computer-implemented method, comprising: receiving, by a local device, audio using at least one microphone; monitoring, by the local device, audio data corresponding to the audio to detect voice activity in the audio data; determining, by the local device, that the audio data comprises a wakeword; starting, by the local device, transmission of the audio data to a remote device in response to determining the audio data comprises the wakeword; receiving, by the local device, a confirmation from the remote device that the transmitted audio data includes the wakeword; receiving, by the local device, an indication to stop the transmission of the audio data, from the remote device, in response to the remote device determining an end of the voice activity in the audio data; stopping, by the local device, the transmission of the audio data in response to receiving the indication; continuing to receive further audio using the at least one microphone following receipt of the indication.

5. The method of claim 4, wherein monitoring, by the local device, the audio data comprises determining a likelihood that the voice activity is present in the audio data by evaluating quantitative aspects of the audio data selected from a group consisting of: spectral slope between one or more frames of the audio data, energy levels of the audio data in one or more spectral bands, and signal-to-noise ratios of the audio data in one or more spectral bands.

6. A computing device, comprising: a processor; a memory device including instructions operable to be executed by the processor to perform a set of actions, configuring the processor: to receive audio using at least one microphone; to monitor audio data corresponding to the audio to detect voice activity in the audio data; to determine that the audio data comprises a wakeword; to start transmission of the audio data to a remote device in response to determining the audio data comprises the wakeword; to receive, from the remote device, a confirmation that the transmitted audio data includes the wakeword; to receive an indication to stop the transmission of the audio data, from the remote device, in response to the remote device determining an end of the voice activity in the audio data; to stop the transmission of the audio data in response to receiving the indication; and to continue to receive further audio using the at least one microphone following receipt of the indication.

7. The computing device of claim 6, wherein the processor configured to monitor the audio data to detect the voice activity in the audio data comprises the processor configured to determine a likelihood that speech is present in the audio data.

8. The computing device of claim 6, wherein the processor is further configured to receive, from the remote device, speech recognition results based on the voice activity.

9. The computing device of claim 6, wherein the processor is further configured to stop the transmission upon expiration of a length of time.

10. The computing device of claim 6, wherein the processor is further configured to detect the voice activity in the audio data by evaluating quantitative aspects of the audio data selected from a group consisting of: spectral slope between one or more frames of the audio data, energy levels of the audio data in one or more spectral bands, and signal-to-noise ratios of the audio data in one or more spectral bands.
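
Claims 3, 5 and 10 list band energy and band signal-to-noise ratio among the quantities that may be evaluated to detect voice activity. The sketch below shows one way such a check could look; it is an illustration under assumptions, not the method disclosed in the patent. The function names, the 300 to 3000 Hz band, the 10 dB threshold, and the use of a plain discrete Fourier transform are all hypothetical choices, and spectral slope is not covered.

```python
import cmath
import math


def band_energy(frame, sample_rate, lo_hz, hi_hz):
    """Energy of `frame` in the DFT bins whose frequency lies in [lo_hz, hi_hz]."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if lo_hz <= freq <= hi_hz:
            # Naive DFT coefficient for bin k (fine for short frames).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(frame))
            total += abs(coeff) ** 2
    return total / n


def band_snr_db(frame, noise_energy, sample_rate, lo_hz, hi_hz):
    """Band energy of the frame relative to a noise-floor estimate, in dB."""
    energy = band_energy(frame, sample_rate, lo_hz, hi_hz)
    # Small constant avoids log(0) for silent frames.
    return 10.0 * math.log10((energy + 1e-12) / (noise_energy + 1e-12))


def is_voice(frame, noise_energy, sample_rate=8000,
             lo_hz=300.0, hi_hz=3000.0, threshold_db=10.0):
    """Assumed decision rule: in-band SNR at or above a fixed threshold."""
    return band_snr_db(frame, noise_energy, sample_rate, lo_hz, hi_hz) >= threshold_db


def tone(amplitude, freq_hz, n=80, sample_rate=8000):
    """Synthetic test signal: a pure sine tone of n samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]


quiet = tone(0.01, 1000)          # stands in for background noise
loud = tone(0.5, 1000)            # stands in for in-band speech energy
out_of_band = tone(0.5, 3500)     # loud, but outside the monitored band
noise_floor = band_energy(quiet, 8000, 300.0, 3000.0)
```

With these synthetic tones, `is_voice(loud, noise_floor)` is true, while the quiet frame and the out-of-band tone are both rejected. A real detector would track the noise floor over time and would usually combine several bands and features rather than rely on one threshold.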

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Hidden Markov Models [HMMs] · CPC title

  • Detection of discrete points within a voice signal · CPC title

  • the extracted parameters being the cepstrum · CPC title

  • Segmentation; Word boundary detection · CPC title

  • G10L25/78 · Primary

    Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9818407B1 cover?
An efficient audio streaming method and apparatus includes a client process implemented on a client or local device and a server process implemented on a remote server or server(s). The client process and server process each have speech recognition components and communicate over a network, and together efficiently manage the detection of speech in an audio signal streamed by the local device t…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L25/78. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 14, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).