Methods and apparatus for speech segmentation using multiple metadata

US10229686B2 · US · B2

Patent metadata
Publication number: US-10229686-B2
Application number: US-201415329354-A
Country: US
Kind code: B2
Filing date: Aug 18, 2014
Priority date: Aug 18, 2014
Publication date: Mar 12, 2019
Grant date: Mar 12, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and apparatus to process microphone signals by a speech enhancement module to generate an audio stream signal including first and second metadata for use by a speech recognition module. In an embodiment, speech recognition is performed using endpointing information including transitioning from a silence state to a maybe speech state, in which data is buffered, based on the first metadata and transitioning to a speech state, in which speech recognition is performed, based upon the second metadata.
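
The state transitions described in the abstract can be sketched as a small state machine. This is only an illustrative sketch, not the patented implementation: the class and function names (`Endpointer`, `run_endpointer`), the boolean form of the two metadata flags, and the rules for returning to the silence state are assumptions that the abstract does not specify.

```python
from enum import Enum


class State(Enum):
    SILENCE = "silence"
    MAYBE_SPEECH = "maybe speech"
    SPEECH = "speech"


class Endpointer:
    """Hypothetical endpointing state machine driven by two metadata flags."""

    def __init__(self):
        self.state = State.SILENCE
        self.buffer = []      # frames held while speech is only suspected
        self.recognized = []  # frames passed on to a stand-in recognizer

    def step(self, frame, first_flag, second_flag):
        if self.state is State.SILENCE:
            if first_flag:
                # First metadata: start buffering from this frame (the endpoint).
                self.state = State.MAYBE_SPEECH
                self.buffer = [frame]
        elif self.state is State.MAYBE_SPEECH:
            self.buffer.append(frame)
            if second_flag:
                # Second metadata: recognize the buffered audio from the endpoint.
                self.state = State.SPEECH
                self.recognized.extend(self.buffer)
                self.buffer = []
            elif not first_flag:
                # Assumption: treat this as a false alarm and discard the buffer.
                self.state = State.SILENCE
                self.buffer = []
        else:  # State.SPEECH
            if first_flag or second_flag:
                self.recognized.append(frame)
            else:
                # Assumption: both flags cleared means the utterance has ended.
                self.state = State.SILENCE
        return self.state


def run_endpointer(frames, first_flags, second_flags):
    """Feed parallel sequences of frames and flags through a fresh Endpointer."""
    ep = Endpointer()
    states = [ep.step(f, a, b) for f, a, b in zip(frames, first_flags, second_flags)]
    return ep, states


if __name__ == "__main__":
    ep, states = run_endpointer(
        frames=[0, 1, 2, 3, 4, 5],
        first_flags=[False, True, True, True, True, False],
        second_flags=[False, False, False, True, True, False],
    )
    print([s.value for s in states])
    print(ep.recognized)
```

Buffering in the maybe speech state means that the onset frames seen before the second metadata confirms speech are still available to the recognizer, which is the point of splitting detection into two stages.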

First claim

Claim text (claims 1 to 21). Claim 1 is the first independent claim.

The invention claimed is:

1. A method of performing automated speech recognition (ASR) in a system having a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the method comprising: by the speech enhancement module, processing microphone signals to generate the audio stream signal; by a first speech detector having a first response latency, generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level; by a second speech detector having a second response latency that is higher than the first response latency, generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level; by the ASR module based on the first metadata, initiating buffering of the audio stream signal from an endpoint; and by the ASR module based on the second metadata, initiating speech recognition on the buffered audio stream signal from the endpoint.

2. The method according to claim 1, wherein the first metadata has a frame-by-frame time scale.

3. The method according to claim 1, wherein the second metadata has a sequence of frames time scale.

4. The method according to claim 1, further including performing one or more of barge-in, beamforming, and/or echo cancellation for generating the first and/or second metadata.

5. The method according to claim 1, further including tuning a speech detection threshold for a given latency for the first metadata.

6. The method according to claim 1, further including adjusting latency for a given confidence level of voice activity detection for the second metadata.

7. The method according to claim 1, further including controlling computation of the second metadata using the first metadata or computation of the first metadata using the second metadata.

8. The method according to claim 1, further including performing one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata.

9. The method according to claim 1, wherein at least one of the first and second metadata is encoded into the audio signal.

10. An article, comprising a non-transitory computer readable medium having stored instructions that when executed perform a method of automated speech recognition (ASR) in a system having a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the method comprising: by the speech enhancement module, processing microphone signals to generate the audio stream signal; by a first speech detector having a first response latency, generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level; by a second speech detector having a second response latency that is higher than the first response latency, generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level; by the ASR module based on the first metadata, initiating buffering of the audio stream signal from an endpoint; and by the ASR module based on the second metadata, initiating speech recognition on the buffered audio stream signal from the endpoint.

11. The article according to claim 10, wherein the first metadata has a frame-by-frame time scale.

12. The article according to claim 10, wherein the second metadata has a sequence of frames time scale.

13. The article according to claim 10, further including instructions to perform one or more of barge-in, beamforming, and/or echo cancellation for generating the first and second metadata.

14. The article according to claim 10, further including instructions to tune speech detector parameters for a given latency for the first metadata.

15. The article according to claim 10, further including instructions to adjust latency for a given confidence level of voice activity detection for the second metadata.

16. The article according to claim 10, further including instructions to control computation of the second metadata using the first metadata or computation of the first metadata using the second metadata.

17. The article according to claim 10, further including instructions to perform one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata.

18. A system for performing automated speech recognition (ASR) comprising a speech enhancement module for generating an audio stream signal and metadata, coupled to an ASR module for performing speech recognition on the audio stream signal using the metadata, the system further comprising: in the speech enhancement module, electronic circuitry configured to provide: a first speech detector having a first response latency for generating first metadata that indicate the possible presence of speech in the audio stream signal with a first confidence level; and a second speech detector having a second response latency that is higher than the first response latency for generating second metadata that indicate the possible presence of speech in the audio stream signal with a second confidence level that is higher than the first confidence level; and in the ASR module, electronic circuitry configured to provide: an endpointing module for initiating, based on the first metadata, buffering of the audio stream signal from an endpoint, and for initiating, based on the second metadata, speech recognition on the buffered audio stream signal from the endpoint.

19. The system according to claim 18, further including a further speech detector to perform one or more of barge-in, beamforming, and/or echo cancellation for generating further metadata for use by the endpointing module.

20. The system according to claim 18, wherein the first speech detector is further configured to tune detector parameters for a given latency for the first metadata.

21. The system according to claim 18, wherein the second speech detector is further configured to adjust latency for a given confidence level of voice activity detection using the second metadata.
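
The latency and confidence trade-off between the two detectors (claims 1 to 3) can be illustrated with a toy example. The claims do not say how either detector works, so everything here is an assumption made for illustration: the per-frame energy input, the `threshold` and `window` parameters, and the function names `first_detector` and `second_detector` are all hypothetical.

```python
def first_detector(energies, threshold=0.1):
    """Frame-by-frame flag: reacts immediately, so single noisy frames trigger it."""
    return [e > threshold for e in energies]


def second_detector(energies, threshold=0.1, window=3):
    """Sequence-of-frames flag: set only after `window` consecutive loud frames.

    Waiting for a run of frames adds latency but filters out isolated spikes,
    which stands in for the higher confidence level of the second metadata.
    """
    flags = []
    run = 0  # length of the current run of frames above the threshold
    for e in energies:
        run = run + 1 if e > threshold else 0
        flags.append(run >= window)
    return flags


if __name__ == "__main__":
    # Hypothetical frame energies: one isolated spike, then a sustained burst.
    energies = [0.0, 0.5, 0.0, 0.5, 0.6, 0.7, 0.8, 0.0]
    print(first_detector(energies))
    print(second_detector(energies))
```

In this sketch the first detector fires on the isolated spike as well as on the burst onset, while the second detector ignores the spike and fires only a couple of frames into the burst. Buffering from the earlier endpoint is what lets recognition start from the true onset despite the later confirmation.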

Assignees

Nuance Communications Inc

Inventors

Classifications

  • G10L25/78 (Primary)

    Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • Speech enhancement, e.g. noise reduction or echo cancellation (reducing echo effects in line transmission systems H04B3/20; echo suppression in hands-free telephones H04M9/08) · CPC title

  • G10L15/28 (Primary)

    Constructional details of speech recognition systems · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10229686B2 cover?
Methods and apparatus to process microphone signals by a speech enhancement module to generate an audio stream signal including first and second metadata for use by a speech recognition module. In an embodiment, speech recognition is performed using endpointing information including transitioning from a silence state to a maybe speech state, in which data is buffered, based on the first metadat…
Who is the assignee on this patent?
Nuance Communications Inc
What technology area does this patent fall under?
Primary CPC classification G10L25/78. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 12, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).