Mixed speech recognition

US9390712B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9390712-B2
Application number: US-201414223468-A
Country: US
Kind code: B2
Filing date: Mar 24, 2014
Priority date: Mar 24, 2014
Publication date: Jul 12, 2016
Grant date: Jul 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a lower level of the speech characteristic from the mixed speech sample. Additionally, the method includes decoding the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals considering the probability that a specific frame is a switching point of the speech characteristic.
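
As a rough illustration of the "higher level" and "lower level" of a speech characteristic, the sketch below uses per-frame instantaneous energy (one of the characteristics named in the claims) to decide, frame by frame, which of two training speakers is the louder one. It assumes simulated training mixtures for which each speaker's clean signal and frame-level labels are available. The function names, the 400-sample frame, and the 160-sample hop are hypothetical choices made for illustration and are not taken from the patent.

```python
import numpy as np


def frame_energy(signal, frame_len=400, hop=160):
    """Per-frame instantaneous energy: sum of squared samples in each frame.

    frame_len and hop are illustrative values (assumption), not from the patent.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([
        np.sum(signal[i * hop: i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])


def split_targets_by_energy(energy_a, energy_b, labels_a, labels_b):
    """Build per-frame training targets for the two networks.

    The first network is trained on the labels of whichever speaker has the
    higher energy in a frame, the second on the labels of the other speaker.
    Also returns a boolean mask marking frames where the louder speaker
    changes (a candidate target for a switch-predicting third network).
    """
    a_is_louder = np.asarray(energy_a) >= np.asarray(energy_b)
    high_targets = np.where(a_is_louder, labels_a, labels_b)
    low_targets = np.where(a_is_louder, labels_b, labels_a)
    switch = np.zeros(len(a_is_louder), dtype=bool)
    switch[1:] = a_is_louder[1:] != a_is_louder[:-1]
    return high_targets, low_targets, switch
```

The design point this is meant to show is that the targets are tied to the louder or quieter role in each frame rather than to a fixed speaker identity, which is why a later decoding step has to account for frames where the roles swap.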

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by a computer processor for recognizing mixed speech from a source, comprising: training a first neural network to recognize a speech signal spoken by a speaker with a higher level of a speech characteristic from a mixed speech sample; training a second neural network to recognize a speech signal spoken by a speaker with a lower level of the speech characteristic from the mixed speech sample, wherein the lower level is lower than the higher level; and decoding the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals.

2. The method of claim 1, comprising decoding by considering a probability that a specific frame is a switching point of the speakers.

3. The method of claim 2, comprising compensating for the switching point occurring in a decoding process based on the switching probability estimated from another neural network.

4. The method of claim 1, the mixed speech sample comprising a single audio channel, the single audio channel being generated by a microphone.

5. The method of claim 1, the speech characteristic comprising one of: instantaneous energy in a frame of the mixed speech sample; energy; and pitch.

6. The method of claim 1, comprising: training a third neural network to predict speech characteristic switching; predicting whether energy is switching from one frame to a next frame; and decoding the mixed speech sample based on the prediction.

7. The method of claim 6, comprising weighting against the likelihood of energy switching in a frame subsequent to a frame where energy switching is predicted.

8. A system for recognizing mixed speech from a source, the system comprising: a first neural network comprising a first plurality of interconnected systems; and a second neural network comprising a second plurality of interconnected systems, each interconnected system, comprising: a processing unit; and a system memory, wherein the system memory comprises code configured to direct the processing unit to: train the first neural network to recognize a higher level of a speech characteristic in a first speech signal from a mixed speech sample; train the second neural network to recognize a lower level of the speech characteristic in a second speech signal from the mixed speech sample, wherein the lower level is lower than the higher level; and decode the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals.

9. The system of claim 8, comprising code configured to decode the mixed speech sample by considering a probability that a specific frame is a switching point of the speech characteristic.

10. The system of claim 8, comprising code configured to direct the processing unit to compensate for the switching point occurring in a decoding process based on the probability estimated from a neural network.

11. The system of claim 8, the first neural network and the second neural network comprising deep neural networks.

12. The system of claim 8, the speech characteristic comprising a selected one of pitch, energy, and instantaneous energy, in a frame of the mixed speech sample.

13. The system of claim 8, comprising code configured to direct the processing unit to: train a third neural network to predict energy switching; predict whether energy is switching from one frame to a next frame; and decode the mixed speech sample based on the prediction.

14. The system of claim 13, comprising weighting against the likelihood of energy switching in a frame subsequent to a frame where energy switching is predicted.

15. One or more computer-readable storage memory devices for storing computer-readable instructions, the computer-readable instructions when executed by one or more processing devices, the computer-readable instructions comprising code configured to: train a first neural network to recognize a higher level of a speech characteristic in a first speech signal from a mixed speech sample comprising a single audio channel; train a second neural network to recognize a lower level of the speech characteristic in a second speech signal from the mixed speech sample; train a third neural network to estimate a switching probability for each frame; and decode the mixed speech sample with the first neural network, the second neural network, and the third neural network by optimizing the joint likelihood of observing the two speech signals, the joint likelihood meaning a probability that a specific frame is a switching point of the speech characteristic.

16. The computer-readable storage memory devices of claim 15, comprising code configured to decode the mixed speech sample by considering a probability that a specific frame is a switching point of the speech characteristic.

17. The computer-readable storage memory devices of claim 15, comprising code configured to compensate for the switching point occurring in a decoding process based on the joint likelihood.

18. The computer-readable storage memory devices of claim 15, wherein the speech characteristic is a selected one of energy, pitch, and instantaneous energy in a frame of the mixed speech sample.

19. The computer-readable storage memory devices of claim 15, wherein the speech characteristic is instantaneous energy in a frame of the mixed speech sample.

20. The computer-readable storage memory devices of claim 15, comprising code configured to: train a third neural network to predict energy switching; predict whether energy is switching from one frame to a next frame; and decode the mixed speech sample based on the prediction.
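
The role of the per-frame switching probability in decoding can be sketched with a toy two-state Viterbi search. This is a simplified stand-in and not the decoder described in the patent: it assumes a hypothetical array `assign_scores[t, z]` holding a joint log-likelihood for frame `t` under the hypothesis that speaker `z` is the higher-energy speaker, and a hypothetical array `switch_prob[t]` of the kind a third network might estimate. All names are illustrative.

```python
import numpy as np


def decode_assignment(assign_scores, switch_prob, eps=1e-8):
    """Most likely sequence of 'which speaker is louder' over all frames.

    assign_scores: (n_frames, 2) joint log-likelihoods (hypothetical input).
    switch_prob:   (n_frames,) probability that the louder speaker changes
                   between frame t-1 and frame t (hypothetical input).
    """
    assign_scores = np.asarray(assign_scores, dtype=float)
    switch_prob = np.asarray(switch_prob, dtype=float)
    n_frames = assign_scores.shape[0]
    log_switch = np.log(np.clip(switch_prob, eps, 1.0 - eps))
    log_stay = np.log(np.clip(1.0 - switch_prob, eps, 1.0 - eps))

    best = np.zeros((n_frames, 2))          # best path score ending in state z
    back = np.zeros((n_frames, 2), dtype=int)  # predecessor state on that path
    best[0] = assign_scores[0]
    for t in range(1, n_frames):
        for z in (0, 1):
            stay = best[t - 1, z] + log_stay[t]
            swap = best[t - 1, 1 - z] + log_switch[t]
            if stay >= swap:
                best[t, z], back[t, z] = stay, z
            else:
                best[t, z], back[t, z] = swap, 1 - z
            best[t, z] += assign_scores[t, z]

    # Trace back from the best final state.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(best[-1]))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

For example, with scores that favor speaker 0 in the first two frames and speaker 1 in the last two, and a high switching probability only at the third frame, the search places the role swap at that frame:

```python
scores = [[0, -5], [0, -5], [-5, 0], [-5, 0]]
print(decode_assignment(scores, [0.01, 0.01, 0.9, 0.01]))  # [0 0 1 1]
```

The extra weighting against a second switch immediately after a predicted one, as recited in claims 7 and 14, is not modeled here.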

Assignees

Microsoft Corp, Microsoft Technology Licensing LLC

Inventors

Classifications

  • the extracted parameters being power information · CPC title

  • G10L15/063 · Primary

    Training · CPC title

  • Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

  • Pitch determination of speech signals · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9390712B2 cover?
The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a…
Who is the assignee on this patent?
Microsoft Corp, Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 12, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).