Mixed speech recognition

US9779727B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9779727-B2
Application number	US-201615395640-A
Country	US
Kind code	B2
Filing date	Dec 30, 2016
Priority date	Mar 24, 2014
Publication date	Oct 3, 2017
Grant date	Oct 3, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a lower level of the speech characteristic from the mixed speech sample. Additionally, the method includes decoding the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals considering the probability that a specific frame is a switching point of the speech characteristic.
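The claims name signal energy as one possible speech characteristic, with training data prepared by energy normalization and scaling. As a rough illustration of what "higher level" and "lower level" could mean in that case, here is a minimal sketch that normalizes two signals and mixes them at a chosen signal-to-noise ratio. The function names (`energy`, `normalize_energy`, `mix_at_snr`), the synthetic Gaussian "speech", and the 6 dB value are all assumptions made for this example and are not taken from the patent.

```python
import math
import random


def energy(signal):
    """Average power (mean squared amplitude) of a signal."""
    return sum(x * x for x in signal) / len(signal)


def normalize_energy(signal, target=1.0):
    """Scale a signal so that its average power equals `target`."""
    e = energy(signal)
    if e == 0.0:
        return list(signal)
    gain = math.sqrt(target / e)
    return [x * gain for x in signal]


def mix_at_snr(target_sig, interferer_sig, snr_db):
    """Mix two equal-length signals with the target `snr_db` dB above the interferer.

    Returns the mixture plus the two scaled components, so a caller can
    tell which component is the higher-energy one.
    """
    t = normalize_energy(target_sig)
    i = normalize_energy(interferer_sig)
    # A power ratio of 10^(snr/10) is an amplitude gain of 10^(-snr/20) on the interferer.
    gain = 10.0 ** (-snr_db / 20.0)
    i = [x * gain for x in i]
    mixed = [a + b for a, b in zip(t, i)]
    return mixed, t, i


# Hypothetical stand-ins for two talkers (random noise, not real speech).
rng = random.Random(0)
speaker_a = [rng.gauss(0.0, 0.3) for _ in range(1600)]
speaker_b = [rng.gauss(0.0, 0.7) for _ in range(1600)]

mixed, a_part, b_part = mix_at_snr(speaker_a, speaker_b, snr_db=6.0)
measured_snr_db = 10.0 * math.log10(energy(a_part) / energy(b_part))
print(f"measured SNR: {measured_snr_db:.2f} dB")
```

With a positive SNR, speaker A is the higher-energy talker in the mixture. Passing a negative `snr_db` makes the same speaker the lower-energy one, which is the case the claims associate with the second network.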

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by a computer processor for recognizing mixed speech from a source, comprising: training a first neural network to recognize a speech signal spoken by a speaker with a higher level of a speech characteristic from a mixed speech sample; training a second neural network to recognize a speech signal spoken by a speaker with a lower level of the speech characteristic from the mixed speech sample, wherein the lower level is lower than the higher level; and joint decoding the mixed speech sample with the first neural network and the second neural network using a dataset for high amplitude characteristic data and a dataset for low amplitude characteristic data, wherein joint decoding is performed by finding two state sequences in a two-dimensional joint space of the mixed speech sample such that a sum of a state sequence log likelihood of the two state sequences is maximized.

2. The method of claim 1, wherein the speech characteristic is an energy of the speech signal.

3. The method of claim 2, wherein training the first neural network comprises training the first neural network to identify the speech signal with a positive signal to noise ratio (SNR).

4. The method of claim 2, wherein training the second neural network comprises training the second neural network to identify the speech signal with a negative SNR.

5. The method of claim 3, wherein training the second neural network comprises training the second neural network to identify the speech signal with a negative SNR.

6. The method of claim 2, wherein the first neural network and the second neural network are trained by: performing energy normalization on a training data set; adding random samples into the energy normalized training data set; and scaling an energy of a portion of the random samples to have higher energy than a remaining portion of the random samples.

7. The method of claim 2, wherein the first neural network and the second neural network are trained by: performing energy normalization on a training data set; adding random samples into the energy normalized training data set; and scaling an energy of a portion of the random samples to have lower energy than a remaining portion of the random samples.

8. The method of claim 1, wherein the speech characteristic is a pitch of the speech signal.

9. The method of claim 1, wherein the speech characteristic is an instantaneous energy of the speech signal.

10. A system for recognizing mixed speech from a source, the system comprising: a first neural network comprising a first plurality of interconnected systems; and a second neural network comprising a second plurality of interconnected systems, each interconnected system, comprising: a processing unit; and a system memory, wherein the system memory comprises code configured to direct the processing unit to: train a first neural network to recognize a speech signal spoken by a speaker with a higher level of a speech characteristic from a mixed speech sample; train a second neural network to recognize a speech signal spoken by a speaker with a lower level of the speech characteristic from the mixed speech sample, wherein the lower level is lower than the higher level; and joint decode the mixed speech sample with the first neural network and the second neural network using a dataset for high amplitude characteristic data and a dataset for low amplitude characteristic data, wherein joint decoding is performed by finding two state sequences in a two-dimensional joint space of the mixed speech sample such that a sum of a state sequence log likelihood of the two state sequences is maximized.

11. The system of claim 10, wherein the speech characteristic is an energy of the speech signal.

12. The system of claim 11, wherein the first neural network is trained to identify the speech signal with a positive signal to noise ratio (SNR), and wherein the second neural network is trained to identify the speech signal with a negative SNR.

13. The system of claim 11, wherein the first neural network and the second neural network are trained by: performing energy normalization on a training data set; adding random samples into the energy normalized training data set; scaling an energy of a portion of the random samples to have higher energy than a remaining portion of the random samples.

14. The system of claim 11, wherein the first neural network and the second neural network are trained by: performing energy normalization on a training data set; adding random samples into the energy normalized training data set; scaling an energy of a portion of the random samples to have lower energy than a remaining portion of the random samples.

15. The system of claim 10, wherein the speech characteristic is a pitch of the speech signal.

16. The system of claim 15, wherein the first neural network and the second neural network are trained by using a pitch estimate, for target speech signals and interfering speech signals, to select labels for training.

17. The system of claim 10, wherein the speech characteristic is an instantaneous energy of the speech signal.

18. The system of claim 16, wherein the code is configured to direct the processing unit to generate a training set by: mixing the target speech signals and the interfering speech signals; determining instantaneous frame energies of the target speech signals; and determining instantaneous frame energies of the interfering speech signals.

19. One or more computer-readable storage memory devices for storing computer-readable instructions, the computer-readable instructions when executed by one or more processing devices, the computer-readable instructions comprising code configured to: train a first neural network to recognize a speech signal spoken by a speaker with a higher level of a speech characteristic from a mixed speech sample; train a second neural network to recognize a speech signal spoken by a speaker with a lower level of the speech characteristic from the mixed speech sample, wherein the lower level is lower than the higher level; and joint decode the mixed speech sample with the first neural network and the second neural network using a dataset for high amplitude characteristic data and a dataset for low amplitude characteristic data, wherein joint decoding is performed by finding two state sequences in a two-dimensional joint space of the mixed speech sample such that a sum of a state sequence log likelihood of the two state sequences is maximized.

20. The computer-readable storage memory devices of claim 19, wherein the speech characteristic is one of: an energy of the speech signal; a pitch of the speech signal; and an instantaneous energy of the speech signal.
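The joint decoding step in the independent claims (finding two state sequences in a two-dimensional joint space so that their summed log likelihood is maximized) can be pictured as a Viterbi-style search over pairs of states. The sketch below is one possible reading, not the patented implementation. The function name `joint_viterbi`, the shared transition table, the two-state toy model, and all numeric scores are assumptions made for illustration. In this simplified form the two sequences do not interact, so the result matches two independent Viterbi passes. A real decoder would presumably add coupling terms (for example the switching-point probability mentioned in the abstract), which is where searching the joint space would matter.

```python
import itertools
import math


def joint_viterbi(logp_high, logp_low, log_trans):
    """Find the pair of state sequences with the largest summed log score.

    logp_high[t][i]: frame-t log score of state i from the "high" network.
    logp_low[t][j]:  frame-t log score of state j from the "low" network.
    log_trans[a][b]: log transition score from state a to state b
                     (shared by both sequences in this sketch).
    """
    n_frames = len(logp_high)
    n_states = len(logp_high[0])
    pairs = list(itertools.product(range(n_states), repeat=2))

    # best[(i, j)]: best summed score of any path ending in joint state (i, j).
    best = {(i, j): logp_high[0][i] + logp_low[0][j] for i, j in pairs}
    backptrs = []

    for t in range(1, n_frames):
        new_best, ptr = {}, {}
        for i, j in pairs:
            def step(p):
                # Score of arriving at (i, j) from joint predecessor p.
                return best[p] + log_trans[p[0]][i] + log_trans[p[1]][j]

            prev = max(pairs, key=step)
            new_best[(i, j)] = step(prev) + logp_high[t][i] + logp_low[t][j]
            ptr[(i, j)] = prev
        best = new_best
        backptrs.append(ptr)

    # Trace back from the best final joint state.
    last = max(pairs, key=lambda p: best[p])
    score = best[last]
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path], score


def logs(row):
    return [math.log(x) for x in row]


# Toy per-frame scores for 3 frames and 2 states (made-up numbers).
logp_high = [logs([0.8, 0.2]), logs([0.6, 0.4]), logs([0.1, 0.9])]
logp_low = [logs([0.3, 0.7]), logs([0.2, 0.8]), logs([0.6, 0.4])]
log_trans = [logs([0.7, 0.3]), logs([0.3, 0.7])]

high_seq, low_seq, score = joint_viterbi(logp_high, logp_low, log_trans)
print(high_seq, low_seq, round(score, 4))
```

The search cost grows with the square of the number of joint states per frame, so a practical decoder would likely prune the joint space rather than enumerate every pair as this sketch does.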

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G10L25/84
for discriminating voice from noise · CPC title
G10L15/16 (Primary)
using artificial neural networks · CPC title
G10L15/20
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
G10L25/21
the extracted parameters being power information · CPC title
G10L17/18
Artificial neural networks; Connectionist approaches · CPC title

Patent family

Related publications grouped by family.

View patent family 52808176

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9779727B2 cover?: The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a…
Who is the assignee on this patent?: Microsoft Technology Licensing, LLC
What technology area does this patent fall under?: Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?: Publication date Oct 3, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).