US10818311B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10818311-B2 |
| Application number | US-201816632373-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 14, 2018 |
| Priority date | Nov 15, 2017 |
| Publication date | Oct 27, 2020 |
| Grant date | Oct 27, 2020 |
Official abstract text for this publication.
An auditory selection method based on a memory and attention model, including: step S1, encoding an original speech signal into a time-frequency matrix; step S2, encoding and transforming the time-frequency matrix to convert the matrix into a speech vector; step S3, using a long-term memory unit to store a speaker and a speech vector corresponding to the speaker; step S4, obtaining a speech vector corresponding to a target speaker, and separating a target speech from the original speech signal through an attention selection model. A storage device includes a plurality of programs stored in the storage device. The plurality of programs are configured to be loaded by a processor and execute the auditory selection method based on the memory and attention model. A processing unit includes the processor and the storage device.
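Step S1 of the abstract can be illustrated with a short sketch. The abstract only says that the speech signal is encoded into a time-frequency matrix; the example below assumes a short-time Fourier transform magnitude as that encoding. The function name `encode_time_frequency`, the Hann window, and the `frame_len` and `hop` values are illustrative assumptions, not details taken from this record.

```python
import numpy as np


def encode_time_frequency(signal, frame_len=256, hop=64):
    """Return a (frames, frequency bins) magnitude matrix for a 1-D signal.

    Assumption: a Hann-windowed short-time Fourier transform stands in for
    the unspecified time-frequency encoding of step S1.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad very short signals so at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1))


if __name__ == "__main__":
    sample_rate = 8000  # hypothetical sampling rate in Hz
    t = np.arange(sample_rate) / sample_rate
    matrix = encode_time_frequency(np.sin(2 * np.pi * 1000 * t))
    print(matrix.shape)
```

Each row of the returned matrix is one time frame and each column one frequency bin, which is the kind of input that step S2 would then encode into a speech vector.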
Opening claim text (preview).
What is claimed is:

1. An auditory selection method based on a memory and attention model, comprising:

encoding an original speech signal into a matrix containing time-frequency dimensions;

encoding and transforming the matrix containing the time-frequency dimensions to convert the matrix containing the time-frequency dimensions into a speech vector using a bi-directional long short-term memory (BiLSTM) network model to encode the matrix containing the time-frequency dimensions in a sequential order and in a reverse order, respectively, to obtain a first hidden layer vector and a second hidden layer vector, respectively;

wherein, the BiLSTM network model is configured to encode the matrix containing the time-frequency dimensions to obtain a hidden layer vector, and a formula of the BiLSTM network model comprises:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f\right) \\
c_t &= f_t c_{t-1} + i_t \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o\right) \\
h_t &= o_t \tanh\left(c_t\right)
\end{aligned}
$$

where, $i$, $f$, $c$, $o$, and $h$ respectively represent an input gate, a forget gate, a storage unit, an output gate, and the hidden layer vector of the BiLSTM network model, $\sigma$ represents a Sigmoid function, $x$ represents an input vector, and $t$ represents a time;

where, $W_{xi}$, $W_{hi}$, and $W_{ci}$ respectively represent an encoding matrix parameter of an input vector $x_t$ in the input gate at a current time, an encoding matrix parameter of the hidden layer vector $h_{t-1}$ in the input gate at a previous time, and an encoding matrix parameter of a memory unit $c_{t-1}$ in the input gate at the previous time; $b_i$ represents an information bias parameter in the input gate;

where, $W_{xf}$, $W_{hf}$, and $W_{cf}$ respectively represent an encoding matrix parameter of the input vector $x_t$ in the forget gate at the current time, an encoding matrix parameter of the hidden layer vector $h_{t-1}$ in the forget gate at the previous time, and an encoding matrix parameter of the memory unit $c_{t-1}$ in the forget gate at the previous time; $b_f$ represents an information bias parameter in the forget gate;

where, $W_{xc}$ and $W_{hc}$ respectively represent an encoding matrix parameter of the input vector $x_t$ in the storage unit at the current time and an encoding matrix parameter of the hidden layer vector $h_{t-1}$ in the storage unit at the previous time; $b_c$ represents an information bias parameter in the storage unit; and

where, $W_{xo}$, $W_{ho}$, and $W_{co}$ respectively represent an encoding matrix parameter of the input vector $x_t$ in the output gate at the current time, an encoding matrix parameter of the hidden layer vector $h_{t-1}$ in the output gate at the previous time, and an encoding matrix parameter of the memory unit $c_{t-1}$ in the output gate at the previous time; $b_o$ represents an information bias parameter in the output gate;

storing a speaker and a speech vector corresponding to the speaker in a long-term memory unit;

obtaining a speech vector corresponding to a target speaker from the long-term memory unit; and

according to the speech vector corresponding to the target speaker, separating a target speech from the original speech signal by an attention selection model.

2. The auditory selection method based on the memory and attention model according to claim 1, wherein, before “encoding the original speech signal into the matrix containing the time-frequency dimensions”, the auditory selection method further comprises: resampling the original speech signal to form a resampled speech signal, and filtering the resampled speech signal to reduce a sampling rate of the original speech signal.

3. The auditory selection method based on the memory and attention model according to claim 2, wherein, the step of “encoding and transforming the matrix containing the time-frequency dimensions to convert the matrix containing the time-frequency dimensions into the speech vector” comprises: fusing the first hidden layer vector with the second hidden layer vector at a time corresponding to the first hidden layer vector to obtain a third hidden layer vector; and converting the third hidden layer vector into the speech vector through a fully connected layer; wherein, the matrix containing the time-frequency dimensions is encoded in sequential order at a first time and the matrix containing the time-frequency dimensions is encoded in reverse order at a second time, and the first time corresponds to the second time.

4. The auditory selection method based on the memory and attention model according to claim 3, wherein, the step of “fusing the first hidden layer vector with the second hidden layer vector at the time corresponding to the first hidden layer vector” comprises: adding the first hidden layer vector to the second hidden layer vector, or calculating an average value of the first hidden layer vector and the second hidden layer vector, or splicing the first hidden layer vector and the second hidden layer vector end to end.

5. The auditory selection method based on the memory and attention model according to claim 1, wherein, the step of “storing the speaker and the speech vector corresponding to the speaker in the long-term memory unit” comprises: storing the speaker and the speech vector corresponding to the speaker in the long-term memory unit in a Key-Value form, wherein a Key is configured to store an index of the speaker and a Value is configured to store the speech vector corresponding to the speaker.

6. The auditory selection method based on the memory and attention model according to claim 5, wherein, after “storing the speaker and the speech vector corresponding to the speaker in the long-term memory unit”, the auditory selection method further comprises: when the speaker generates a new speech, extracting a new speech vector of the new speech of the speaker, and updating the speech vector of the speaker stored in the long-term memory unit to replace an original speech vector of the speaker with the new speech vector.

7. The auditory selection method based on the memory and attention model according to claim 6, wherein, the step of “updating the speech vector of the speaker” comprises: after the new speech vector of the speaker is extracted, adding the new speech vector to the original speech vector of the speaker in the long-term memory unit, normalizing amplitudes in an obtained result, wherein a formula of normalizing the amplitudes in the obtained result is as follows:

$$
v = \frac{q + v_1}{\lVert q + v_1 \rVert},
$$

where, $q$ represents a new speech vector generated by the speaker, $v_1$ represents the original speech vector of the speaker, and $v$ represents an updated speech vector of the speaker.

8. The auditory selection method based on the memory and attention model according to claim 1, wherein,
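The long-term memory unit of claims 5 to 7 can be read as a small key-value store with an additive, norm-preserving update. The sketch below is a minimal illustration under stated assumptions: the class name `LongTermMemory`, its method names, the use of a Python dictionary, and the choice to normalise a speaker's first vector on insertion are all hypothetical and are not specified by the claims. The update itself follows the claim 7 formula, in which the new vector is added to the stored one and the sum is divided by its norm.

```python
import numpy as np


class LongTermMemory:
    """Key-Value store: speaker index (Key) -> speech vector (Value)."""

    def __init__(self):
        self._store = {}

    def update(self, speaker, q):
        """Store or refresh a speaker's vector and return the stored value.

        Existing speaker: v = (q + v1) / ||q + v1||, as in claim 7.
        New speaker: assumption, the first vector is stored normalised.
        """
        q = np.asarray(q, dtype=np.float64)
        v1 = self._store.get(speaker)
        total = q if v1 is None else q + v1
        norm = np.linalg.norm(total)
        if norm == 0.0:
            raise ValueError("cannot normalise a zero vector")
        self._store[speaker] = total / norm
        return self._store[speaker]

    def lookup(self, speaker):
        """Return the stored speech vector for a target speaker."""
        return self._store[speaker]


if __name__ == "__main__":
    memory = LongTermMemory()
    memory.update("speaker_0", [3.0, 4.0])
    memory.update("speaker_0", [0.0, 1.0])
    print(memory.lookup("speaker_0"))
```

Because every stored vector has unit norm after an update, a new utterance shifts the stored direction without letting its magnitude grow over repeated updates.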
Recurrent networks, e.g. Hopfield networks · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Learning methods · CPC title
the noise being separate speech, e.g. cocktail party · CPC title