Neural Networks for Speaker Verification
US-2019043508-A1 · Feb 7, 2019 · US
US10580414B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10580414-B2 |
| Application number | US-201816006405-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 12, 2018 |
| Priority date | May 7, 2018 |
| Publication date | Mar 3, 2020 |
| Grant date | Mar 3, 2020 |
Official abstract text for this publication.
Computing devices and methods utilizing a joint speaker location/speaker identification neural network are provided. In one example a computing device receives a multi-channel audio signal of an utterance spoken by a user. Magnitude and phase information features are extracted from the signal and inputted into a joint speaker location/speaker identification neural network that is trained via utterances from a plurality of persons. A user embedding comprising speaker identification characteristics and location characteristics is received from the neural network and compared to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person. Based at least on the comparisons, the user is matched to an identity of one of the persons, and the identity of the person is outputted.
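The feature-extraction step named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the magnitude and phase information features are computed, so everything here is an assumption made for illustration: a single frame per microphone channel, a naive discrete Fourier transform, plain (not log) magnitudes, and the hypothetical names `dft_frame` and `extract_features`.

```python
import cmath
import math
from typing import List, Tuple


def dft_frame(frame: List[float]) -> List[complex]:
    """Naive DFT of one real-valued frame, keeping bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def extract_features(
    channels: List[List[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (magnitudes, phases), one row per microphone channel.

    Assumption: one frame per channel; a real system would likely use
    overlapping windowed frames and a fast transform.
    """
    magnitudes, phases = [], []
    for frame in channels:
        spectrum = dft_frame(frame)
        magnitudes.append([abs(c) for c in spectrum])
        phases.append([cmath.phase(c) for c in spectrum])
    return magnitudes, phases


# Two channels carrying the same impulse, the second delayed by one sample.
mags, phases = extract_features([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
```

In this toy input both channels have identical magnitude spectra, while the one-sample delay shows up only as a phase shift between the channels. That is one general reason phase information from a multi-channel signal can carry spatial cues; how the claimed network actually uses each feature type is set out in the claims, not in this sketch.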
Opening claim text (preview).
The invention claimed is:

1. A computing device, comprising: a processor; and a memory holding instructions executable by the processor to: receive a multi-channel audio signal of an utterance spoken by a user; extract magnitude features and phase information features from the signal; input the magnitude features and the phase information features into a joint speaker location and speaker identification neural network, wherein the joint speaker location and speaker identification neural network is trained using a plurality of utterances from a plurality of persons, wherein each utterance of the plurality of utterances comprises both speaker vocal characteristics and speaker spatial information that are used to train the joint speaker location and speaker identification neural network; receive from the joint speaker location and speaker identification neural network a user embedding comprising speaker identification characteristics and location characteristics; compare the user embedding to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person; based at least on the comparisons, match the user to an identity of one of the persons; and output the identity of the person.

2. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to utilize the location characteristics of the user embedding to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user, and the instructions are executable to output the angular orientation along with the identity of the person matched to the user.

3. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to use the magnitude features to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and the instructions are executable to output the angular orientation along with the identity of the person matched to the user.

4. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network utilizes the phase information features to determine the identity of the user.

5. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is trained via training magnitude features and training phase information features received from the utterances.

6. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to utilize the user embedding to enhance a speaker profile of the user.

7. The computing device of claim 6, wherein the user embedding further comprises voice activity detection (VAD) characteristics, and the instructions are executable to: determine if the VAD characteristics indicate a human voice; and if the VAD characteristics do not indicate a human voice, then refrain from utilizing the user embedding to enhance the speaker profile.

8. The computing device of claim 6, wherein the user embedding comprises voice overlap characteristics, and the instructions are executable to: determine if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons; and if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons, then refrain from utilizing the user embedding to enhance the speaker profile.

9. The computing device of claim 1, wherein the computing device is a standalone device comprising a microphone array, and the microphone array captures the multi-channel audio signal of the utterance.

10. The computing device of claim 1, wherein the computing device receives the multi-channel audio signal of the utterance from a remote device comprising a microphone that captures the audio signal.

11. The computing device of claim 1, wherein the user is a first user and the utterance is a first utterance, the multi-channel audio signal further comprises a second utterance spoken by a second user, and the instructions are executable to determine a boundary between the first utterance and the second utterance without utilizing information from the plurality of utterances used to train the joint speaker location and speaker identification neural network.

12. At a computing device, a method comprising: receiving a multi-channel audio signal of an utterance spoken by a user; extracting magnitude features and phase information features from the signal; inputting the magnitude features and the phase information features into a joint speaker location and speaker identification neural network, wherein the joint speaker location and speaker identification neural network is trained using a plurality of utterances from a plurality of persons, wherein each utterance of the plurality of utterances comprises both speaker vocal characteristics and speaker spatial information that are used to train the joint speaker location and speaker identification neural network; receiving from the joint speaker location and speaker identification neural network a user embedding comprising speaker identification characteristics and location characteristics; comparing the user embedding to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person; based at least on the comparisons, matching the user to an identity of one of the persons; and outputting the identity of the person.

13. The method of claim 12, further comprising: utilizing the location characteristics of the user embedding to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and outputting the angular orientation along with the identity of the person matched to the user.

14. The method of claim 12, further comprising: using the magnitude features to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and outputting the angular orientation along with the identity of the person matched to the user.

15. The method of claim 12, further comprising utilizing the phase information features to determine the identity of the user.

16. The method of claim 12, further comprising training the joint speaker location and speaker identification neural network via training magnitude features and training phase information features received from the utterances.

17. The method of claim 12, further comprising utilizing the user embedding to enhance a speaker profile of the user.

18. The method of claim 17, wherein the user embedding further comprises voice activity detection (VAD) characteristics, the method further comprising: determining if the VAD characteristics indicate a human voice; and if the VAD characteristics do not indicate a human voice, then refraining from utilizing the user embedding to enhance the speaker profile.

19. The method of claim 17, wherein the user embedding comprises voice overlap characteristics, the method further comprising: determining if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons; and
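The matching step of claims 1 and 12 and the gated profile enhancement of claims 6 to 8 and 17 to 19 can be sketched as follows. The claims do not specify a similarity measure, a decision threshold, how VAD or overlap characteristics are encoded, or how a profile is enhanced. This sketch therefore assumes cosine similarity, fixed thresholds, scalar scores in [0, 1], and a running-average update. The names `UserEmbedding`, `match_identity`, and `maybe_enhance_profile` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UserEmbedding:
    # Hypothetical layout; the claims only say the embedding "comprises"
    # these characteristics, not how they are represented.
    vector: List[float]    # speaker identification characteristics
    vad_score: float       # assumed probability that the audio is a human voice
    overlap_score: float   # assumed probability of two or more talkers


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_identity(
    emb: UserEmbedding,
    enrollment: Dict[str, List[float]],
    threshold: float = 0.7,  # assumed value, not from the patent
) -> Optional[str]:
    """Return the best-scoring enrolled identity, or None below threshold."""
    best_name, best_score = None, threshold
    for name, enrolled in enrollment.items():
        score = cosine(emb.vector, enrolled)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def maybe_enhance_profile(
    emb: UserEmbedding,
    profile: List[float],
    vad_min: float = 0.5,      # assumed threshold
    overlap_max: float = 0.5,  # assumed threshold
    rate: float = 0.1,         # assumed running-average weight
) -> bool:
    """Update the profile in place unless VAD or overlap gating forbids it."""
    if emb.vad_score < vad_min:
        return False  # not a human voice: refrain from enhancing
    if emb.overlap_score > overlap_max:
        return False  # two or more talkers: refrain from enhancing
    for i, value in enumerate(emb.vector):
        profile[i] = (1.0 - rate) * profile[i] + rate * value
    return True


enrollment = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
utterance = UserEmbedding([0.8, 0.2], vad_score=0.9, overlap_score=0.1)
identity = match_identity(utterance, enrollment)
if identity is not None:
    maybe_enhance_profile(utterance, enrollment[identity])
```

Gating the update separately from the match means a noisy or overlapped utterance can still be identified without being allowed to contaminate the stored profile.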
Artificial neural networks; Connectionist approaches · CPC title
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
for testing textile webs, i.e. woven material · CPC title
wherein the signals are derived simultaneously · CPC title