Speaker recognition/location using neural network

US10580414B2 · US · B2

Patent metadata
Publication number: US-10580414-B2
Application number: US-201816006405-A
Country: US
Kind code: B2
Filing date: Jun 12, 2018
Priority date: May 7, 2018
Publication date: Mar 3, 2020
Grant date: Mar 3, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Computing devices and methods utilizing a joint speaker location/speaker identification neural network are provided. In one example a computing device receives a multi-channel audio signal of an utterance spoken by a user. Magnitude and phase information features are extracted from the signal and inputted into a joint speaker location/speaker identification neural network that is trained via utterances from a plurality of persons. A user embedding comprising speaker identification characteristics and location characteristics is received from the neural network and compared to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person. Based at least on the comparisons, the user is matched to an identity of one of the persons, and the identity of the person is outputted.
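The comparison step in the abstract can be illustrated with a small sketch. The abstract only says the user embedding is "compared" to the enrollment embeddings; the use of cosine similarity, the acceptance threshold of 0.7, and the names `cosine_similarity` and `match_identity` are all assumptions made for illustration, not details taken from the patent.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)


def match_identity(
    user_embedding: Sequence[float],
    enrollment: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # assumed cutoff, not specified by the patent
) -> Tuple[Optional[str], float]:
    """Return (identity, score) for the best-scoring enrolled person.

    The identity is None when no enrollment embedding reaches the threshold.
    """
    best_name: Optional[str] = None
    best_score = -1.0
    for name, embedding in enrollment.items():
        score = cosine_similarity(user_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy three-dimensional embeddings; real embeddings would come from the network.
enrollment = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
identity, score = match_identity([0.9, 0.1, 0.0], enrollment)
```

Returning the score alongside the identity lets a caller apply its own rejection policy; a production system might use a different similarity measure or a learned scoring back end.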

First claim

Opening claim text (preview).

The invention claimed is:

1. A computing device, comprising: a processor; and a memory holding instructions executable by the processor to: receive a multi-channel audio signal of an utterance spoken by a user; extract magnitude features and phase information features from the signal; input the magnitude features and the phase information features into a joint speaker location and speaker identification neural network, wherein the joint speaker location and speaker identification neural network is trained using a plurality of utterances from a plurality of persons, wherein each utterance of the plurality of utterances comprises both speaker vocal characteristics and speaker spatial information that are used to train the joint speaker location and speaker identification neural network; receive from the joint speaker location and speaker identification neural network a user embedding comprising speaker identification characteristics and location characteristics; compare the user embedding to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person; based at least on the comparisons, match the user to an identity of one of the persons; and output the identity of the person.

2. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to utilize the location characteristics of the user embedding to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user, and the instructions are executable to output the angular orientation along with the identity of the person matched to the user.

3. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to use the magnitude features to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and the instructions are executable to output the angular orientation along with the identity of the person matched to the user.

4. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network utilizes the phase information features to determine the identity of the user.

5. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is trained via training magnitude features and training phase information features received from the utterances.

6. The computing device of claim 1, wherein the joint speaker location and speaker identification neural network is configured to utilize the user embedding to enhance a speaker profile of the user.

7. The computing device of claim 6, wherein the user embedding further comprises voice activity detection (VAD) characteristics, and the instructions are executable to: determine if the VAD characteristics indicate a human voice; and if the VAD characteristics do not indicate a human voice, then refrain from utilizing the user embedding to enhance the speaker profile.

8. The computing device of claim 6, wherein the user embedding comprises voice overlap characteristics, and the instructions are executable to: determine if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons; and if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons, then refrain from utilizing the user embedding to enhance the speaker profile.

9. The computing device of claim 1, wherein the computing device is a standalone device comprising a microphone array, and the microphone array captures the multi-channel audio signal of the utterance.

10. The computing device of claim 1, wherein the computing device receives the multi-channel audio signal of the utterance from a remote device comprising a microphone that captures the audio signal.

11. The computing device of claim 1, wherein the user is a first user and the utterance is a first utterance, the multi-channel audio signal further comprises a second utterance spoken by a second user, and the instructions are executable to determine a boundary between the first utterance and the second utterance without utilizing information from the plurality of utterances used to train the joint speaker location and speaker identification neural network.

12. At a computing device, a method comprising: receiving a multi-channel audio signal of an utterance spoken by a user; extracting magnitude features and phase information features from the signal; inputting the magnitude features and the phase information features into a joint speaker location and speaker identification neural network, wherein the joint speaker location and speaker identification neural network is trained using a plurality of utterances from a plurality of persons, wherein each utterance of the plurality of utterances comprises both speaker vocal characteristics and speaker spatial information that are used to train the joint speaker location and speaker identification neural network; receiving from the joint speaker location and speaker identification neural network a user embedding comprising speaker identification characteristics and location characteristics; comparing the user embedding to a plurality of enrollment embeddings extracted from the plurality of utterances that are each associated with an identity of a corresponding person; based at least on the comparisons, matching the user to an identity of one of the persons; and outputting the identity of the person.

13. The method of claim 12, further comprising: utilizing the location characteristics of the user embedding to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and outputting the angular orientation along with the identity of the person matched to the user.

14. The method of claim 12, further comprising: using the magnitude features to determine an angular orientation of the user with respect to a microphone array that captured the multi-channel audio signal of the utterance spoken by the user; and outputting the angular orientation along with the identity of the person matched to the user.

15. The method of claim 12, further comprising utilizing the phase information features to determine the identity of the user.

16. The method of claim 12, further comprising training the joint speaker location and speaker identification neural network via training magnitude features and training phase information features received from the utterances.

17. The method of claim 12, further comprising utilizing the user embedding to enhance a speaker profile of the user.

18. The method of claim 17, wherein the user embedding further comprises voice activity detection (VAD) characteristics, the method further comprising: determining if the VAD characteristics indicate a human voice; and if the VAD characteristics do not indicate a human voice, then refraining from utilizing the user embedding to enhance the speaker profile.

19. The method of claim 17, wherein the user embedding comprises voice overlap characteristics, the method further comprising: determining if the voice overlap characteristics indicate that the audio signal contains speech from two or more persons; and
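The profile-enhancement conditions in claims 6 to 8 (and their method counterparts) can be sketched as a simple gate. Everything concrete here is an assumption for illustration: the claims do not say how VAD or overlap characteristics are represented, so `EmbeddingFlags` models them as hypothetical probabilities with assumed 0.5 thresholds, and `SpeakerProfile.enhance` uses a running mean only as one plausible way to "enhance" a profile.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EmbeddingFlags:
    """Hypothetical scores read from the VAD and overlap parts of an embedding."""
    voice_probability: float    # assumed: likelihood that a human voice is present
    overlap_probability: float  # assumed: likelihood of two or more talkers


@dataclass
class SpeakerProfile:
    """Running mean of the embeddings accepted for one enrolled person."""
    mean: List[float]
    count: int = 1

    def enhance(self, embedding: Sequence[float]) -> None:
        """Fold a new embedding into the running mean."""
        if len(embedding) != len(self.mean):
            raise ValueError("embedding length does not match the profile")
        self.count += 1
        self.mean = [m + (e - m) / self.count for m, e in zip(self.mean, embedding)]


def maybe_enhance(
    profile: SpeakerProfile,
    embedding: Sequence[float],
    flags: EmbeddingFlags,
    vad_threshold: float = 0.5,      # assumed cutoff
    overlap_threshold: float = 0.5,  # assumed cutoff
) -> bool:
    """Update the profile unless the embedding looks like non-speech or overlap.

    Returns True when the profile was updated, False when the update was skipped.
    """
    if flags.voice_probability < vad_threshold:
        return False  # no human voice indicated: refrain from enhancing
    if flags.overlap_probability >= overlap_threshold:
        return False  # two or more talkers indicated: refrain from enhancing
    profile.enhance(embedding)
    return True


profile = SpeakerProfile(mean=[1.0, 0.0])
updated = maybe_enhance(profile, [0.0, 1.0], EmbeddingFlags(0.9, 0.1))
```

Keeping the gate separate from the update makes the two refusal conditions easy to test on their own, and leaves the profile untouched whenever either condition fires.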

Assignees

  • Microsoft Technology Licensing, LLC

Inventors

Classifications

  • G10L17/18 (Primary)

    Artificial neural networks; Connectionist approaches · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • for testing textile webs, i.e. woven material · CPC title

  • wherein the signals are derived simultaneously · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10580414B2 cover?
Computing devices and methods that use a joint speaker location/speaker identification neural network. In one example, a computing device receives a multi-channel audio signal of an utterance spoken by a user.
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/18. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 3, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.