Speaker diarization using speaker embedding(s) and trained generative model
US-2021217411-A1 · Jul 15, 2021 · US
US12555565B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12555565-B2 |
| Application number | US-202217951585-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 23, 2022 |
| Priority date | Sep 30, 2021 |
| Publication date | Feb 17, 2026 |
| Grant date | Feb 17, 2026 |
Official abstract text for this publication.
A processor-implemented method includes: extracting a target speaker voice feature based on an input voice of a target speaker; determining an utterance scenario of the input voice based on the target speaker voice feature; generating a final target speaker voice feature based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final target speaker voice feature and a final user voice feature, wherein the determined utterance scenario comprises either one of a single-speaker scenario and a multiple-speaker scenario.
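The decision flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the mean-squared-error comparison and the input routing follow claims 4 and 5 of this publication, while the cosine similarity measure, both threshold values, and the names `verify` and `embed` (a stand-in for the second neural network) are assumptions introduced here.

```python
import math


def mean_squared_error(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine similarity; an assumed choice, the claims only say 'similarity value'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def determine_scenario(original, extracted, mse_threshold):
    """Claim 4: MSE below the threshold means single-speaker, otherwise multiple-speaker."""
    if mean_squared_error(original, extracted) < mse_threshold:
        return "single"
    return "multiple"


def verify(original, extracted, final_user, embed,
           mse_threshold=0.1, sim_threshold=0.7):
    """Hypothetical end-to-end check; `embed` stands in for the second network."""
    scenario = determine_scenario(original, extracted, mse_threshold)
    # Claim 5: the original feature is used for a single speaker,
    # the extracted target speaker feature for multiple speakers.
    net_input = original if scenario == "single" else extracted
    final_target = embed(net_input)
    similarity = cosine_similarity(final_target, final_user)
    return scenario, similarity >= sim_threshold
```

Called with an identity function as `embed`, identical original and extracted features yield the single-speaker branch, while features that differ by more than the threshold route the extracted feature into the embedding step.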
Opening claim text (preview).
What is claimed is:
1. A processor-implemented method, the method comprising: generating an original voice feature based on an input voice of a target speaker; extracting, by a first neural network, a target speaker voice feature based on the original voice feature; determining whether an utterance scenario of the input voice is a single-speaker scenario or a multiple-speaker scenario by comparing the original voice feature with the target speaker voice feature; generating, by inputting the original voice feature or the extracted target speaker feature into a second neural network, using a result of the determination of the utterance scenario, a final target speaker voice feature; and generating a user verification result by determining whether the target speaker corresponds to a user by comparing the final target speaker voice feature with a final user voice feature, wherein the determined utterance scenario comprises a single-speaker scenario and a multiple-speaker scenario.
2. The method of claim 1, wherein the extracting of the target speaker voice feature comprises: inputting the original voice feature and a middle user embedding voice feature to the first neural network.
3. The method of claim 2, wherein the first neural network comprises: a first convolution layer configured to output a speaker extraction embedding feature, for extracting the target speaker voice feature included in the input voice, based on an input of the original voice feature; a splicing layer configured to output a splicing feature based on an input of the speaker extraction embedding feature and the middle user embedding voice feature; a second convolution layer configured to output a mask based on an input of the splicing feature; and a multiplier configured to output the target speaker voice feature based on an input of the mask and the speaker extraction embedding feature.
4. The method of claim 1, wherein the determining of the utterance scenario of the input voice by comparing the original voice feature and the target speaker voice feature comprises either of: in response to a mean squared error between the original voice feature and the target speaker voice feature being less than a threshold value, determining the utterance scenario as the single-speaker scenario; and in response to the mean squared error between the original voice feature and the target speaker voice feature being greater than or equal to the threshold value, determining the utterance scenario as the multiple-speaker scenario.
5. The method of claim 1, wherein, by the second neural network, the generating of the final target speaker voice feature comprises: in response to determining the utterance scenario as the single-speaker scenario, inputting the original voice feature to the second neural network; and in response to determining the utterance scenario as the multiple-speaker scenario, inputting the target speaker voice feature to the second neural network.
6. The method of claim 5, wherein the second neural network comprises: a speaker embedding layer configured to output a target speaker middle embedding voice feature based on an input of either one of the original voice feature and the target speaker voice feature; and an attentive statistics pooling layer configured to output the final target speaker voice feature based on an input of the target speaker middle embedding voice feature.
7. The method of claim 6, wherein the determining of whether the target speaker corresponds to the user comprises: determining a similarity value between the final target speaker voice feature and the final user voice feature; and determining whether the target speaker corresponds to the user based on the determined similarity value.
8. The method of claim 7, wherein the middle user embedding voice feature is generated as a result of inputting a user voice feature, which is generated based on an input of a user voice, to the speaker embedding layer, and the final user voice feature is generated as a result of inputting the middle user embedding voice feature to the attentive statistics pooling layer.
9. The method of claim 8, wherein the first neural network and the second neural network are trained jointly with a third network for converting a speaker voice feature based on a speaker middle embedding voice feature.
10. An electronic device comprising: one or more processors; a memory comprising one or more non-transitory storage media that store instructions that, when executed by the one or more processors, configures the device to: generate an original voice feature based on an input voice of a target speaker; extract, by a first neural network, a target speaker voice feature based on the original voice feature; determine whether an utterance scenario of the input voice is a single-speaker scenario or a multiple-speaker scenario by comparing the original voice feature with the target speaker voice feature; generate, by inputting the original voice feature or the extracted target speaker feature into a second neural network, using a result of the determination of the utterance scenario, a final target speaker voice feature; and generate a user verification result by determining whether the target speaker corresponds to a user by comparing the final target speaker voice feature with a final user voice feature, wherein the determined utterance scenario comprises a single-speaker scenario and a multiple-speaker scenario.
11. The electronic device of claim 10, wherein, for the extracting of the target speaker voice feature, the execution of the instructions further configures the device to: inputting the original voice feature and a middle user embedding voice feature to the first neural network.
12. The electronic device of claim 11, wherein the first neural network comprises: a first convolution layer configured to output a speaker extraction embedding feature, for extracting the target speaker voice feature included in the input voice, based on an input of the original voice feature; a splicing layer which outputs a splicing feature based on an input of the speaker extraction embedding feature and the middle user embedding voice feature; a second convolution layer which outputs a mask based on an input of the splicing feature; and a multiplier which outputs the target speaker voice feature based on an input of the mask and the speaker extraction embedding feature.
13. The electronic device of claim 10, wherein, for the determining of the utterance scenario of the input voice by comparing the original voice feature and the target speaker voice feature, the execution of the instructions further configures the device to: in response to a mean squared error between the original voice feature and the target speaker voice feature being less than a threshold value, determine the utterance scenario as the single-speaker scenario; and in response to the mean squared error between the original voice feature and the target speaker voice feature being greater than or equal to the threshold value, determine the utterance scenario as the multiple-speaker scenario.
14. The electronic device of claim 13, wherein, for the generating of the final target speaker voice feature, the execution of the instructions further configures the device to: in response to determining the utterance scenario as the single-speaker scenario, generate the final target speaker voice feature by inputting the original voice feature to the second neural network; and in response to determining the utterance scenario as the mu
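Claim 6 names an attentive statistics pooling layer without giving its formula. The sketch below shows one common formulation of that layer as an illustration only: each frame gets an attention score, the scores are normalized with a softmax over time, and the output concatenates the attention-weighted mean and standard deviation. The function name `attentive_stats_pooling`, the parameter shapes, and the single hidden projection are assumptions made here, not details taken from the patent.

```python
import math


def attentive_stats_pooling(frames, W, b, v):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T vectors, each of length D (assumed frame-level embeddings)
    W:      H x D projection matrix (hypothetical learned parameter)
    b:      bias vector of length H (hypothetical learned parameter)
    v:      scoring vector of length H (hypothetical learned parameter)
    Returns a list of length 2 * D: weighted mean followed by weighted std.
    """
    # Attention score per frame: e_t = v . tanh(W h_t + b)
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over time, shifted by the maximum for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    dim = len(frames[0])
    # Attention-weighted first and second moments per feature dimension
    mean = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    second = [sum(a * h[d] ** 2 for a, h in zip(alphas, frames)) for d in range(dim)]
    # Clamp at zero to guard against small negative values from rounding
    std = [math.sqrt(max(s - mu * mu, 0.0)) for s, mu in zip(second, mean)]
    return mean + std
```

With all parameters set to zero the attention weights are uniform, so the layer reduces to plain mean and standard deviation pooling; learned parameters would instead weight the frames by their estimated relevance to speaker identity.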
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
for comparison or discrimination · CPC title
using neural networks · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title