Device and method with target speaker identification

US12555565B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12555565-B2
Application number  US-202217951585-A
Country             US
Kind code           B2
Filing date         Sep 23, 2022
Priority date       Sep 30, 2021
Publication date    Feb 17, 2026
Grant date          Feb 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A processor-implemented method includes: extracting a target speaker voice feature based on an input voice of a target speaker; determining an utterance scenario of the input voice based on the target speaker voice feature; generating a final target speaker voice feature based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final target speaker voice feature and a final user voice feature, wherein the determined utterance scenario comprises either one of a single-speaker scenario and a multiple-speaker scenario.
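The four steps in the abstract can be sketched as a small control flow. This is a minimal illustration, not the patented implementation: the mean-squared-error test and the rule for choosing which feature to embed follow claims 4 and 5 as quoted on this page, while cosine similarity as the similarity value, both threshold parameters, and the callables `extract` and `embed` (stand-ins for the first and second neural networks) are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frames = Sequence[Sequence[float]]  # time frames x feature bins
Vector = Sequence[float]


def mean_squared_error(a: Frames, b: Frames) -> float:
    """Average squared difference over all entries of two same-shaped features."""
    squared = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(squared) / len(squared)


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two embeddings (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def verify(
    original: Frames,
    extract: Callable[[Frames], Frames],       # hypothetical first network
    embed: Callable[[Frames], List[float]],    # hypothetical second network
    final_user_feature: Vector,
    mse_threshold: float,                      # hypothetical tuning parameter
    accept_threshold: float,                   # hypothetical tuning parameter
) -> Tuple[str, float, bool]:
    # Step 1: extract the target speaker voice feature.
    target = extract(original)
    # Step 2: a small change after extraction suggests a single speaker.
    is_single = mean_squared_error(original, target) < mse_threshold
    scenario = "single" if is_single else "multiple"
    # Step 3: embed the original feature (single) or the extracted one (multiple).
    final_target = embed(original if is_single else target)
    # Step 4: compare against the enrolled user's final feature.
    score = cosine_similarity(final_target, final_user_feature)
    return scenario, score, score >= accept_threshold
```

Passing the networks in as callables keeps the decision logic separate from any particular model, so the same flow can be exercised with simple stand-in functions.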

First claim

Opening claim text (preview).

What is claimed is:

1. A processor-implemented method, the method comprising: generating an original voice feature based on an input voice of a target speaker; extracting, by a first neural network, a target speaker voice feature based on the original voice feature; determining whether an utterance scenario of the input voice is a single-speaker scenario or a multiple-speaker scenario by comparing the original voice feature with the target speaker voice feature; generating, by inputting the original voice feature or the extracted target speaker feature into a second neural network, using a result of the determination of the utterance scenario, a final target speaker voice feature; and generating a user verification result by determining whether the target speaker corresponds to a user by comparing the final target speaker voice feature with a final user voice feature, wherein the determined utterance scenario comprises a single-speaker scenario and a multiple-speaker scenario.

2. The method of claim 1, wherein the extracting of the target speaker voice feature comprises: inputting the original voice feature and a middle user embedding voice feature to the first neural network.

3. The method of claim 2, wherein the first neural network comprises: a first convolution layer configured to output a speaker extraction embedding feature, for extracting the target speaker voice feature included in the input voice, based on an input of the original voice feature; a splicing layer configured to output a splicing feature based on an input of the speaker extraction embedding feature and the middle user embedding voice feature; a second convolution layer configured to output a mask based on an input of the splicing feature; and a multiplier configured to output the target speaker voice feature based on an input of the mask and the speaker extraction embedding feature.

4. The method of claim 1, wherein the determining of the utterance scenario of the input voice by comparing the original voice feature and the target speaker voice feature comprises either of: in response to a mean squared error between the original voice feature and the target speaker voice feature being less than a threshold value, determining the utterance scenario as the single-speaker scenario; and in response to the mean squared error between the original voice feature and the target speaker voice feature being greater than or equal to the threshold value, determining the utterance scenario as the multiple-speaker scenario.

5. The method of claim 1, wherein, by the second neural network, the generating of the final target speaker voice feature comprises: in response to determining the utterance scenario as the single-speaker scenario, inputting the original voice feature to the second neural network; and in response to determining the utterance scenario as the multiple-speaker scenario, inputting the target speaker voice feature to the second neural network.

6. The method of claim 5, wherein the second neural network comprises: a speaker embedding layer configured to output a target speaker middle embedding voice feature based on an input of either one of the original voice feature and the target speaker voice feature; and an attentive statistics pooling layer configured to output the final target speaker voice feature based on an input of the target speaker middle embedding voice feature.

7. The method of claim 6, wherein the determining of whether the target speaker corresponds to the user comprises: determining a similarity value between the final target speaker voice feature and the final user voice feature; and determining whether the target speaker corresponds to the user based on the determined similarity value.

8. The method of claim 7, wherein the middle user embedding voice feature is generated as a result of inputting a user voice feature, which is generated based on an input of a user voice, to the speaker embedding layer, and the final user voice feature is generated as a result of inputting the middle user embedding voice feature to the attentive statistics pooling layer.

9. The method of claim 8, wherein the first neural network and the second neural network are trained jointly with a third network for converting a speaker voice feature based on a speaker middle embedding voice feature.

10. An electronic device comprising: one or more processors; a memory comprising one or more non-transitory storage media that store instructions that, when executed by the one or more processors, configures the device to: generate an original voice feature based on an input voice of a target speaker; extract, by a first neural network, a target speaker voice feature based on the original voice feature; determine whether an utterance scenario of the input voice is a single-speaker scenario or a multiple-speaker scenario by comparing the original voice feature with the target speaker voice feature; generate, by inputting the original voice feature or the extracted target speaker feature into a second neural network, using a result of the determination of the utterance scenario, a final target speaker voice feature; and generate a user verification result by determining whether the target speaker corresponds to a user by comparing the final target speaker voice feature with a final user voice feature, wherein the determined utterance scenario comprises a single-speaker scenario and a multiple-speaker scenario.

11. The electronic device of claim 10, wherein, for the extracting of the target speaker voice feature, the execution of the instructions further configures the device to: inputting the original voice feature and a middle user embedding voice feature to the first neural network.

12. The electronic device of claim 11, wherein the first neural network comprises: a first convolution layer configured to output a speaker extraction embedding feature, for extracting the target speaker voice feature included in the input voice, based on an input of the original voice feature; a splicing layer which outputs a splicing feature based on an input of the speaker extraction embedding feature and the middle user embedding voice feature; a second convolution layer which outputs a mask based on an input of the splicing feature; and a multiplier which outputs the target speaker voice feature based on an input of the mask and the speaker extraction embedding feature.

13. The electronic device of claim 10, wherein, for the determining of the utterance scenario of the input voice by comparing the original voice feature and the target speaker voice feature, the execution of the instructions further configures the device to: in response to a mean squared error between the original voice feature and the target speaker voice feature being less than a threshold value, determine the utterance scenario as the single-speaker scenario; and in response to the mean squared error between the original voice feature and the target speaker voice feature being greater than or equal to the threshold value, determine the utterance scenario as the multiple-speaker scenario.

14. The electronic device of claim 13, wherein, for the generating of the final target speaker voice feature, the execution of the instructions further configures the device to: in response to determining the utterance scenario as the single-speaker scenario, generate the final target speaker voice feature by inputting the original voice feature to the second neural network; and in response to determining the utterance scenario as the mu
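The layer structure recited in claims 3 and 6 can be illustrated with a toy forward pass. The sketch below is only one possible reading: the claims name the layers but give no kernel sizes, activations, or shapes, so the kernel-size-1 convolutions (per-frame affine maps), the sigmoid on the mask, the single user embedding vector appended to every frame, and the tanh-based attention score are all assumptions, and every function and parameter name is hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # weights, or features shaped (frames x dims)


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> Vector:
    """Affine map on one frame; stands in for a kernel-size-1 convolution."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(z / 2.0))


def extract_target(original: Matrix, user_embedding: Sequence[float],
                   w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Matrix:
    """Claim 3 sketch: convolution -> splice -> convolution (mask) -> multiply."""
    output = []
    for frame in original:
        emb = linear(frame, w1, b1)                # first convolution layer
        spliced = emb + list(user_embedding)       # splicing layer (concatenation)
        mask = [sigmoid(z) for z in linear(spliced, w2, b2)]  # second convolution
        output.append([m * e for m, e in zip(mask, emb)])     # multiplier
    return output


def attentive_stats_pooling(frames: Matrix, attn_vector: Sequence[float]) -> Vector:
    """Claim 6 sketch: attention-weighted mean and standard deviation over frames."""
    scores = [sum(v * math.tanh(h) for v, h in zip(attn_vector, f)) for f in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]    # softmax over time
    total = sum(exps)
    alphas = [e / total for e in exps]
    dims = range(len(frames[0]))
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in dims]
    var = [sum(a * f[d] ** 2 for a, f in zip(alphas, frames)) - mean[d] ** 2 for d in dims]
    std = [math.sqrt(max(v, 0.0)) for v in var]    # clamp tiny negative round-off
    return mean + std                              # fixed-length utterance feature
```

The pooling step maps a variable number of frames to a fixed-length vector, which is what allows the final target speaker feature and the final user feature to be compared directly.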

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • for comparison or discrimination · CPC title

  • using neural networks · CPC title

  • Artificial neural networks; Connectionist approaches · CPC title

  • G10L15/02 · Primary

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12555565B2 cover?
A processor-implemented method includes: extracting a target speaker voice feature based on an input voice of a target speaker; determining an utterance scenario of the input voice based on the target speaker voice feature; generating a final target speaker voice feature based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the fin…
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/02. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 17, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).