System and method for personalized speaker verification

US11031018B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11031018-B2
Application number: US-202017131182-A
Country: US
Kind code: B2
Filing date: Dec 22, 2020
Priority date: Oct 31, 2019
Publication date: Jun 8, 2021
Grant date: Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.
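The update loop in the abstract can be illustrated with a small sketch. This is not the patented implementation: the "first model" here is a toy linear embedding, the classifier is a fixed logistic layer with a cross-entropy loss, and every name (`embed`, `sample_gradient`, `personalize`, `lr`) and every number is a hypothetical chosen for illustration.

```python
import math

def embed(W, x):
    """Toy stand-in for the first model: a linear map from features to a voice vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(W, v, x):
    """Probability that segment x belongs to the enrolled speaker (assumed classifier)."""
    return sigmoid(sum(vi * ei for vi, ei in zip(v, embed(W, x))))

def loss(W, v, samples):
    """Mean binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for x, y in samples:
        p = predict(W, v, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)

def sample_gradient(W, v, x, y):
    """Gradient of the cross-entropy loss w.r.t. W for one speech segment."""
    err = predict(W, v, x) - y  # dLoss/dlogit for sigmoid + cross-entropy
    return [[err * vi * xj for xj in x] for vi in v]

def personalize(W, v, samples, lr=0.5):
    """Average the per-sample gradients, then take one step against the average."""
    grads = [sample_gradient(W, v, x, y) for x, y in samples]
    n, rows, cols = len(grads), len(W), len(W[0])
    avg = [[sum(g[i][j] for g in grads) / n for j in range(cols)] for i in range(rows)]
    return [[W[i][j] - lr * avg[i][j] for j in range(cols)] for i in range(rows)]

# Made-up data: label 1 = enrolled speaker (positive), 0 = someone else (negative).
first_model = [[0.1, -0.2, 0.0], [0.05, 0.1, -0.1]]
v = [1.0, -1.0]  # fixed classifier weights (assumption)
samples = [
    ([1.0, 0.2, 0.1], 1),
    ([0.9, 0.3, 0.0], 1),
    ([-0.8, 0.1, 0.4], 0),
    ([-1.0, -0.2, 0.3], 0),
]
second_model = personalize(first_model, v, samples)
print(loss(first_model, v, samples), loss(second_model, v, samples))
```

The second printed loss should be lower than the first, which is the sense in which the updated parameters are "personalized" to the enrolled speaker in this toy setting.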

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for personalized speaker verification, comprising: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample, wherein the first speech data comprises one or more speech segments of the speaker, and wherein the second speech data comprises one or more speech segments of one or more people other than the speaker; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker by at least feeding the one or more speech segments of the speaker into the first model to correspondingly output one or more positive sample vectors, and feeding the one or more speech segments of the one or more people other than the speaker into the first model to correspondingly output one or more negative sample vectors; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.

2. The method of claim 1, further comprising: averaging the one or more positive sample vectors to obtain a template vector of the speaker.

3. The method of claim 2, further comprising: obtaining speech data of a user; feeding the obtained speech data to the second model to obtain an input vector of the user; comparing the input vector of the user with the template vector of the speaker; and verifying if the user is the speaker based at least on the comparison.

4. The method of claim 1, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic comprises: feeding the one or more positive sample vectors and the one or more negative sample vectors into a neural network classifier to obtain one or more gradient vectors.

5. The method of claim 4, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic further comprises: averaging the one or more gradient vectors to obtain an average gradient vector of the speaker as the gradient.

6. The method of claim 5, wherein: feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the average gradient vector of the speaker to the first model to update the one or more parameters of the first model; and the one or more parameters associate different neural layers of the first model.

7. The method of claim 4, wherein feeding the one or more positive sample vectors and the one or more negative sample vectors into the neural network classifier to obtain one or more gradient vectors comprises: obtaining the gradient based at least on backward propagation through a cross-entropy loss function of the neural network classifier.

8. The method of claim 1, wherein feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the gradient to the first model to update the one or more parameters of the first model based at least on the gradient and an online machine learning rate.

9. The method of claim 8, feeding the gradient to the first model to update the one or more parameters of the first model based at least on the gradient and the online machine learning rate comprises: updating the one or more parameters in a direction in which the gradient descents at a fastest online machine learning rate.

10. The method of claim 1, wherein: before feeding the positive sample and the negative sample to the first model for determining voice characteristics, the first model has been trained at least by jointly minimizing a first loss function that optimizes speaker classification and a second loss function that optimizes speaker clustering.

11. The method of claim 10, wherein: the first loss function comprises a non-sampling-based loss function; and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix.

12. One or more non-transitory computer-readable storage media storing instructions executable by one or more processors, wherein execution of the instructions causes the one or more processors to perform operations comprising: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample, wherein the first speech data comprises one or more speech segments of the speaker, and wherein the second speech data comprises one or more speech segments of one or more people other than the speaker; feeding the positive sample and the negative sample to a first model for determining voice characteristics to correspondingly output a positive voice characteristic and a negative voice characteristic of the speaker by at least feeding the one or more speech segments of the speaker into the first model to correspondingly output one or more positive sample vectors, and feeding the one or more speech segments of the one or more people other than the speaker into the first model to correspondingly output one or more negative sample vectors; obtaining a gradient based at least on the positive voice characteristic and the negative voice characteristic; and feeding the gradient to the first model to update one or more parameters of the first model to obtain a second model for personalized speaker verification.

13. The one or more non-transitory computer-readable storage media of claim 12, wherein the operations further comprise: averaging the one or more positive sample vectors to obtain a template vector of the speaker.

14. The one or more non-transitory computer-readable storage media of claim 13, wherein the operations further comprise: obtaining speech data of a user; feeding the obtained speech data to the second model to obtain an input vector of the user; comparing the input vector of the user with the template vector of the speaker; and verifying if the user is the speaker based at least on the comparison.

15. The one or more non-transitory computer-readable storage media of claim 12, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic comprises: feeding the one or more positive sample vectors and the one or more negative sample vectors into a neural network classifier to obtain one or more gradient vectors.

16. The one or more non-transitory computer-readable storage media of claim 15, wherein obtaining the gradient based at least on the positive voice characteristic and the negative voice characteristic further comprises: averaging the one or more gradient vectors to obtain an average gradient vector of the speaker as the gradient.

17. The one or more non-transitory computer-readable storage media of claim 16, wherein: feeding the gradient to the first model to update the one or more parameters of the first model comprises: feeding the average gradient vector of the speaker to the first model to update the one or more parameters of the first model; and the one or more parameters associate different neural layers of the first model.

18. A system comprising one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instruc
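Claims 2 and 3 (mirrored in claims 13 and 14) describe averaging the positive sample vectors into a template and comparing a new input vector against it. A minimal sketch follows. The claims do not say how the comparison is made, so the cosine similarity, the `threshold` value, and all function names and vectors here are assumptions for illustration only.

```python
import math

def template_vector(positive_vectors):
    """Element-wise mean of the speaker's positive sample vectors."""
    n = len(positive_vectors)
    return [sum(column) / n for column in zip(*positive_vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def verify(input_vector, template, threshold=0.7):
    """Accept the user as the speaker when similarity reaches the assumed threshold."""
    return cosine_similarity(input_vector, template) >= threshold

# Made-up vectors standing in for outputs of the second (personalized) model.
enrolled = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]]
template = template_vector(enrolled)

same_speaker = verify([0.8, 0.05, 0.1], template)
other_speaker = verify([-0.5, 0.9, 0.0], template)
print(same_speaker, other_speaker)
```

In practice the threshold would be tuned on held-out data to trade false accepts against false rejects; the value above is only a placeholder.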

Assignees

Alipay Hangzhou Inf Tech Co Ltd

Inventors

Classifications

  • G10L17/04 (Primary)

    Training, enrolment or model building · CPC title

  • Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

  • G10L17/18 (Primary)

    Artificial neural networks; Connectionist approaches · CPC title

  • Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title

  • Decision making techniques; Pattern matching strategies · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11031018B2 cover?
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for personalized speaker verification are provided. One of the methods includes: obtaining first speech data of a speaker as a positive sample and second speech data of an entity different from the speaker as a negative sample; feeding the positive sample and the negative sample to a first model for …
Who is the assignee on this patent?
Alipay Hangzhou Inf Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L17/04. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 8, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).