Who is the assignee on this patent?

Alipay Hangzhou Inf Tech Co Ltd

What technology area does this patent fall under?

Primary CPC classification G10L17/18. Mapped technology areas include Physics.

When was this patent published?

Publication date May 4, 2021 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for determining voice characteristics

US10997980B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10997980-B2
Application number	US-202017081956-A
Country	US
Kind code	B2
Filing date	Oct 27, 2020
Priority date	Oct 31, 2019
Publication date	May 4, 2021
Grant date	May 4, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining voice characteristics are provided. One of the methods includes: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
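As a rough illustration of what "jointly minimizing" two such losses can look like, the sketch below combines an additive margin softmax loss with a Gaussian mixture loss whose per-class covariance is a diagonal matrix of standard deviations. This page does not show the patent's formulas, so the loss expressions follow commonly published forms and are assumptions; the function names, `scale`, `margin`, `alpha`, `lkd_weight`, and `gm_weight` are hypothetical and chosen only for this example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def am_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Additive margin softmax: subtract a margin from the true-class cosine.

    Assumed form; scale and margin values are illustrative.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, unit(w)))
        logits.append(scale * (cos - margin if k == label else cos))
    return log_sum_exp(logits) - logits[label]


def gaussian_mixture_loss(embedding, means, stds, label, alpha=0.01, lkd_weight=0.1):
    """Large margin Gaussian mixture loss with diagonal (non-unit) covariance.

    Assumed form: a classification term over per-class Gaussian scores plus a
    likelihood term that pulls the embedding toward its class mean.
    """
    scores = []
    dist_true = log_det_true = 0.0
    for k, (mu, sd) in enumerate(zip(means, stds)):
        # Half squared Mahalanobis distance under a diagonal covariance.
        dist = 0.5 * sum(((x - m) / s) ** 2 for x, m, s in zip(embedding, mu, sd))
        # 0.5 * log|Sigma| equals the sum of log standard deviations.
        log_det_half = sum(math.log(s) for s in sd)
        if k == label:
            dist_true, log_det_true = dist, log_det_half
            dist *= 1.0 + alpha  # margin enlarges the true-class distance
        scores.append(-dist - log_det_half)
    classification = log_sum_exp(scores) - scores[label]
    likelihood = dist_true + log_det_true
    return classification + lkd_weight * likelihood


def joint_loss(embedding, class_weights, means, stds, label, gm_weight=1.0):
    """Weighted sum of both losses; gm_weight is a hypothetical balance factor."""
    return (am_softmax_loss(embedding, class_weights, label)
            + gm_weight * gaussian_mixture_loss(embedding, means, stds, label))
```

In a real training setup the embedding would come from the network and both terms would be minimized by gradient descent; this sketch only evaluates the combined objective for a single example.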

First claim

Opening claim text (preview).

The invention claimed is:
1. A computer-implemented method for determining voice characteristics, comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
2. The method of claim 1, wherein: training the model by minimizing the first loss function optimizes speaker classification; and training the model by minimizing the second loss function optimizes speaker clustering.
3. The method of claim 1, wherein: the non-unit multi-variant covariance matrix comprises a standard deviation diagonal matrix.
4. The method of claim 1, wherein: the Gaussian mixture loss function with non-unit multi-variant covariance matrix comprises a large margin Gaussian mixture loss function.
5. The method of claim 1, wherein: the non-sampling-based loss function comprises an additive margin softmax loss function.
6. The method of claim 1, wherein: the first loss function acts as a regularizer to the second loss function; and the second loss function acts as a regularizer to the first loss function.
7. The method of claim 1, further comprising: obtaining the one or more voice characteristics for each of one or more speakers; obtaining the one or more voice characteristics for a candidate user; comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the each of the one or more speakers; and identifying whether the candidate user is any of the one or more speakers based at least on the comparison.
8. The method of claim 1, further comprising: obtaining the one or more voice characteristics for a candidate user; comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the speaker; and verifying whether the candidate user is the speaker based at least on the comparison.
9. The method of claim 7, wherein: comparing the one or more voice characteristics of the candidate user with the one or more voice characteristics of the each of the one or more speakers comprises: comparing, with a threshold, a distance between a vector representing the one or more voice characteristics of the candidate user and a different vector representing the one or more voice characteristics of the each of the one or more speakers.
10. The method of claim 1, wherein: obtaining the speech data of the speaker comprises obtaining a spectrogram corresponding to the speech data, and obtaining a plurality of feature vectors corresponding to the spectrogram; and inputting the speech data into the trained model comprises inputting the plurality of feature vectors into the trained model.
11. The method of claim 10, wherein the trained model comprises: a first convolution layer configured to receive the plurality of feature vectors as an input of the first convolution layer; a first pooling layer configured to receive an output of the first convolution layer as an input of the first pooling layer; a plurality of residual network layers configured to receive an output of the first pooling layer as an input of the plurality of residual network layers; a second convolution layer configured to receive an output of the plurality of residual network layers as an input of the second convolution layer; a second pooling layer configured to receive an output of the second convolution layer as an input of the second pooling layer; and an embedding layer configured to receive an output of the second pooling layer as an input of the embedding layer and output a vector representing the one or more voice characteristics of the speaker.
12. The method of claim 11, wherein: minimizing the first loss function comprises, for at least the embedding layer, minimizing a non-sampling-based loss function to optimize between-class classification error; and minimizing the second loss function comprises, for at least the embedding layer, minimizing a Gaussian mixture loss function with non-unit multi-variant covariance matrix to reduce intra-class variation.
13. The method of claim 1, wherein: minimizing the first loss function comprises increasing a margin linearly from zero to a target margin value for annealing.
14. A non-transitory computer-readable storage medium storing instructions executable by one or more processors, wherein execution of the instructions causes the one or more processors to perform operations comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
15. The non-transitory computer-readable storage medium of claim 14, wherein: training the model by minimizing the first loss function optimizes speaker classification; and training the model by minimizing the second loss function optimizes speaker clustering.
16. The non-transitory computer-readable storage medium of claim 14, wherein: the non-unit multi-variant covariance matrix comprises a standard deviation diagonal matrix.
17. The non-transitory computer-readable storage medium of claim 14, wherein: the Gaussian mixture loss function with non-unit multi-variant covariance matrix comprises a large margin Gaussian mixture loss function.
18. The non-transitory computer-readable storage medium of claim 14, wherein: the non-sampling-based loss function comprises an additive margin softmax loss function.
19. The non-transitory computer-readable storage medium of claim 14, wherein: the first loss function acts as a regularizer to the second loss function; and the second loss function acts as a regularizer to the first loss function.
20. A system comprising one or more processors and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instructions executable by the one or more processors to cause the system to perform operations comprising: obtaining speech data of a speaker; inputting the speech data into a model trained at least by jointly minimizing a first loss function and a second loss function, wherein the first loss function comprises a non-sampling-based loss function and the second loss function comprises a Gaussian mixture loss function with non-unit multi-variant covariance matrix; and obtaining from the trained model one or more voice characteristics of the speaker.
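The comparison steps in claims 7 to 9 and the margin schedule in claim 13 can be sketched in a few lines. The claims only require comparing "a distance" with "a threshold" and raising the margin linearly from zero to a target value; the choice of cosine distance, the threshold of 0.4, and all function names below are assumptions made for illustration, not details taken from the patent.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def verify_speaker(candidate, enrolled, threshold=0.4):
    """Verification: accept if the candidate is close enough to one speaker."""
    return cosine_distance(candidate, enrolled) <= threshold


def identify_speaker(candidate, enrolled_by_name, threshold=0.4):
    """Identification: return the closest enrolled speaker, or None if every
    enrolled speaker is farther away than the threshold."""
    best_name, best_dist = None, float("inf")
    for name, vector in enrolled_by_name.items():
        dist = cosine_distance(candidate, vector)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


def annealed_margin(step, warmup_steps, target_margin):
    """Margin that grows linearly from zero to target_margin, then stays flat."""
    if warmup_steps <= 0:
        return target_margin
    return target_margin * min(1.0, step / warmup_steps)
```

In practice the threshold would be tuned on held-out data to balance false accepts against false rejects, and the annealed margin would be passed to the first loss function at each training step.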

Assignees

Alipay Hangzhou Inf Tech Co Ltd

Inventors

Classifications

G10L17/18 (Primary)
Artificial neural networks; Connectionist approaches · CPC title
G10L17/04 (Primary)
Training, enrolment or model building · CPC title
G10L17/02
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
G10L17/20
Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title
G10L17/06
Decision making techniques; Pattern matching strategies · CPC title

Patent family

Related publications grouped by family.

View patent family 69525955
