Speaker Identification Accuracy

US2023015169A1 · US · A1

Patent metadata
Publication number: US-2023015169-A1
Application number: US-202217933164-A
Country: US
Kind code: A1
Filing date: Sep 19, 2022
Priority date: Oct 15, 2020
Publication date: Jan 19, 2023
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of generating an accurate speaker representation for an audio sample includes receiving a first audio sample from a first speaker and a second audio sample from a second speaker. The method includes dividing a respective audio sample into a plurality of audio slices. The method also includes, based on the plurality of slices, generating a set of candidate acoustic embeddings where each candidate acoustic embedding includes a vector representation of acoustic features. The method further includes removing a subset of the candidate acoustic embeddings from the set of candidate acoustic embeddings. The method additionally includes generating an aggregate acoustic embedding from the remaining candidate acoustic embeddings in the set of candidate acoustic embeddings after removing the subset of the candidate acoustic embeddings.
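
The steps in the abstract (slice, embed, remove a subset, aggregate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_embedding` is a hypothetical stand-in for a real speaker encoder, and the removal rule (discard the candidates farthest from the centroid) is an assumption, since the abstract does not say how the removed subset is chosen.

```python
import math

def slice_audio(samples, slice_len):
    """Split a list of samples into non-overlapping, full-length slices."""
    return [samples[i:i + slice_len]
            for i in range(0, len(samples) - slice_len + 1, slice_len)]

def toy_embedding(audio_slice):
    """Hypothetical stand-in for a speaker encoder: [mean, RMS, peak]."""
    n = len(audio_slice)
    mean = sum(audio_slice) / n
    rms = math.sqrt(sum(x * x for x in audio_slice) / n)
    peak = max(abs(x) for x in audio_slice)
    return [mean, rms, peak]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aggregate_embedding(samples, slice_len, drop=1):
    """Embed each slice, drop outlier candidates, and average the rest."""
    candidates = [toy_embedding(s) for s in slice_audio(samples, slice_len)]
    if not 0 <= drop < len(candidates):
        raise ValueError("drop must leave at least one candidate embedding")
    centroid = mean_vector(candidates)
    # Assumed removal rule: discard the `drop` candidates farthest from the centroid.
    ranked = sorted(candidates, key=lambda v: math.dist(v, centroid))
    kept = ranked[:len(ranked) - drop]
    return mean_vector(kept)

# Three quiet slices and one loud outlier slice; the outlier is removed.
samples = [0.1] * 12 + [5.0] * 4
aggregate = aggregate_embedding(samples, slice_len=4, drop=1)
```

With this input the loud slice is the candidate farthest from the centroid, so the aggregate is computed from the three quiet slices only.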

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising: receiving a first audio sample from a first speaker and a second audio sample from a second speaker; for each audio sample of the first audio sample and the second audio sample, generating a first respective sample variation by performing a first spectrogram augmentation technique on a frequency representation of the respective audio sample; generating a first score based on a comparison of the first respective sample variations; and generating, using a model, a prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the first score.

2. The method of claim 1, wherein the operations further comprise: for each audio sample of the first audio sample and the second audio sample, generating a second respective sample variation by performing a second spectrogram augmentation technique on the frequency representation of the respective audio sample; generating a second score based on a comparison of the second respective sample variations; and generating, using the model, a second prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the second score.

3. The method of claim 2, wherein the second prediction is based on the first score and the second score.

4. The method of claim 2, wherein the first spectrogram augmentation technique and the second spectrogram augmentation technique are different.

5. The method of claim 2, wherein the operations further comprise: for each audio sample of the first audio sample and the second audio sample, generating a third respective sample variation by performing a third spectrogram augmentation technique on the frequency representation of the respective audio sample; generating a third score based on a comparison of the third respective sample variations; and generating, using the model, a third prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the third score.

6. The method of claim 5, wherein the third prediction is based on the first score, the second score, and the third score.

7. The method of claim 5, wherein the first spectrogram augmentation technique is different than the second spectrogram augmentation technique and the third spectrogram augmentation technique is different than the first spectrogram augmentation technique and the second spectrogram augmentation technique.

8. The method of claim 1, wherein the first spectrogram augmentation technique comprises one of: a time masking technique; a frequency masking technique; or a time warping technique.

9. The method of claim 1, wherein the model comprises a Long Short-Term Memory (LSTM) neural network.

10. The method of claim 1, wherein the operations further comprise training the model by iteratively updating current values of one or more parameters of the model over a series of training cycles.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving a first audio sample from a first speaker and a second audio sample from a second speaker; for each audio sample of the first audio sample and the second audio sample, generating a first respective sample variation by performing a first spectrogram augmentation technique on a frequency representation of the respective audio sample; generating a first score based on a comparison of the first respective sample variations; and generating, using a model, a prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the first score.

12. The system of claim 11, wherein the operations further comprise: for each audio sample of the first audio sample and the second audio sample, generating a second respective sample variation by performing a second spectrogram augmentation technique on the frequency representation of the respective audio sample; generating a second score based on a comparison of the second respective sample variations; and generating, using the model, a second prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the second score.

13. The system of claim 12, wherein the second prediction is based on the first score and the second score.

14. The system of claim 12, wherein the first spectrogram augmentation technique and the second spectrogram augmentation technique are different.

15. The system of claim 12, wherein the operations further comprise: for each audio sample of the first audio sample and the second audio sample, generating a third respective sample variation by performing a third spectrogram augmentation technique on the frequency representation of the respective audio sample; generating a third score based on a comparison of the third respective sample variations; and generating, using the model, a third prediction indicating whether the first speaker and the second speaker are the same speaker or different speakers based on the third score.

16. The system of claim 15, wherein the third prediction is based on the first score, the second score, and the third score.

17. The system of claim 15, wherein the first spectrogram augmentation technique is different than the second spectrogram augmentation technique and the third spectrogram augmentation technique is different than the first spectrogram augmentation technique and the second spectrogram augmentation technique.

18. The system of claim 11, wherein the first spectrogram augmentation technique comprises one of: a time masking technique; a frequency masking technique; or a time warping technique.

19. The system of claim 11, wherein the model comprises a Long Short-Term Memory (LSTM) neural network.

20. The system of claim 11, wherein the operations further comprise training the model by iteratively updating current values of one or more parameters of the model over a series of training cycles.
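
The flow recited in claims 1 to 3 and 8 (augment each sample's spectrogram, score the pair of variations, predict from the scores) might look roughly like the sketch below. Several pieces are assumptions made for illustration: spectrograms are plain nested lists of time frames by frequency bins, the comparison is cosine similarity, and a fixed threshold on the mean score stands in for the trained model (the claims mention an LSTM, which is not implemented here). All function names and the `threshold` value are hypothetical.

```python
import math

def time_mask(spec, start, width):
    """Zero out `width` consecutive time frames beginning at `start`."""
    return [[0.0] * len(frame) if start <= t < start + width else list(frame)
            for t, frame in enumerate(spec)]

def frequency_mask(spec, start, width):
    """Zero out `width` consecutive frequency bins beginning at `start`."""
    return [[0.0 if start <= f < start + width else v
             for f, v in enumerate(frame)]
            for frame in spec]

def cosine_score(spec_a, spec_b):
    """Cosine similarity of two flattened spectrograms of equal shape."""
    a = [v for frame in spec_a for v in frame]
    b = [v for frame in spec_b for v in frame]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_same_speaker(spec_a, spec_b, augmentations, threshold=0.8):
    """Return (prediction, scores), with one score per augmentation technique.

    A threshold on the mean score is a placeholder for the claimed model.
    """
    scores = [cosine_score(aug(spec_a), aug(spec_b)) for aug in augmentations]
    return sum(scores) / len(scores) >= threshold, scores

# Two different augmentation techniques, each applied to both samples.
augmentations = [
    lambda s: time_mask(s, start=0, width=1),
    lambda s: frequency_mask(s, start=0, width=1),
]
spec_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
spec_b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # same content as spec_a
spec_c = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]   # different content

same, same_scores = predict_same_speaker(spec_a, spec_b, augmentations)
diff, diff_scores = predict_same_speaker(spec_a, spec_c, augmentations)
```

Here `same` comes out true and `diff` false. Averaging the per-technique scores mirrors claim 3, where the second prediction depends on both the first and second scores; a trained model could combine them in other ways.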

Assignees

Google LLC

Inventors

Classifications

  • Artificial neural networks; Connectionist approaches · CPC title

  • Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title

  • G10L17/06 · Primary

    Decision making techniques; Pattern matching strategies · CPC title

  • G10L17/02 · Primary

    Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023015169A1 cover?
A method of generating an accurate speaker representation for an audio sample includes receiving a first audio sample from a first speaker and a second audio sample from a second speaker. The method includes dividing a respective audio sample into a plurality of audio slices. The method also includes, based on the plurality of slices, generating a set of candidate acoustic embeddings where each…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/06. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 19, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
This page lists 8 related publications (citations in our corpus, or other publications sharing the same primary CPC).