What technology area does this patent fall under?

The primary CPC classification is G06N20/20 (Ensemble learning). Mapped technology areas include Physics.

When was this patent published?

Publication date: Jan 7, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Speaker recognition with quality indicators

US12190905B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12190905-B2
Application number	US-202117408281-A
Country	US
Kind code	B2
Filing date	Aug 20, 2021
Priority date	Aug 21, 2020
Publication date	Jan 7, 2025
Grant date	Jan 7, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide for a machine-learning architecture for modeling quality measures for enrollment signals. Modeling these enrollment signals enables the machine-learning architecture to identify deviations from expected or ideal enrollment signal in future test phase calls. These differences can be used to generate quality measures for the various audio descriptors or characteristics of audio signals. The quality measures can then be fused at the score-level with the speaker recognition's embedding comparisons for verifying the speaker. Fusing the quality measures with the similarity scoring essentially calibrates the speaker recognition's outputs based on the realities of what is actually expected for the enrolled caller and what was actually observed for the current inbound caller.
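The score-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract says only that quality measures are fused with the speaker-similarity score, so the logistic form, the weights, the bias, the default threshold, and the function names `cosine_similarity`, `fuse_score`, and `verify` are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def fuse_score(
    similarity: float,
    quality_measures: Sequence[float],
    sim_weight: float = 4.0,                           # hypothetical weight
    quality_weights: Optional[Sequence[float]] = None,  # hypothetical weights
    bias: float = -3.0,                                # hypothetical bias
) -> float:
    """Hypothetical fusion: a logistic combination of the first similarity
    score and the quality measures, returning a value in (0, 1)."""
    if quality_weights is None:
        quality_weights = [1.0] * len(quality_measures)
    if len(quality_weights) != len(quality_measures):
        raise ValueError("one weight is required per quality measure")
    z = bias + sim_weight * similarity
    z += sum(w * q for w, q in zip(quality_weights, quality_measures))
    return 1.0 / (1.0 + math.exp(-z))


def verify(
    similarity: float,
    quality_measures: Sequence[float],
    threshold: float = 0.5,  # hypothetical verification threshold
) -> bool:
    """Accept the speaker when the fused score reaches the threshold."""
    return fuse_score(similarity, quality_measures) >= threshold


if __name__ == "__main__":
    enrolled_voiceprint = [0.2, 0.7, 0.1]   # made-up embedding values
    inbound_embedding = [0.25, 0.65, 0.05]  # made-up embedding values
    first_score = cosine_similarity(inbound_embedding, enrolled_voiceprint)
    print(verify(first_score, quality_measures=[0.9, 0.8]))
```

In this sketch, the same raw similarity yields a lower fused score when the quality measures are low, which is one simple way to realize the calibration effect the abstract describes. In practice such weights would be learned from data rather than fixed by hand.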

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: extracting from an inbound audio signal for an inbound speaker, by a computer, a feature vector for one or more acoustic features; generating, by the computer, one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; extracting, by the computer, an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal; generating, by the computer, a first similarity score for the inbound speaker based upon the inbound speaker embedding and an enrolled voiceprint for an enrolled speaker, by applying the second machine-learning architecture; generating, by the computer, a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verifying, by the computer, the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
2. The method according to claim 1, wherein generating the one or more quality measures for the inbound audio signal includes generating, by the computer, the overall quality measure based upon each of the quality measures.
3. The method according to claim 1, wherein generating the one or more quality measures includes: generating, by the computer, a plurality of speech segments from the inbound audio signal; and determining, by the computer, a total duration of speech based upon the plurality of speech segments.
4. The method according to claim 1, wherein the first machine-learning architecture generates a quality embedding corresponding to each respective quality descriptor.
5. The method according to claim 4, wherein the quality descriptor includes at least one of an audio event descriptor, a codec descriptor, a microphone type descriptor, a device type, and a network type.
6. The method according to claim 1, wherein generating a quality measure includes determining, by the computer, a similarity between the inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal, wherein the quality measure is based upon the similarity.
7. The method according to claim 1, further comprising: receiving, by the computer, one or more clean enrollment audio signals for the enrolled speaker; generating, by the computer, one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extracting, by the computer, one or more enrolled quality embeddings for the enrolled speaker by applying the first machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.
8. A system comprising: a database configured store an enrolled voiceprint for an enrolled speaker; and a server comprising a processor configured to: extract from an inbound audio signal for an inbound speaker a feature vector for one or more acoustic features; generate one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; extract an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal; generate a first similarity score for the inbound speaker based upon the inbound speaker embedding and the enrolled voiceprint for the enrolled speaker, by applying the second machine-learning architecture; generate a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verify the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
9. The system according to claim 8, wherein when generating the one or more quality measures for the inbound audio signal, the server is further configured to generate the overall quality measure based upon each of the quality measures.
10. The system according to claim 8, wherein when generating the one or more quality measures, the server is further configured to: generate a plurality of speech segments from the inbound audio signal; and determine a total duration of speech based upon the plurality of speech segments.
11. The system according to claim 8, wherein when generating a quality measure the server is configured to: determine a similarity between the inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal, wherein the quality measure is based upon the similarity.
12. The system according to claim 8, wherein the server is further configured to: receive one or more clean enrollment audio signals for the enrolled speaker; generate one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extract one or more enrolled quality embeddings for the enrolled speaker by applying the first machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.
13. A non-transitory computer-readable medium comprising a non-transitory storage memory configured to store machine-readable instructions that when executed by a processor instruct the processor to: extract from an inbound audio signal for an inbound speaker a feature vector for one or more acoustic features; generate one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features; extract an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; generate a first similarity score for the inbound speaker based upon the inbound speaker embedding and an enrolled voiceprint for an enrolled speaker, by applying the second machine-learning architecture; generate a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verify the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
14. …
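The quality-measure steps in claims 1 to 3 can be sketched roughly as follows. This is an illustrative reading, not the claimed implementation: the claims do not name a similarity function or an aggregation rule, so cosine similarity, the arithmetic mean, the `(start, end)` segment representation, and the names `total_speech_duration`, `quality_measures`, and `overall_quality` are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def total_speech_duration(segments: List[Tuple[float, float]]) -> float:
    """Sum the lengths of (start, end) speech segments, in seconds.

    Assumes the segments do not overlap; malformed segments whose end
    precedes their start contribute zero.
    """
    return sum(max(0.0, end - start) for start, end in segments)


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def quality_measures(
    expected: Dict[str, Sequence[float]],
    observed: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """One measure per quality descriptor (e.g. codec, device type):
    the similarity between the expected (enrollment) quality embedding
    and the one observed on the inbound call. Cosine is an assumption."""
    return {
        name: _cosine(expected[name], observed[name])
        for name in expected
        if name in observed
    }


def overall_quality(measures: Dict[str, float]) -> float:
    """Hypothetical aggregation: the mean of the individual measures."""
    if not measures:
        return 0.0
    return sum(measures.values()) / len(measures)


if __name__ == "__main__":
    # Made-up quality embeddings keyed by descriptor name.
    expected = {"codec": [1.0, 0.0], "device_type": [0.0, 1.0]}
    observed = {"codec": [1.0, 0.0], "device_type": [1.0, 0.0]}
    measures = quality_measures(expected, observed)
    print(measures, overall_quality(measures))
    print(total_speech_duration([(0.0, 1.5), (2.0, 3.0)]))
```

The descriptor names used in the example follow the types listed in claim 5; the embedding values are invented.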

Assignees

Pindrop Security Inc

Inventors

Classifications

G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/09
Supervised learning · CPC title
G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title
G06N20/20 (Primary)
Ensemble learning · CPC title
G06N3/045
Combinations of networks · CPC title

Patent family

Related publications grouped by family.

View patent family 80269005

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12190905B2 cover?: Embodiments described herein provide for a machine-learning architecture for modeling quality measures for enrollment signals. Modeling these enrollment signals enables the machine-learning architecture to identify deviations from expected or ideal enrollment signal in future test phase calls. These differences can be used to generate quality measures for the various audio descriptors or charac…
Who is the assignee on this patent?: Pindrop Security Inc