System for reducing transaction failure
US-12175472-B2 · Dec 24, 2024 · US
US2025124945A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025124945-A1 |
| Application number | US-202418989690-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 20, 2024 |
| Priority date | Aug 21, 2020 |
| Publication date | Apr 17, 2025 |
| Grant date | — |
Embodiments described herein provide for a machine-learning architecture for modeling quality measures for enrollment signals. Modeling these enrollment signals enables the machine-learning architecture to identify deviations from the expected or ideal enrollment signal in future test-phase calls. These deviations can be used to generate quality measures for the various audio descriptors or characteristics of audio signals. The quality measures can then be fused at the score level with the speaker-recognition embedding comparisons to verify the speaker. Fusing the quality measures with the similarity scoring effectively calibrates the speaker-recognition outputs against what is actually expected for the enrolled caller and what was actually observed for the current inbound caller.
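The score-level fusion described in the abstract (and in claim 1) can be illustrated with a minimal Python sketch. It assumes cosine similarity between embeddings as the initial score (claim 2 instead attributes that score to a second machine-learning architecture), an exponential-decay similarity between an expected and an observed quality descriptor, and a linear fusion rule with hand-picked weights. The function names, the descriptor keys `snr_db` and `speech_s`, and every number are hypothetical. In practice, fusion weights would typically be fitted on labeled trials (for example with logistic regression), but the source does not specify a fusion rule.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings, used here as the initial score."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def quality_measure(expected: float, observed: float, scale: float) -> float:
    """Similarity in (0, 1] between an expected and an observed quality descriptor.

    Hypothetical choice: 1.0 when the call matches the enrollment expectation,
    decaying exponentially with the absolute deviation (in units of `scale`).
    """
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return math.exp(-abs(observed - expected) / scale)


def fuse(initial_score: float, qualities: Dict[str, float],
         weights: Dict[str, float], score_weight: float = 1.0,
         bias: float = 0.0) -> float:
    """Linear score-level fusion of the initial score with the quality measures."""
    return bias + score_weight * initial_score + sum(
        weights[name] * q for name, q in qualities.items())


def verify(final_score: float, threshold: float) -> bool:
    """Accept the inbound speaker when the fused score meets the threshold."""
    return final_score >= threshold


# Hypothetical example: enrollment calls averaged ~20 dB SNR and ~12 s of speech.
initial = cosine_similarity([0.6, 0.8], [0.8, 0.6])  # approximately 0.96
qualities = {
    "snr_db": quality_measure(expected=20.0, observed=10.0, scale=5.0),
    "speech_s": quality_measure(expected=12.0, observed=12.0, scale=4.0),
}
final = fuse(initial, qualities, weights={"snr_db": 0.5, "speech_s": 0.5}, bias=-1.0)
print(round(final, 4), verify(final, threshold=0.5))
```

In this example the inbound call is 10 dB noisier than the enrollment expectation, so its SNR quality measure is low and the fused score ends up below what a call matching the enrollment conditions would receive. That shift is the calibration effect the abstract describes.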
What is claimed is:

1. A computer-implemented method comprising: extracting from an inbound audio signal for an inbound speaker, by a computer, a feature vector for one or more acoustic features; generating, by the computer, one or more quality measures and an overall quality measure for the inbound audio signal, by executing a machine-learning architecture using as input the feature vector for the one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for call audio of the inbound audio signal; generating, by the computer, a final similarity score for verifying the inbound speaker by combining an initial similarity score with the one or more quality measures or the overall quality measure; and verifying, by the computer, the inbound speaker as an enrolled speaker based upon comparing the final similarity score against a verification threshold.

2. The method according to claim 1, further comprising generating, by the computer, the initial similarity score by executing a second machine-learning architecture using as input an inbound speaker embedding and an enrolled voiceprint for the enrolled speaker.

3. The method according to claim 2, further comprising generating, by the computer, the enrolled voiceprint by combining a plurality of enrollee embeddings.

4. The method according to claim 3, further comprising generating, by the computer, the plurality of enrollee embeddings by executing the second machine-learning architecture using as input a plurality of enrollee audio samples, wherein the inbound speaker embedding is generated by executing the second machine-learning architecture using as input the feature vector for the one or more acoustic features of the inbound audio signal.

5. The method according to claim 1, wherein generating the one or more quality measures for the inbound audio signal includes generating, by the computer, the overall quality measure based upon each of the quality measures.

6. The method according to claim 1, wherein generating the one or more quality measures includes: generating, by the computer, a plurality of speech segments from the inbound audio signal; and determining, by the computer, a total duration of speech based upon the plurality of speech segments.

7. The method according to claim 1, wherein generating a quality measure includes determining, by the computer, a level of similarity between an inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal.

8. The method according to claim 1, further comprising: receiving, by the computer, one or more clean enrollment audio signals for the enrolled speaker; generating, by the computer, one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extracting, by the computer, one or more enrolled quality embeddings for the enrolled speaker by applying a second machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.

9. The method according to claim 8, further comprising enabling, by the computer, classification layers and loss layers of the second machine-learning architecture in a training phase of the second machine-learning architecture.

10. The method according to claim 8, further comprising disabling, by the computer, classification layers and loss layers of the second machine-learning architecture in a deployment phase of the second machine-learning architecture.

11. A system comprising: a database configured to store an enrolled voiceprint for an enrolled speaker; and a server comprising a processor configured to: generate one or more quality measures and an overall quality measure for an inbound audio signal, by executing a machine-learning architecture using as input a feature vector for one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for call audio of the inbound audio signal; generate a final similarity score for verifying the inbound speaker by combining an initial similarity score with the one or more quality measures or the overall quality measure; and verify the inbound speaker as an enrolled speaker based upon comparing the final similarity score against a verification threshold.

12. The system according to claim 11, wherein the processor is further configured to generate the initial similarity score by executing a second machine-learning architecture using as input an inbound speaker embedding and an enrolled voiceprint for the enrolled speaker.

13. The system according to claim 12, wherein the processor is further configured to generate the enrolled voiceprint by combining a plurality of enrollee embeddings.

14. The system according to claim 13, wherein the processor is further configured to generate the plurality of enrollee embeddings by executing the second machine-learning architecture using as input a plurality of enrollee audio samples, wherein the inbound speaker embedding is generated by executing the second machine-learning architecture using as input the feature vector for the one or more acoustic features of the inbound audio signal.

15. The system according to claim 11, wherein the processor is further configured to generate the one or more quality measures for the inbound audio signal by generating the overall quality measure based upon each of the quality measures.
16. The system according to claim 11, wherein the processor is further configured to generate the one or more quality measures by: generating a plurality of speech segments from the inbound audio signal; and determining a total duration of speech based upon the plurality of speech segments.

17. The system according to claim 11, wherein the processor is further configured to generate a quality measure by determining a level of similarity between an inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal.

18. The system according to claim 11, wherein the processor is further configured to: receive one or more clean enrollment audio signals for the enrolled speaker; generate one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extract one or more enrolled quality embeddings for the enrolled speaker by applying a second machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.

19. The system according to claim 18, wherein the processor is further configured to enable classification layers and loss layers of the second machine-learning architecture in a training phase of the second machine-learning architecture.

20. The system according to claim 19, wherein the processor is further configured to disable the classification layers and loss layers of the second machine-learning architecture in a deployment phase of the second machine-learning architecture.
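Several dependent claims name a step without fixing how it is computed: combining enrollee embeddings into a voiceprint (claims 3 and 13), deriving the overall quality measure from the individual measures (claims 5 and 15), totaling speech duration over speech segments (claims 6 and 16), and producing degraded copies of clean enrollment audio according to a type of degradation (claims 8 and 18). The Python sketch below fills in each step with one plausible choice, and every choice is an assumption rather than the patent's method: an element-wise mean for the voiceprint, a weighted mean for overall quality, a merged-interval sum for speech duration, and additive Gaussian noise at a target SNR or hard clipping as the degradation types. All function names are hypothetical. The sketch does not model the second machine-learning architecture or the enabling and disabling of its classification and loss layers (claims 9, 10, 19 and 20).

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds) of one speech segment


def combine_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Enrolled voiceprint as the element-wise mean of enrollee embeddings (assumed)."""
    if not embeddings:
        raise ValueError("need at least one enrollee embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("embeddings must share one dimension")
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def total_speech_duration(segments: Sequence[Segment]) -> float:
    """Total speech time in seconds; overlapping segments are merged, not double-counted."""
    total = 0.0
    current: Optional[Segment] = None
    for start, end in sorted(segments):
        if end < start:
            raise ValueError("segment ends before it starts")
        if current is None or start > current[1]:
            if current is not None:
                total += current[1] - current[0]
            current = (start, end)
        else:
            current = (current[0], max(current[1], end))
    if current is not None:
        total += current[1] - current[0]
    return total


def overall_quality(measures: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Overall quality as a weighted mean of the individual quality measures (assumed)."""
    if not measures:
        raise ValueError("need at least one quality measure")
    w = list(weights) if weights is not None else [1.0] * len(measures)
    if len(w) != len(measures) or sum(w) <= 0.0:
        raise ValueError("weights must match measures and sum to a positive value")
    return sum(wi * m for wi, m in zip(w, measures)) / sum(w)


def _power(x: Sequence[float]) -> float:
    """Mean power of a non-empty sample sequence."""
    return sum(v * v for v in x) / len(x)


def degrade(signal: Sequence[float], kind: str, rng: random.Random,
            snr_db: float = 10.0, clip_fraction: float = 0.5) -> List[float]:
    """Return a degraded copy of a clean enrollment signal (hypothetical degradation types)."""
    if not signal:
        raise ValueError("signal must be non-empty")
    if kind == "noise":
        # Additive Gaussian noise rescaled so the result hits the requested SNR exactly.
        noise = [rng.gauss(0.0, 1.0) for _ in signal]
        target_noise_power = _power(signal) / (10.0 ** (snr_db / 10.0))
        scale = math.sqrt(target_noise_power / _power(noise))
        return [s + scale * n for s, n in zip(signal, noise)]
    if kind == "clipping":
        # Hard clipping at a fraction of the clean signal's peak amplitude.
        limit = clip_fraction * max(abs(s) for s in signal)
        return [max(-limit, min(limit, s)) for s in signal]
    raise ValueError(f"unknown degradation type: {kind}")
```

Rescaling the noise to a target SNR, rather than drawing it with a fixed variance, keeps the degradation level controlled across enrollment recordings of different loudness, which matters if clean and degraded pairs are meant to teach a model what a given degradation looks like.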
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Ensemble learning · CPC title
Combinations of networks · CPC title