Speech detection and speech recognition
US-10923111-B1 · Feb 16, 2021 · US
US12190905B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12190905-B2 |
| Application number | US-202117408281-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 20, 2021 |
| Priority date | Aug 21, 2020 |
| Publication date | Jan 7, 2025 |
| Grant date | Jan 7, 2025 |
Official abstract text for this publication.
Embodiments described herein provide for a machine-learning architecture for modeling quality measures for enrollment signals. Modeling these enrollment signals enables the machine-learning architecture to identify deviations from expected or ideal enrollment signal in future test phase calls. These differences can be used to generate quality measures for the various audio descriptors or characteristics of audio signals. The quality measures can then be fused at the score-level with the speaker recognition's embedding comparisons for verifying the speaker. Fusing the quality measures with the similarity scoring essentially calibrates the speaker recognition's outputs based on the realities of what is actually expected for the enrolled caller and what was actually observed for the current inbound caller.
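The score-level fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion function, so the logistic-regression-style weighted sum below is an assumption, as are the function names (`cosine_similarity`, `fuse_scores`, `verify`), the weights, the bias, and the threshold. Cosine similarity is likewise an assumed stand-in for the speaker recognition's embedding comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_scores(speaker_score, quality_measures, weights, bias=0.0):
    """Fuse a speaker similarity score with quality measures.

    Assumption: a logistic-regression-style fusion. weights[0] scales the
    speaker score; weights[1:] scale the quality measures, in order.
    """
    if len(weights) != 1 + len(quality_measures):
        raise ValueError("need one weight per input score")
    z = bias + weights[0] * speaker_score
    for w, q in zip(weights[1:], quality_measures):
        z += w * q
    return 1.0 / (1.0 + math.exp(-z))


def verify(inbound_embedding, enrolled_voiceprint, quality_measures,
           weights, bias, threshold):
    """Return (accepted, first_score, second_score) for an inbound caller."""
    first_score = cosine_similarity(inbound_embedding, enrolled_voiceprint)
    second_score = fuse_scores(first_score, quality_measures, weights, bias)
    return second_score >= threshold, first_score, second_score


if __name__ == "__main__":
    # Hypothetical values: two quality measures in [0, 1], hand-picked weights.
    accepted, s1, s2 = verify(
        inbound_embedding=[0.2, 0.9, 0.4],
        enrolled_voiceprint=[0.25, 0.85, 0.35],
        quality_measures=[0.9, 0.8],
        weights=[4.0, 1.0, 1.0],
        bias=-3.0,
        threshold=0.5,
    )
    print(accepted, round(s1, 3), round(s2, 3))
```

With positive quality weights, the same embedding similarity produces a lower fused score when the observed call quality departs from what was expected for the enrolled caller, which is the calibration effect the abstract describes. In practice the weights would be learned rather than set by hand.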
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: extracting from an inbound audio signal for an inbound speaker, by a computer, a feature vector for one or more acoustic features; generating, by the computer, one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; extracting, by the computer, an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal; generating, by the computer, a first similarity score for the inbound speaker based upon the inbound speaker embedding and an enrolled voiceprint for an enrolled speaker, by applying the second machine-learning architecture; generating, by the computer, a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verifying, by the computer, the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
2. The method according to claim 1, wherein generating the one or more quality measures for the inbound audio signal includes generating, by the computer, the overall quality measure based upon each of the quality measures.
3. The method according to claim 1, wherein generating the one or more quality measures includes: generating, by the computer, a plurality of speech segments from the inbound audio signal; and determining, by the computer, a total duration of speech based upon the plurality of speech segments.
4. The method according to claim 1, wherein the first machine-learning architecture generates a quality embedding corresponding to each respective quality descriptor.
5. The method according to claim 4, wherein the quality descriptor includes at least one of an audio event descriptor, a codec descriptor, a microphone type descriptor, a device type, and a network type.
6. The method according to claim 1, wherein generating a quality measure includes determining, by the computer, a similarity between the inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal, wherein the quality measure is based upon the similarity.
7. The method according to claim 1, further comprising: receiving, by the computer, one or more clean enrollment audio signals for the enrolled speaker; generating, by the computer, one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extracting, by the computer, one or more enrolled quality embeddings for the enrolled speaker by applying the first machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.
8. A system comprising: a database configured to store an enrolled voiceprint for an enrolled speaker; and a server comprising a processor configured to: extract from an inbound audio signal for an inbound speaker a feature vector for one or more acoustic features; generate one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; extract an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal; generate a first similarity score for the inbound speaker based upon the inbound speaker embedding and the enrolled voiceprint for the enrolled speaker, by applying the second machine-learning architecture; generate a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verify the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
9. The system according to claim 8, wherein when generating the one or more quality measures for the inbound audio signal, the server is further configured to generate the overall quality measure based upon each of the quality measures.
10. The system according to claim 8, wherein when generating the one or more quality measures, the server is further configured to: generate a plurality of speech segments from the inbound audio signal; and determine a total duration of speech based upon the plurality of speech segments.
11. The system according to claim 8, wherein when generating a quality measure the server is configured to: determine a similarity between the inbound speaker embedding and a corresponding enrolled speaker embedding for an enrolled audio signal, wherein the quality measure is based upon the similarity.
12. The system according to claim 8, wherein the server is further configured to: receive one or more clean enrollment audio signals for the enrolled speaker; generate one or more degraded enrollment audio signals corresponding to the one or more clean enrollment audio signals according to a type of degradation; and extract one or more enrolled quality embeddings for the enrolled speaker by applying the first machine-learning architecture on the one or more clean enrollment audio signals and the one or more degraded enrollment audio signals.
13. A non-transitory computer-readable medium comprising a non-transitory storage memory configured to store machine-readable instructions that when executed by a processor instruct the processor to: extract from an inbound audio signal for an inbound speaker a feature vector for one or more acoustic features; generate one or more quality measures and an overall quality measure for the inbound audio signal, by applying a first machine-learning architecture to the feature vector for the one or more acoustic features; extract an inbound speaker embedding for the inbound speaker from the one or more acoustic features for the inbound audio signal, by applying a second machine-learning architecture to the feature vector for the one or more acoustic features of the inbound audio signal, the one or more quality measures corresponding to a similarity between one or more expected quality descriptors and one or more quality descriptors for the call audio of the inbound audio signal; generate a first similarity score for the inbound speaker based upon the inbound speaker embedding and an enrolled voiceprint for an enrolled speaker, by applying the second machine-learning architecture; generate a second similarity score for verifying the inbound speaker, the second similarity score generated based upon the one or more quality measures and the first similarity score; and verify the inbound speaker as the enrolled speaker based upon comparing the second similarity score against a verification threshold.
14
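Two of the dependent claims lend themselves to a small illustration: the total duration of speech computed from speech segments (claims 3 and 10) and the generation of degraded enrollment signals from clean ones (claims 7 and 12). The sketch below is only one possible reading. The claims do not say how speech segments are found or which degradation types are used, so the frame-energy detector, the additive white-noise degradation, and all names and parameter values (`speech_segments`, `total_speech_duration`, `degrade_with_noise`, `frame_ms`, `energy_threshold`, `snr_db`) are assumptions.

```python
import math
import random


def speech_segments(samples, sample_rate, frame_ms=20, energy_threshold=0.01):
    """Return (start_s, end_s) segments whose frame energy meets a threshold.

    Assumption: a simple energy detector stands in for whatever voice
    activity detection an actual system would use.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    segments = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy >= energy_threshold and start is None:
            start = t  # a speech segment opens at this frame
        elif energy < energy_threshold and start is not None:
            segments.append((start, t))  # the segment closes at this frame
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


def total_speech_duration(segments):
    """Sum the durations, in seconds, of all speech segments."""
    return sum(end - start for start, end in segments)


def degrade_with_noise(samples, snr_db, seed=0):
    """Return a copy of samples with white Gaussian noise at a target SNR.

    Assumption: additive noise is just one example of a degradation type.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_sigma) for x in samples]


if __name__ == "__main__":
    # Synthetic signal: 0.1 s of silence, 0.2 s of "speech", 0.1 s of silence.
    rate = 1000
    clean = [0.0] * 100 + [0.5] * 200 + [0.0] * 100
    segs = speech_segments(clean, rate)
    print(segs, total_speech_duration(segs))
    degraded = [degrade_with_noise(clean, snr) for snr in (20, 10, 0)]
    print(len(degraded), len(degraded[0]))
```

In a pipeline along the lines of claim 7, the clean signal and each degraded copy would then be passed through the first machine-learning architecture to extract enrolled quality embeddings; that model is outside the scope of this sketch.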
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Ensemble learning · CPC title
Combinations of networks · CPC title