US12592239B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12592239-B2 |
| Application number | US-202418646310-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 25, 2024 |
| Priority date | Apr 28, 2023 |
| Publication date | Mar 31, 2026 |
| Grant date | Mar 31, 2026 |
Official abstract text for this publication.
Disclosed are systems and methods including software processes executed by a server that detect audio-based synthetic speech (“deepfakes”) in a call conversation. Embodiments include systems and methods for detecting fraudulent presentation attacks using multiple functional engines that implement various fraud-detection techniques, to produce calibrated scores and/or fused scores. A computer may, for example, evaluate the audio quality of speech signals within audio signals, where speech signals contain the speech portions having speaker utterances.
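The abstract mentions calibrated and fused scores without fixing a formula. As a minimal sketch only, assuming logistic calibration and a weighted average for fusion (neither is stated in the patent text shown here), the idea could look like the following. The function names, weights, threshold, and the convention that a higher score means "more likely genuine" are all hypothetical.

```python
import math
from typing import Mapping


def calibrate(raw_score: float, scale: float = 1.0, offset: float = 0.0) -> float:
    """Map a raw engine score into (0, 1) with a logistic function.

    Assumption: logistic calibration; scale and offset are illustrative.
    """
    return 1.0 / (1.0 + math.exp(-(scale * raw_score + offset)))


def fuse(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Combine calibrated scores with a weighted average (assumed fusion rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def decide(fused_score: float, risk_threshold: float) -> str:
    """Label the audio by comparing the fused score against a threshold.

    Assumption: higher fused scores indicate genuine speech.
    """
    return "genuine" if fused_score >= risk_threshold else "fraudulent"


if __name__ == "__main__":
    calibrated = {
        "content_verification": calibrate(2.0),
        "passive_liveness": calibrate(-0.5),
    }
    weights = {"content_verification": 1.0, "passive_liveness": 2.0}
    fused = fuse(calibrated, weights)
    print(decide(fused, risk_threshold=0.5))
```

A weighted average keeps each engine's contribution easy to inspect; a trained fusion model would be another reasonable choice under the same interface.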
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for detecting machine-based speech in calls, comprising: obtaining, by a computer, an inbound audio signal comprising a speech signal containing response content as an utterance of the speaker, wherein the response content in the speech signal purportedly matches to challenge content of a verification prompt; extracting, by the computer, a text embedding using a first set of features extracted for text of the challenge content, a spoken content embedding using a second set of features extracted for the speech signal, and a fakeprint using a third set one or more features extracted for one or more fraud artifacts of the speech signal; generating, by the computer, a content verification score based upon a distance between the text embedding and the spoken content embedding; executing, by the computer, a passive liveness detector to generate a passive liveness score for the inbound audio signal, the passive liveness detector having a set of layers of a machine-learning architecture trained to classify and score the input audio signal based upon the fakeprint extracted for the fraud artifacts of the inbound audio signal; generating, by the computer, a fused liveness score based upon the content verification score and the passive liveness score; and identifying, by the computer, the inbound audio signal as genuine or fraudulent based upon comparing the fused liveness score against an overall risk threshold.

2. The method according to claim 1, further comprising: extracting, by the computer, an inbound voiceprint for the speech signal using a fourth set of one or more features extracted for one or more acoustic features of the speech signal of the inbound audio signal; and generating, by the computer, a speaker verification score for the speech signal indicating a speaker recognition likelihood that the speaker is an enrolled user based upon a second distance between the inbound voiceprint and an enrolled voiceprint.

3. The method according to claim 2, wherein the computer generates the fused liveness score further using the speaker verification score.

4. The method according to claim 1, further comprising: generating, by the computer, one or more acoustic parameters corresponding to one or more types of degradation in the speech signal of the inbound audio signal; and generating, by the computer, a speech quality score for the speech signal based upon the one or more acoustic parameters.

5. The method according to claim 4, wherein generating the content verification score includes: calibrating, by the computer, the content verification score based upon the speech quality score.

6. The method according to claim 4, further comprising: determining, by the computer, that the speech quality score for the speech signal fails a speech quality threshold; and transmitting, by the computer, to the user device a request for an improved speech signal for the caller.

7. The method according to claim 1, further comprising: extracting, by the computer, an inbound audioprint using one or more features extracted from the audio signal; generating, by the computer, an audio replay score for the inbound audio signal indicating an audio recording recognition likelihood that the inbound audio signal matches a prior audio signal based upon a distance between the inbound audioprint and a stored audioprint for the prior audio signal.

8. The method according to claim 7, further comprising identifying, by the computer, the inbound audio signal as fraudulent, in response to determining that the audio replay score satisfies a replay detection threshold value.

9. The method according to claim 7, further comprising storing, by the computer, the inbound audioprint into a database as a new stored audioprint.

10. The method according to claim 1, further comprising generating, by the computer, a verification prompt including the challenge content for display at a user interface of the user device associated with the caller.

11. A system for detecting machine-based speech in calls, comprising: a computer having at least one processor, configured to: obtain an inbound audio signal comprising a speech signal containing response content as an utterance of the speaker, wherein the response content in the speech signal purportedly matches to challenge content of a verification prompt; extract a text embedding using a first set of features extracted for text of the challenge content, a spoken content embedding using a second set of features extracted for the speech signal, and a fakeprint using a third set one or more features extracted for one or more fraud artifacts of the speech signal; generate a content verification score based upon a distance between the text embedding and the spoken content embedding; execute a passive liveness detector having a set of layers of a machine-learning architecture to generate a passive liveness score for the inbound audio signal, the passive liveness detector trained to classify and score the input audio signal based upon the fakeprint extracted for the fraud artifacts of the inbound audio signal; generate a fused liveness score based upon the content verification score and the passive liveness score; and identify the inbound audio signal as genuine or fraudulent based upon comparing the fused liveness score against an overall risk threshold.

12. The system according to claim 11, wherein the computer is further configured to: extract an inbound voiceprint for the speech signal using a fourth set of one or more features extracted for one or more acoustic features of the speech signal of the inbound audio signal; and generate a speaker verification score for the speech signal indicating a speaker recognition likelihood that the speaker is an enrolled user based upon a second distance between the inbound voiceprint and an enrolled voiceprint.

13. The system according to claim 12, wherein the computer generates the fused liveness score further using the speaker verification score.

14. The system according to claim 11, wherein the computer is further configured to: generate one or more acoustic parameters corresponding to one or more types of degradation in the speech signal of the inbound audio signal; and generate a speech quality score for the speech signal based upon the one or more acoustic parameters.

15. The system according to claim 14, wherein when generating the content verification score the computer is further configured to calibrate the content verification score based upon the speech quality score.

16. The system according to claim 14, wherein the computer is further configured to: determine that the speech quality score for the speech signal fails a speech quality threshold; and transmit to the user device a request for an improved speech signal for the caller.

17. The system according to claim 11, wherein the computer is further configured to: extract an inbound audioprint using one or more features extracted from the audio signal; generate an audio replay score for the inbound audio signal indicating an audio recording recognition likelihood that the inbound audio signal matches a prior audio signal based upon a distance between the inbound audioprint and a stored audioprint for the prior audio signal.

18. The system according to claim 17, wherein the computer is further configured to identify the inbound audio signal as fraudulent, in response to determining that the audio replay score satis
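The claims describe a content verification score and an audio replay score as functions of a distance between embeddings, but the text shown here does not name the distance metric. The sketch below assumes cosine distance, assumes the text and spoken content embeddings share one vector space, and uses hypothetical function names; it illustrates the idea and is not the patented implementation.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, a value in [0, 2]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def similarity_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Rescale cosine distance to [0, 1], where 1 means same direction."""
    return 1.0 - cosine_distance(a, b) / 2.0


def content_verification_score(
    text_embedding: Sequence[float], spoken_embedding: Sequence[float]
) -> float:
    """Score how closely the spoken response matches the challenge text.

    Assumption: both embeddings live in a shared space.
    """
    return similarity_score(text_embedding, spoken_embedding)


def audio_replay_score(
    inbound_audioprint: Sequence[float],
    stored_audioprints: Sequence[Sequence[float]],
) -> float:
    """Return the best match between the inbound audioprint and any stored one.

    An empty store yields 0.0 because there is nothing to match against.
    """
    if not stored_audioprints:
        return 0.0
    return max(similarity_score(inbound_audioprint, s) for s in stored_audioprints)


if __name__ == "__main__":
    stored = [[0.0, 1.0], [1.0, 0.0]]
    inbound = [1.0, 0.0]
    score = audio_replay_score(inbound, stored)
    # Hypothetical replay detection threshold; a match at or above it is flagged.
    print("replay suspected" if score >= 0.95 else "no replay match")
    stored.append(inbound)  # keep the inbound audioprint for future comparisons
```

Taking the maximum over stored audioprints means a single close match to any prior recording is enough to raise the replay score.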
Conversation recording systems (at the subscriber's set H04M1/656) · CPC title
for comparison or discrimination · CPC title
Call or contact centers supervision arrangements · CPC title
Artificial neural networks; Connectionist approaches · CPC title
using biometric data, e.g. fingerprints, iris scans or voiceprints · CPC title