US2025029601A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025029601-A1 |
| Application number | US-202418769197-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 10, 2024 |
| Priority date | Jul 21, 2023 |
| Publication date | Jan 23, 2025 |
| Grant date | — |
Official abstract text for this publication.
In general, the disclosure describes techniques for detecting synthetic speech of a speaker. In an example, a machine learning system may be configured to generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, reference embeddings for the speaker that characterize a first set of acoustic features and a first set of phonetic features associated with the speaker. The machine learning system may further be configured to generate, using the deep learning model, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip. The machine learning system may further be configured to compute a score based on the test embedding and the reference embeddings. The machine learning system may further be configured to output, based on the score, an indication of whether the audio clip includes synthetic speech.
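The flow in the abstract (reference embeddings, a test embedding, a score, an indication) can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by some trained model and are plain lists of floats. Cosine similarity averaged over the references is a stand-in chosen for illustration, not the scoring method the publication specifies, and the function names, the 0.5 threshold, and the "low similarity means synthetic" direction are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm vector has no direction")
    return dot / (norm_a * norm_b)


def score_clip(test_embedding: Vector,
               reference_embeddings: Sequence[Vector]) -> float:
    """Average similarity of the test embedding to each reference embedding.

    Assumption: a simple similarity average replaces whatever comparison
    a real system would use.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(test_embedding, ref)
            for ref in reference_embeddings]
    return sum(sims) / len(sims)


def includes_synthetic_speech(score: float, threshold: float = 0.5) -> bool:
    """Flag the clip when it is too dissimilar from the speaker's references.

    Assumption: the threshold value and its direction are illustrative only.
    """
    return score < threshold


if __name__ == "__main__":
    references = [[1.0, 0.0], [0.9, 0.1]]   # enrolled speaker (toy values)
    clip = [0.1, 0.9]                        # clip under test (toy values)
    s = score_clip(clip, references)
    print(f"score={s:.3f} synthetic={includes_synthetic_speech(s)}")
```

Averaging over several references keeps the score tied to one enrolled speaker, which is the speaker-aware aspect the disclosure emphasizes. A production system would replace both the similarity measure and the threshold with trained, calibrated components.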
Opening claim text (preview).
What is claimed is:

1. A method for detecting synthetic speech of a speaker in an audio clip, comprising: generating, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the speaker, wherein the one or more reference embeddings characterize a first set of acoustic features and a first set of phonetic features associated with the speaker; generating, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; computing a score based on the test embedding and the one or more reference embeddings; and outputting, based on the score, an indication of whether the audio clip includes synthetic speech.

2. The method of claim 1, wherein generating the one or more reference embeddings for the speaker comprises: extracting, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combining the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generating the one or more reference embeddings based on the enrollment feature vector, wherein the one or more reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker.

3. The method of claim 1, wherein generating the test embedding for the audio clip comprises: extracting, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combining the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generating the test embedding based on the test feature vector.

4. The method of claim 1, wherein outputting the indication comprises: based on the score satisfying a threshold, outputting an indication that the audio clip includes synthetic speech.

5. The method of claim 1, wherein computing the score based on the test embedding and the one or more reference embeddings comprises: computing one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings.

6. The method of claim 1, wherein computing the score comprises: computing a raw score based on a comparison of the test embedding to the one or more reference embeddings; and computing, based on a calibration of the raw score, the score.

7. The method of claim 1, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals.

8. The method of claim 1, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals.

9. The method of claim 1, further comprising: training, based on training data, a deep learning model to generate the one or more reference embeddings and generate the test embedding, wherein the training data includes sample speech clips labeled for authentic speech and synthetic speech.

10. The method of claim 9, wherein the deep learning model includes a residual network architecture.

11. A computing system comprising processing circuitry and memory for executing a machine learning system, the machine learning system configured to: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings characterize a first set of acoustic features and a first set of phonetic features associated with the registered speaker; generate, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; compute a score based on the test embedding and the one or more reference embeddings; and output, based on the score, an indication of whether the audio clip includes synthetic speech.

12. The computing system of claim 11, wherein to generate the one or more reference embeddings for the speaker, the machine learning system is configured to: extract, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combine the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generate the one or more reference embeddings based on the enrollment feature vector, wherein the reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker.

13. The computing system of claim 11, wherein to generate the test embedding for the audio clip, the machine learning system is configured to: extract, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combine the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generate the test embedding based on the test feature vector.

14. The computing system of claim 11, wherein to output the indication, the machine learning system is configured to output, based on the score satisfying a threshold, an indication that the audio clip includes synthetic speech.

15. The computing system of claim 11, wherein to compute the score, the machine learning system is configured to compute one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings.

16. The computing system of claim 11, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals.

17. The computing system of claim 11, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals.

18. Computer-readable storage media comprising machine readable instructions for configuring processing circuitry to: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings characterize a first set of acoustic features and a first set of phonetic features associated with the registered speaker; generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; compute a score based on the test embedding and the one or more reference embeddings; and output, based on the score, an indication of whether the audio clip includes syntheti
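The dependent claims mention three smaller steps: combining acoustic and phonetic features into one feature vector, calibrating a raw score, and comparing the result to a threshold. The sketch below shows one way these could fit together. It assumes that "combining" means simple concatenation, that calibration is an affine map with hypothetical `scale` and `offset` parameters, and that a higher calibrated score means "more likely synthetic". None of these choices is stated in the claims.

```python
from typing import List, Sequence


def combine_features(acoustic: Sequence[float],
                     phonetic: Sequence[float]) -> List[float]:
    """Build a single feature vector from the two feature sets.

    Assumption: concatenation; the claims only say "combining".
    """
    return list(acoustic) + list(phonetic)


def calibrate(raw_score: float, scale: float = 1.0,
              offset: float = 0.0) -> float:
    """Map a raw comparison score to a calibrated score.

    Assumption: an affine transform whose parameters would be fit on
    held-out labeled data; the values here are placeholders.
    """
    return scale * raw_score + offset


def is_synthetic(calibrated_score: float, threshold: float = 0.0) -> bool:
    """Return True when the calibrated score meets the threshold.

    Assumption: higher scores indicate synthetic speech.
    """
    return calibrated_score >= threshold


if __name__ == "__main__":
    enrollment_vector = combine_features([0.2, 0.7], [0.1, 0.4, 0.9])
    print(len(enrollment_vector))            # length of the combined vector
    score = calibrate(raw_score=1.5, scale=2.0, offset=-1.0)
    print(score, is_synthetic(score))
```

Keeping calibration separate from the raw comparison allows one threshold to be reused across recording conditions, which is a common reason to calibrate scores.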
Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title
for comparison or discrimination · CPC title
Training, enrolment or model building · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title