Detecting synthetic speech

US2025029601A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025029601-A1
Application number: US-202418769197-A
Country: US
Kind code: A1
Filing date: Jul 10, 2024
Priority date: Jul 21, 2023
Publication date: Jan 23, 2025
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In general, the disclosure describes techniques for detecting synthetic speech of a speaker. In an example, a machine learning system may be configured to generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, reference embeddings for the speaker that characterize a first set of acoustic features and a first set of phonetic features associated with the speaker. The machine learning system may further be configured to generate, using the deep learning model, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip. The machine learning system may further be configured to compute a score based on the test embedding and the reference embeddings. The machine learning system may further be configured to output, based on the score, an indication of whether the audio clip includes synthetic speech.
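
As a rough illustration of the last two steps in the abstract (computing a score and turning it into an indication), the sketch below scores a test embedding against a speaker's reference embeddings. It is not the patent's method: cosine similarity, averaging over the references, the 0.5 threshold, the rule that a low score flags synthetic speech, and all function names are assumptions made for illustration. Embeddings are passed in as plain lists of floats; the deep learning model that would produce them is out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def score_clip(test_embedding: Sequence[float],
               reference_embeddings: Sequence[Sequence[float]]) -> float:
    """Average similarity of the test embedding to the speaker's references.

    Assumption: the patent only says a score is computed from the test and
    reference embeddings; mean cosine similarity is a stand-in.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(test_embedding, ref) for ref in reference_embeddings]
    return sum(sims) / len(sims)


def is_synthetic(score: float, threshold: float = 0.5) -> bool:
    """Assumed convention: a score below the threshold flags synthetic speech."""
    return score < threshold


# Hypothetical 3-dimensional embeddings for one enrolled speaker.
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(is_synthetic(score_clip([1.0, 0.0, 0.0], references)))  # False
print(is_synthetic(score_clip([0.0, 0.0, 1.0], references)))  # True
```

In a real system the threshold would be tuned on held-out data rather than fixed by hand.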

First claim

Opening claim text (preview).

What is claimed is:

1. A method for detecting synthetic speech of a speaker in an audio clip, comprising: generating, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the speaker, wherein the one or more reference embeddings characterize a first set of acoustic features and a first set of phonetic features associated with the speaker; generating, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; computing a score based on the test embedding and the one or more reference embeddings; and outputting, based on the score, an indication of whether the audio clip includes synthetic speech.

2. The method of claim 1, wherein generating the one or more reference embeddings for the speaker comprises: extracting, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combining the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generating the one or more reference embeddings based on the enrollment feature vector, wherein the one or more reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker.

3. The method of claim 1, wherein generating the test embedding for the audio clip comprises: extracting, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combining the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generating the test embedding based on the test feature vector.

4. The method of claim 1, wherein outputting the indication comprises: based on the score satisfying a threshold, outputting an indication that the audio clip includes synthetic speech.

5. The method of claim 1, wherein computing the score based on the test embedding and the one or more reference embeddings comprises: computing one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings.

6. The method of claim 1, wherein computing the score comprises: computing a raw score based on a comparison of the test embedding to the one or more reference embeddings; and computing, based on a calibration of the raw score, the score.

7. The method of claim 1, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals.

8. The method of claim 1, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals.

9. The method of claim 1, further comprising: training, based on training data, a deep learning model to generate the one or more reference embeddings and generate the test embedding, wherein the training data includes sample speech clips labeled for authentic speech and synthetic speech.

10. The method of claim 9, wherein the deep learning model includes a residual network architecture.

11. A computing system comprising processing circuitry and memory for executing a machine learning system, the machine learning system configured to: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings specify characterize a first set of acoustic features and a first set of phonetic features associated with the registered speaker; generate, using the deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that specifies characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; compute a score based on the test embedding and the one or more reference embeddings; and output, based on the score, an indication of whether the audio clip includes synthetic speech.

12. The computing system of claim 11, wherein to generate the one or more reference embeddings for the speaker, the machine learning system is configured to: extract, based on one or more sample audio clips of the speaker speaking, the first set of acoustic features and the first set of phonetic features; combine the first set of acoustic features and the first set of phonetic features to generate an enrollment feature vector; and generate the one or more reference embeddings based on the enrollment feature vector, wherein the reference embeddings include speaker specific information, and wherein computing the score based on the one or more reference embeddings is more speaker aware in order to distinguish between synthetic speech and authentic speech of the speaker.

13. The computing system of claim 11, wherein to generate the test embedding for the audio clip, the machine learning system is configured to: extract, based on the audio clip, the second set of acoustic features and the second set of phonetic features; combine the second set of acoustic features and the second set of phonetic features to generate a test feature vector; and generate the test embedding based on the test feature vector.

14. The computing system of claim 11, wherein to output the indication, the machine learning system is configured to output, based on the score satisfying a threshold, an indication that the audio clip includes synthetic speech.

15. The computing system of claim 11, wherein to compute the score, the machine learning system is configured to compute one or more log-likelihood ratios by comparing the test embedding to the one or more reference embeddings.

16. The computing system of claim 11, wherein the first set of acoustic features and the second set of acoustic features correspond to features associated with characteristics of frequency components of audio signals.

17. The computing system of claim 11, wherein the first set of phonetic features and the second set of phonetic features correspond to features associated with characteristics of phones or phonemes included in audio signals.

18. Computer-readable storage media comprising machine readable instructions for configuring processing circuitry to: generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, one or more reference embeddings for the registered speaker, wherein the one or more reference embeddings specify characterize a first set of acoustic features and a first set of phonetic features associated with the registered speaker; generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, a test embedding for an audio clip that specifies characterizes a second set of acoustic features and a second set of phonetic features associated with the audio clip; compute a score based on the test embedding and the one or more reference embeddings; and output, based on the score, an indication of whether the audio clip includes syntheti
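
Claims 4 to 6 mention a threshold, log-likelihood ratios, and calibration of a raw score, but the preview does not say how each is computed. The sketch below shows one way such pieces could fit together, under loudly stated assumptions: the raw input is a single similarity value between the test embedding and the references, authentic and synthetic clips are each modelled by a 1-D Gaussian over that value, and calibration is an affine map. The Gaussian parameters, the calibration constants, and every function name are hypothetical and are not taken from the patent.

```python
import math
from typing import Tuple


def gaussian_log_pdf(x: float, mean: float, std: float) -> float:
    """Log density of a 1-D Gaussian at x."""
    if std <= 0.0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def raw_llr(similarity: float,
            synth: Tuple[float, float] = (0.3, 0.1),
            auth: Tuple[float, float] = (0.8, 0.1)) -> float:
    """log p(similarity | synthetic) - log p(similarity | authentic).

    `synth` and `auth` are (mean, std) pairs; the defaults are made-up
    values standing in for statistics estimated from labeled clips.
    """
    return gaussian_log_pdf(similarity, *synth) - gaussian_log_pdf(similarity, *auth)


def calibrate(raw: float, scale: float = 1.0, offset: float = 0.0) -> float:
    """Affine calibration of a raw score (scale and offset are hypothetical)."""
    return scale * raw + offset


def includes_synthetic_speech(score: float, threshold: float = 0.0) -> bool:
    """Assumed convention: a calibrated score at or above the threshold
    is reported as synthetic speech."""
    return score >= threshold


# A clip that is dissimilar to the speaker's references (similarity 0.35)
# versus one that is close to them (similarity 0.75).
for similarity in (0.35, 0.75):
    score = calibrate(raw_llr(similarity), scale=0.5, offset=-1.0)
    print(similarity, includes_synthetic_speech(score))
```

With these made-up parameters the ratio is positive when the similarity is closer to the synthetic mean than to the authentic mean, so the first clip is flagged and the second is not.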

Assignees

  • Stanford Res Inst Int

Inventors

Classifications

  • Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title

  • for comparison or discrimination · CPC title

  • Training, enrolment or model building · CPC title

  • Artificial neural networks; Connectionist approaches · CPC title

  • G10L17/26 · Primary

    Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025029601A1 cover?
In general, the disclosure describes techniques for detecting synthetic speech of a speaker. In an example, a machine learning system may be configured to generate, using a deep learning model trained to distinguish between synthetic speech and authentic speech, reference embeddings for the speaker that characterize a first set of acoustic features and a first set of phonetic features associate…
Who is the assignee on this patent?
Stanford Res Inst Int
What technology area does this patent fall under?
Primary CPC classification G10L17/26. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 23, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).