Automatic speech recognition with voice personalization and generalization

US12505830B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12505830-B2
Application number  US-202218046137-A
Country             US
Kind code           B2
Filing date         Oct 12, 2022
Priority date       Oct 12, 2022
Publication date    Dec 23, 2025
Grant date          Dec 23, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A voice morphing model can transform diverse voices to one or a small number of target voices. Speech recognition on diverse voices can be performed by morphing it to a target voice and then performing recognition on audio with the target voice. A source of requests for speech recognition can pass audio and a voiceprint with requests. Speech recognition can run with improved accuracy by biasing an acoustic model for the voice in the audio using the voiceprint. The audio can be used to calculate a new voiceprint, which can be used to update the voiceprint included with the audio. The updated voiceprint can be sent back to the source and then used with future speech recognition requests.
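
The voiceprint round trip described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a voiceprint is a fixed-length list of floats and that the update is an exponential moving average, neither of which the abstract specifies. The names `update_voiceprint`, `handle_request`, `RecognitionResponse`, and the `recognize` and `calc_voiceprint` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Voiceprint = List[float]  # assumption: a fixed-length embedding vector


@dataclass
class RecognitionResponse:
    transcript: str
    voiceprint: Voiceprint  # updated voiceprint sent back to the request source


def update_voiceprint(
    old: Optional[Sequence[float]],
    new: Sequence[float],
    weight: float = 0.2,
) -> Voiceprint:
    """Blend the stored voiceprint with one calculated from the new audio.

    `weight` is the share given to the new voiceprint. The moving-average
    rule is an assumption made for this sketch.
    """
    if old is None:
        return list(new)  # first request: nothing to blend with
    if len(old) != len(new):
        raise ValueError("voiceprints must have the same length")
    return [(1.0 - weight) * o + weight * n for o, n in zip(old, new)]


def handle_request(
    audio: Sequence[float],
    voiceprint: Optional[Sequence[float]],
    recognize: Callable[[Sequence[float], Optional[Sequence[float]]], str],
    calc_voiceprint: Callable[[Sequence[float]], Voiceprint],
    weight: float = 0.2,
) -> RecognitionResponse:
    """Recognize speech biased by the caller's voiceprint, then refresh it."""
    transcript = recognize(audio, voiceprint)  # acoustic model biased by voiceprint
    fresh = calc_voiceprint(audio)             # new voiceprint from this audio
    return RecognitionResponse(transcript, update_voiceprint(voiceprint, fresh, weight))
```

In this sketch the source of the request stores the returned `voiceprint` and passes it along with its next request, so the server itself does not need to keep per-user state.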

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method of training an acoustic model, the method comprising: obtaining a voiceprint calculator that calculates a score for the distance between a voice in speech audio and a target voice; training a voice morphing model to morph speech audio to the target voice, the training using speech audio of multiple distinct voices with a loss function dependent on the score; training an acoustic model on transcribed speech in the target voice; and tuning the voice morphing model and acoustic model by backpropagation of error reduction based on a measurement of the error rate of phoneme inference, wherein the acoustic model can infer phonemes from audio morphed by the voice morphing model.

2. The method of claim 1 wherein the transcribed speech in the target voice is from a single speaker without morphing.

3. The method of claim 1 wherein the transcribed speech in the target voice is generated by morphing speech audio of multiple distinct voices.

4. The method of claim 3 further comprising: finetuning the voice morphing model with a loss function dependent on an error rate of the acoustic model when run on the morphed audio of transcribed speech.

5. The method of claim 1 further comprising tuning the voice morphing model while keeping the acoustic model fixed.

6. The method of claim 1 further comprising tuning the acoustic model while keeping the voice morphing model fixed.

7. The method of claim 1 further comprising measuring the amount of noise in the morphed speech audio, wherein the loss function further depends on the amount of noise.

8. A computer implemented method of phoneme inference, the method comprising: calculating a plurality of scores for the distances between a voice in speech audio from multiple distinct voices and a target voice; training a voice morphing model to morph speech audio to the target voice, the training using speech audio of the multiple distinct voices with a loss function dependent on the scores; morphing audio of sampled speech to a target voice using the voice morphing model to generate morphed audio; and inferring a sequence of phonemes from the morphed audio using an acoustic model, wherein the acoustic model has an accuracy bias in favor of the target voice.

9. The method of claim 8 wherein the acoustic model is conditioned by a choice of the target voice from among a plurality of target voices.

10. A computer-implemented method of training an acoustic model, the method comprising: obtaining a voiceprint calculator that calculates a plurality of scores for the distances between each voice of multiple distinct voices in speech audio and a target voice; training a voice morphing model to morph speech audio to the target voice, the training using speech audio of the multiple distinct voices with a loss function dependent on the scores; and training an acoustic model on transcribed speech in the target voice, wherein the acoustic model can infer phonemes from audio morphed by the voice morphing model.

11. The method of claim 1 wherein the transcribed speech in the target voice is from a single speaker without morphing.

12. The method of claim 1 wherein the transcribed speech in the target voice is generated by morphing speech audio of multiple distinct voices.

13. The method of claim 12 further comprising: finetuning the voice morphing model with a loss function dependent on an error rate of the acoustic model when run on the morphed audio of transcribed speech.

14. The method of claim 1 further comprising tuning the voice morphing model while keeping the acoustic model fixed.

15. The method of claim 1 further comprising tuning the acoustic model while keeping the voice morphing model fixed.

16. The method of claim 1 further comprising tuning the voice morphing model and acoustic model by backpropagation of error reduction based on a measurement of the error rate of phoneme inference.

17. The method of claim 1 further comprising measuring the amount of noise in the morphed speech audio, wherein the loss function further depends on the amount of noise.
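
The claims describe a morphing loss that depends on a voiceprint distance score and, in some dependent claims, on a noise measurement and an acoustic-model error rate. A rough sketch of how such terms might be combined is shown below. It assumes cosine distance as the score and a simple weighted sum, neither of which the claims specify. The function names and the `noise_weight` and `error_weight` parameters are hypothetical, and the code works on plain floats for illustration, whereas real training by backpropagation would need a differentiable framework.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance score between two voiceprints (assumed metric: 1 - cosine similarity)."""
    if len(a) != len(b):
        raise ValueError("voiceprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("voiceprints must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def morphing_loss(
    morphed_voiceprint: Sequence[float],
    target_voiceprint: Sequence[float],
    noise_level: float = 0.0,
    phoneme_error_rate: float = 0.0,
    noise_weight: float = 0.1,
    error_weight: float = 1.0,
) -> float:
    """Combine the three quantities named in the claims into one scalar.

    - distance between the morphed voice and the target voice
    - measured noise in the morphed audio
    - error rate of the acoustic model on the morphed audio
    The weighted sum and its default weights are assumptions of this sketch.
    """
    distance = cosine_distance(morphed_voiceprint, target_voiceprint)
    return distance + noise_weight * noise_level + error_weight * phoneme_error_rate
```

With the default arguments the loss reduces to the distance score alone, which corresponds to the simplest case in the claims. Passing a non-zero `noise_level` or `phoneme_error_rate` adds the optional terms.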

Assignees

Inventors

Classifications

  • Training · CPC title

  • Voice conversion or morphing · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • G10L15/18 (Primary)

    using natural language modelling · CPC title

  • G10L21/007 (Primary)

    characterised by the process used · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12505830B2 cover?
A voice morphing model can transform diverse voices to one or a small number of target voices. Speech recognition on diverse voices can be performed by morphing it to a target voice and then performing recognition on audio with the target voice. A source of requests for speech recognition can pass audio and a voiceprint with requests. Speech recognition can run with improved accuracy by biasing…
Who is the assignee on this patent?
Soundhound Inc, Soundhound Ai Ip Llc
What technology area does this patent fall under?
Primary CPC classification G10L15/18. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 23, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).