Voice conversion method and related device

US12475878B2 · US · B2

Patent metadata
Publication number: US-12475878-B2
Application number: US-202318186285-A
Country: US
Kind code: B2
Filing date: Mar 20, 2023
Priority date: Sep 21, 2020
Publication date: Nov 18, 2025
Grant date: Nov 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A voice conversion method and a related device are provided to implement diversified human voice beautification. The method includes receiving a mode selection operation input by a user, where the mode selection operation is for selecting a voice conversion mode. A plurality of provided selectable modes include a style conversion mode, for performing speaking style conversion on a to-be-converted first voice; a dialect conversion mode, for adding an accent to or removing an accent from the first voice; and a voice enhancement mode, for implementing voice enhancement on the first voice. The three modes have corresponding voice conversion networks. Based on a target conversion mode selected by the user, a target voice conversion network corresponding to the target conversion mode is selected to convert the first voice, and output a second voice obtained through conversion.
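The mode-to-network selection described in the abstract can be pictured as a small dispatch table. The following is a minimal sketch only, not the patented implementation: the names `ConversionMode`, `Voice`, `extract_features`, `make_stub_network`, `NETWORKS`, and `convert` are hypothetical, and the three "networks" are stubs that label their output instead of running any neural model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class ConversionMode(Enum):
    """The three selectable modes named in the abstract."""
    STYLE = "style"
    DIALECT = "dialect"
    ENHANCEMENT = "enhancement"


@dataclass
class Voice:
    """Hypothetical container for audio samples plus a descriptive label."""
    samples: List[float]
    label: str = "input"


def extract_features(voice: Voice) -> List[float]:
    # Assumption: a real system would compute acoustic or content features
    # here. This stub just copies the samples.
    return list(voice.samples)


def make_stub_network(mode: ConversionMode) -> Callable[[List[float]], Voice]:
    # Assumption: stands in for a trained network. It returns the features
    # unchanged and tags the result with the mode that produced it.
    def network(features: List[float]) -> Voice:
        return Voice(samples=list(features), label=mode.value)
    return network


# One network per mode, mirroring "the three modes have corresponding
# voice conversion networks".
NETWORKS: Dict[ConversionMode, Callable[[List[float]], Voice]] = {
    mode: make_stub_network(mode) for mode in ConversionMode
}


def convert(mode_selection: str, first_voice: Voice) -> Voice:
    """Select the network for the chosen mode and convert the first voice."""
    target_mode = ConversionMode(mode_selection)  # ValueError if unknown
    target_network = NETWORKS[target_mode]
    features = extract_features(first_voice)
    return target_network(features)


second_voice = convert("dialect", Voice(samples=[0.1, -0.2, 0.3]))
print(second_voice.label)
```

Keeping the mapping in a dictionary keyed by an enum means an unsupported selection fails early with a `ValueError` instead of silently falling through to a default network.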

First claim

Opening claim text (preview).

What is claimed is:

1. A voice conversion method, comprising: receiving a mode selection operation input by a user, wherein the mode selection operation is for selecting a voice conversion mode; selecting a target conversion mode from a plurality of conversion modes based on the mode selection operation, wherein the plurality of conversion modes comprises a style conversion mode, a dialect conversion mode, and a voice enhancement mode, and each of the plurality of conversion modes corresponds to one of multiple voice processing neural networks comprising a style conversion network, a dialect conversion network, and a voice enhancement network; selecting, among the multiple voice processing neural networks, a target voice conversion network corresponding to the selected target conversion mode: obtaining a to-be-converted first voice; extracting feature information of the first voice; inputting the feature information of the first voice into target voice conversion network corresponding to the selected target conversion mode, and outputting, over the target voice conversion network, a second voice obtained through conversion; and outputting the second voice.

2. The method according to claim 1, wherein the extracting feature information of the first voice comprises: inputting the feature information of the first voice into a voice feature extraction model, and extracting a phoneme posteriorgram (PPG) feature of the first voice by using the voice feature extraction model, wherein the PPG feature is for retaining content information of the first voice.

3. The method according to claim 1, wherein the target conversion mode is the style conversion mode, the target voice conversion network is a style conversion network including a style separation model and a voice fusion model, and the method further comprises: obtaining a third voice for extracting a style feature; inputting the third voice into the style separation model, and separating the style feature of the third voice by using the style separation model; and the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting, over the target voice conversion network, a second voice obtained through conversion comprises: inputting the style feature and the feature information of the first voice into the voice fusion model for fusion, to obtain the second voice.

4. The method according to claim 3, wherein the style feature comprises a first feature including a plurality of sub-features; and the inputting the third voice into the style separation model, and separating the style feature of the third voice by using the style separation model comprises: inputting the third voice into the style separation model, and extracting a vector of the first feature of the third voice by using the style separation model; inputting the third voice into a sub-feature extraction model, and extracting a vector of each of the plurality of sub-features by using the sub-feature extraction model; receiving a weight of each of the plurality of sub-features that is input by the user; and determining the style feature of the third voice based on the vector of the first feature, and the vector and weight of each of the plurality of sub-features.

5. The method according to claim 4, wherein the determining the style feature of the third voice based on the vector of the first feature, and the vector and weight of each of the plurality of sub-features comprises: inputting the vector of the first feature into a multi-head attention structure, inputting the vector of each of the plurality of sub-features and a product of the vector of each of the plurality of sub-features and the weight corresponding to the sub-feature into the multi-head attention structure, and outputting the style feature of the third voice by using the multi-head attention structure.

6. The method according to claim 3, wherein the obtaining a third voice for extracting a style feature comprises: receiving a template selection operation input by the user, wherein the template selection operation is for selecting a target template; and obtaining a voice corresponding to the target template, and using the voice corresponding to the target template as the third voice.

7. The method according to claim 3, wherein the obtaining a third voice for extracting a style feature comprises: receiving the third voice input by a second speaker, wherein the first voice is a voice of a first speaker, and the second speaker is a person different from the first speaker.

8. The method according to claim 1, wherein the target conversion mode is the dialect conversion mode, the target voice conversion network is a dialect conversion network, the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting, over the target voice conversion network, a second voice obtained through conversion comprises: inputting the feature information of the first voice into the dialect conversion network, and outputting the second voice over the dialect conversion network, wherein the first voice is a voice of a first dialect, and the second voice is a voice of a second dialect.

9. The method according to claim 8, wherein the dialect conversion network comprises a plurality of dialect conversion models, each of which is for a different dialect to be converted, and the method further comprises: receiving a selection operation input by the user; and inputting the feature information of the first voice into a dialect conversion model corresponding to the selection operation, and outputting the second voice by using the dialect conversion model corresponding to the selection operation.

10. The method according to claim 8, further comprising: inputting the first voice into a style separation model, and separating a style feature of the first voice by using the style separation model; and the inputting the feature information of the first voice into the dialect conversion network, and outputting the second voice over the dialect conversion network comprises: inputting the style feature of the first voice and the feature information of the first voice into the dialect conversion network, and outputting the second voice over the dialect conversion network, wherein a style of the second voice is the same as that of the first voice.

11. The method according to claim 1, wherein the first voice is a far-field voice, the target conversion mode is the voice enhancement mode, the target voice conversion network is a voice enhancement model, the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting, over the target voice conversion network, a second voice obtained through conversion comprises: inputting the feature information of the first voice into a voice enhancement model corresponding to the mode, and outputting the second voice by using the voice enhancement model, wherein the second voice is a near-field voice.

12. The method according to claim 11, further comprising: inputting the first voice into a style separation model, and separating a style feature of the first voice by using the style separation model; and inputting the feature information of the first voice into a voice enhancement model corresponding to the mode, and outputting the second voice by using the voice enhancement model comprises: inputting the style feature of the first voice and the feature information of the f…
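Claims 4 and 5 describe combining a first-feature vector with user-weighted sub-feature vectors through a multi-head attention structure. The sketch below is one possible reading, not the patent's actual model: it assumes the first-feature vector acts as the query, the sub-feature vectors as keys, and the weight-scaled sub-feature vectors as values, and it omits the learned projection matrices a trained attention layer would have. The names `softmax`, `attend`, and `style_feature` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: Sequence[Vector],
           values: Sequence[Vector]) -> Vector:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    probs = softmax(scores)
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]


def style_feature(first_feature: Vector, sub_features: Sequence[Vector],
                  weights: Sequence[float], num_heads: int = 2) -> Vector:
    """Combine sub-features into one style vector (assumed interpretation).

    Each head sees one contiguous slice of every vector. No learned
    projections are applied, which is a simplification for illustration.
    """
    dim = len(first_feature)
    if dim % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    if len(sub_features) != len(weights):
        raise ValueError("need exactly one weight per sub-feature")

    # Product of each sub-feature vector and its user-supplied weight.
    weighted = [[w * x for x in vec] for vec, w in zip(sub_features, weights)]

    head_dim = dim // num_heads
    output: Vector = []
    for head in range(num_heads):
        start, stop = head * head_dim, (head + 1) * head_dim
        output.extend(attend(
            first_feature[start:stop],
            [vec[start:stop] for vec in sub_features],
            [vec[start:stop] for vec in weighted],
        ))
    return output


# Two hypothetical sub-features; the user emphasises the first one.
vector = style_feature(
    first_feature=[0.5, 0.1, 0.3, 0.2],
    sub_features=[[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]],
    weights=[0.8, 0.2],
)
print(len(vector))
```

Under this reading, setting a sub-feature's weight to zero removes its contribution to the output values while it still takes part in the attention scores. Whether the patent intends that behaviour is not stated in the claim text shown here.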

Assignees

Huawei Tech Co Ltd

Inventors

Classifications

  • Phonemes, fenemes or fenones being the recognition units

  • Feature extraction for speech recognition; Selection of recognition unit

  • Details of speech synthesis systems, e.g. synthesiser structure or memory management

  • Changing voice quality, e.g. pitch or formants

  • for voice messaging, e.g. dictaphones (for answering incoming calls H04M1/64)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12475878B2 cover?
A voice conversion method and a related device are provided to implement diversified human voice beautification. The method includes receiving a mode selection operation input by a user, where the mode selection operation is for selecting a voice conversion mode. A plurality of provided selectable modes include a style conversion mode, for performing speaking style conversion on a to-be-convert…
Who is the assignee on this patent?
Huawei Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/033. Mapped technology areas include Physics.
When was this patent published?
Published Nov 18, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).