Acoustic model learning device, voice synthesis device, and program
US-2022051655-A1 · Feb 17, 2022 · US
US2021390944A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021390944-A1 |
| Application number | US-202117341082-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 7, 2021 |
| Priority date | Jun 12, 2020 |
| Publication date | Dec 16, 2021 |
| Grant date | — |
Official abstract text for this publication.
A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include speaker voice parameters, accents, and attitudes, among others. Training can be done by transfer learning from an existing neural speech synthesis model or such a model can be trained with a loss function that considers speech and parameter values. A graphical user interface can allow voice designers for products to synthesize speech with a desired voice or generate a speech synthesis engine with frozen voice parameters. A vector of parameters can be used for comparison to previously registered voices in databases such as ones for trademark registration.
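The training signal described in the abstract can be illustrated with a small sketch. It assumes voice property values scaled to the range 0 to 1 and uses a mean squared difference between each requested value and the probability the discriminator assigns to the synthesized audio. The squared-error form, the function name `property_loss`, and the sample numbers are assumptions for illustration; the publication only states that the loss considers speech and parameter values. Back-propagation through the synthesis model is not shown.

```python
from typing import Sequence


def property_loss(requested: Sequence[float], inferred: Sequence[float]) -> float:
    """Mean squared difference between requested voice property values and
    the probabilities a discriminator infers from the synthesized audio.

    Assumption: squared error is one possible choice of difference measure.
    """
    if len(requested) != len(inferred):
        raise ValueError("one inferred probability is needed per requested value")
    if not requested:
        raise ValueError("at least one sample is required")
    squared = [(r - p) ** 2 for r, p in zip(requested, inferred)]
    return sum(squared) / len(squared)


# Hypothetical batch: the values fed to the synthesizer, and the
# probabilities a (not shown) discriminator computed for its output.
requested = [0.0, 0.5, 1.0]
inferred = [0.1, 0.5, 0.7]
loss = property_loss(requested, inferred)
print(f"property-learning loss: {loss:.4f}")
```

A loss of zero would mean the discriminator infers exactly the requested value for every synthesized sample, which is the behavior the training aims for.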
Opening claim text (preview).
1. A computerized process of training a neural speech synthesis model that can generate speech audio conditioned on a value of a voice property, the computerized process comprising: obtaining source samples of speech audio; labeling the source samples with discrete values of a voice property; training, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and training the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples, computing corresponding probabilities for the synthesized speech samples using the discriminator, and computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.
2. A speech synthesis model obtained by the computerized process of claim 1.
3. The computerized process of claim 1, wherein the synthesizing uses a transcription of source samples, the computerized process further comprising: computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.
4. A speech synthesis model obtained by the computerized process of claim 3.
5. The computerized process of claim 1, wherein the source samples of the speech audio are obtained from one of a person and an audio generation system.
6. The computerized process of claim 1, wherein the voice property includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.
7. A computerized method of synthesizing speech audio, the computerized method comprising: receiving a string of text and at least one voice property value with a perceptible meaning; synthesizing speech audio corresponding to the string of text using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio; and outputting the synthesized speech audio, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value.
8. The computerized method of claim 7, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.
9. The computerized method of claim 7, further comprising: enabling download of the synthesized speech audio.
10. The computerized method of claim 7, further comprising: enabling playback of the synthesized speech audio.
11. The computerized method of claim 7, wherein the string of text is associated with at least one text tag.
12. The computerized method of claim 7, wherein the string of text indicates dynamically configurable voice parameter values.
13. The computerized method of claim 7, further comprising: providing a graphical user interface that includes one of a text input field or a voice property value input field.
14. A computerized method of configuring a speech synthesizer, the computerized method comprising: receiving at least one voice property value; generating code for execution by a computer, the code implementing a neural network wherein a node in a hidden layer includes, in its summation, a constant term derived from a product of the at least one voice property value and a weight learned from a training process; and outputting the code, wherein the code implements a speech synthesis function within the speech synthesizer.
15. The computerized method of claim 14, wherein the code is in a binary format.
16. The computerized method of claim 14, wherein the at least one voice property value constitutes a voice property vector, the computerized method further comprising: reading at least one stored voice property vector from a brand database; and computing a distance between the at least one stored voice property vector and the voice property vector to generate a computed distance.
17. The computerized method of claim 16, further comprising: determining that the computed distance satisfies a threshold distance; and generating an error message.
18. The computerized method of claim 16, further comprising: determining that the computed distance fails to satisfy a threshold distance; and storing the least one voice property value in the brand database.
19. A computerized method of examining trademarks, the computerized method comprising: receiving a specimen of speech audio with an application for trademark registration; applying a discriminator of a plurality of voice property values to the specimen to compute a voice property vector; computing distances between the voice property vector and other voice property vectors stored in a database; and determining allowability of the application in dependence upon a smallest distance being greater than a threshold.
20. The computerized method of claim 19, wherein computing the distances includes computing a cosine distance between the voice property vector and the other voice property vectors stored in the database.
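Two of the claimed computations are small enough to sketch: the constant term of claim 14 and the distance check of claims 19 and 20. The sketch below is illustrative only. The function names, the in-memory dictionary standing in for the database, the sample vectors, the threshold, and the rule that an empty database allows any application are all assumptions that do not come from the publication.

```python
import math
from typing import Dict, Sequence, Tuple


def folded_constant(bias: float, values: Sequence[float], weights: Sequence[float]) -> float:
    """Claim 14 idea: with voice property values fixed, their products with
    learned weights collapse into one constant in a hidden node's summation."""
    if len(values) != len(weights):
        raise ValueError("one learned weight is needed per voice property value")
    return bias + sum(v * w for v, w in zip(values, weights))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 minus cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_registered(candidate: Sequence[float],
                       registry: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the closest registered voice and its distance to the candidate."""
    pairs = ((name, cosine_distance(candidate, vec)) for name, vec in registry.items())
    return min(pairs, key=lambda pair: pair[1])


def is_allowable(candidate: Sequence[float],
                 registry: Dict[str, Sequence[float]],
                 threshold: float) -> bool:
    """Claim 19 idea: allowable when the smallest distance exceeds a threshold.
    Assumption: an empty registry allows any candidate."""
    if not registry:
        return True
    _, smallest = nearest_registered(candidate, registry)
    return smallest > threshold


# Hypothetical two-property voice vectors and threshold.
registry = {"brand_a": [1.0, 0.0], "brand_b": [0.0, 1.0]}
print(is_allowable([1.0, 0.0], registry, threshold=0.1))  # matches brand_a exactly
print(is_allowable([1.0, 1.0], registry, threshold=0.1))  # equally far from both
print(folded_constant(0.5, [0.2, 1.0], [2.0, -1.0]))
```

Folding the products into a constant means the generated code no longer needs the voice property inputs at run time, which fits the abstract's description of a synthesis engine with frozen voice parameters.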
Combinations of networks · CPC title
Activation functions · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Supervised learning · CPC title
Transfer learning · CPC title