Configurable neural speech synthesis

US2021390944A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021390944-A1
Application number  US-202117341082-A
Country             US
Kind code           A1
Filing date         Jun 7, 2021
Priority date       Jun 12, 2020
Publication date    Dec 16, 2021
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include speaker voice parameters, accents, and attitudes, among others. Training can be done by transfer learning from an existing neural speech synthesis model or such a model can be trained with a loss function that considers speech and parameter values. A graphical user interface can allow voice designers for products to synthesize speech with a desired voice or generate a speech synthesis engine with frozen voice parameters. A vector of parameters can be used for comparison to previously registered voices in databases such as ones for trademark registration.
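
The training signal described in the abstract can be illustrated with a small sketch. It assumes a mean squared difference between the requested voice property values and the probabilities the discriminator reports for the synthesized audio; the publication only says the loss depends on those differences, so the squared-error form, the function name property_matching_loss, and the sample numbers are all hypothetical. The sketch computes the scalar loss only; in training, its gradient would be back-propagated into the synthesis model's weights.

```python
from typing import Sequence


def property_matching_loss(target_values: Sequence[float],
                           discriminator_probs: Sequence[float]) -> float:
    """Mean squared difference between requested voice property values
    and the probabilities a discriminator infers from synthesized audio.

    The squared-error form is an assumption made for illustration.
    """
    if len(target_values) != len(discriminator_probs):
        raise ValueError("need one discriminator probability per target value")
    if not target_values:
        raise ValueError("need at least one value")
    squared = [(t - p) ** 2 for t, p in zip(target_values, discriminator_probs)]
    return sum(squared) / len(squared)


# Hypothetical numbers: the property values fed to the synthesizer, and the
# probabilities a discriminator might report for the resulting audio.
requested = [0.0, 0.5, 1.0]
inferred = [0.1, 0.5, 0.7]
loss = property_matching_loss(requested, inferred)
```

A loss of zero would mean the discriminator infers exactly the property values that were requested; larger values mean the synthesized voice drifts from the requested setting.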

First claim

Opening claim text (preview).

1. A computerized process of training a neural speech synthesis model that can generate speech audio conditioned on a value of a voice property, the computerized process comprising: obtaining source samples of speech audio; labeling the source samples with discrete values of a voice property; training, from the source samples and labels, a discriminator that can compute a probability of the voice property from a sample of speech audio; and training the neural speech synthesis model by: synthesizing a multiplicity of synthesized speech samples using the neural speech synthesis model with a multiplicity of values of the voice property to generate synthesized speech samples, computing corresponding probabilities for the synthesized speech samples using the discriminator, and computing a property-learning weight adjustment to the neural speech synthesis model by back-propagating changes to minimize a loss function that depends on differences between values of the voice property and corresponding probabilities.

2. A speech synthesis model obtained by the computerized process of claim 1.

3. The computerized process of claim 1, wherein the synthesizing uses a transcription of source samples, the computerized process further comprising: computing a source-matching weight adjustment by back-propagating changes to minimize a loss function that depends on differences between the source samples and the synthesized speech samples.

4. A speech synthesis model obtained by the computerized process of claim 3.

5. The computerized process of claim 1, wherein the source samples of the speech audio are obtained from one of a person and an audio generation system.

6. The computerized process of claim 1, wherein the voice property includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.

7. A computerized method of synthesizing speech audio, the computerized method comprising: receiving a string of text and at least one voice property value with a perceptible meaning; synthesizing speech audio corresponding to the string of text using a neural speech synthesis model that conditions a sound of speech audio on the at least one voice property value to generate synthesized speech audio; and outputting the synthesized speech audio, wherein the sound of the synthesized speech audio perceptually relates to the at least one voice property value.

8. The computerized method of claim 7, wherein the at least one voice property value includes at least one of a gender voice property, an age voice property, an accent voice property, a timbre voice property, or an attitude voice property.

9. The computerized method of claim 7, further comprising: enabling download of the synthesized speech audio.

10. The computerized method of claim 7, further comprising: enabling playback of the synthesized speech audio.

11. The computerized method of claim 7, wherein the string of text is associated with at least one text tag.

12. The computerized method of claim 7, wherein the string of text indicates dynamically configurable voice parameter values.

13. The computerized method of claim 7, further comprising: providing a graphical user interface that includes one of a text input field or a voice property value input field.

14. A computerized method of configuring a speech synthesizer, the computerized method comprising: receiving at least one voice property value; generating code for execution by a computer, the code implementing a neural network wherein a node in a hidden layer includes, in its summation, a constant term derived from a product of the at least one voice property value and a weight learned from a training process; and outputting the code, wherein the code implements a speech synthesis function within the speech synthesizer.

15. The computerized method of claim 14, wherein the code is in a binary format.

16. The computerized method of claim 14, wherein the at least one voice property value constitutes a voice property vector, the computerized method further comprising: reading at least one stored voice property vector from a brand database; and computing a distance between the at least one stored voice property vector and the voice property vector to generate a computed distance.

17. The computerized method of claim 16, further comprising: determining that the computed distance satisfies a threshold distance; and generating an error message.

18. The computerized method of claim 16, further comprising: determining that the computed distance fails to satisfy a threshold distance; and storing the least one voice property value in the brand database.

19. A computerized method of examining trademarks, the computerized method comprising: receiving a specimen of speech audio with an application for trademark registration; applying a discriminator of a plurality of voice property values to the specimen to compute a voice property vector; computing distances between the voice property vector and other voice property vectors stored in a database; and determining allowability of the application in dependence upon a smallest distance being greater than a threshold.

20. The computerized method of claim 19, wherein computing the distances includes computing a cosine distance between the voice property vector and the other voice property vectors stored in the database.
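
Two of the claimed computations are small enough to sketch. The first function folds frozen voice property values into a hidden node's constant term, as claim 14 describes; the remaining functions compute the cosine distance of claim 20 and apply the smallest-distance test of claim 19. This is an illustration under stated assumptions, not the publication's implementation: the function names, the dictionary standing in for a database of registered voices, the threshold, and all numeric values are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def fold_frozen_properties(bias: float,
                           property_weights: Sequence[float],
                           property_values: Sequence[float]) -> float:
    """Fold frozen voice property inputs into a node's constant term.

    With the property values fixed, each product weight * value is a
    constant, so the node no longer needs the properties as inputs.
    """
    if len(property_weights) != len(property_values):
        raise ValueError("need one learned weight per property value")
    return bias + sum(w * p for w, p in zip(property_weights, property_values))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_registered(candidate: Sequence[float],
                       registered: Dict[str, Sequence[float]]
                       ) -> Tuple[Optional[str], float]:
    """Return the closest registered voice and its distance to the candidate."""
    best_name: Optional[str] = None
    best_distance = math.inf
    for name, vector in registered.items():
        distance = cosine_distance(candidate, vector)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name, best_distance


def is_allowable(candidate: Sequence[float],
                 registered: Dict[str, Sequence[float]],
                 threshold: float) -> bool:
    """Allow a voice only if its smallest distance exceeds the threshold."""
    _, smallest = nearest_registered(candidate, registered)
    return smallest > threshold


# Hypothetical learned weights and frozen property values for one node.
folded_bias = fold_frozen_properties(0.5, [2.0, 1.0], [0.25, 0.5])

# Hypothetical two-property vectors standing in for registered voices.
registered_voices = {"brand_a": [1.0, 0.0], "brand_b": [0.0, 1.0]}
too_close = is_allowable([1.0, 0.0], registered_voices, threshold=0.1)
distinct = is_allowable([1.0, 1.0], registered_voices, threshold=0.1)
```

In this sketch a candidate identical to a registered vector has distance zero and is rejected, while a candidate roughly 0.29 away from both registered vectors clears the 0.1 threshold. A real system would need to choose the threshold and the vector dimensions empirically.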

Assignees

Soundhound Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • Activation functions · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Supervised learning · CPC title

  • Transfer learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021390944A1 cover?
A discriminator trained on labeled samples of speech can compute probabilities of voice properties. A speech synthesis generative neural network that takes in text and continuous scale values of voice properties is trained to synthesize speech audio that the discriminator will infer as matching the values of the input voice properties. Voice parameters can include speaker voice parameters, acce…
Who is the assignee on this patent?
Soundhound Inc
What technology area does this patent fall under?
Primary CPC classification G10L13/047. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 16, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).