US11769481B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11769481-B2 |
| Application number | US-202117496569-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 7, 2021 |
| Priority date | Oct 7, 2021 |
| Publication date | Sep 26, 2023 |
| Grant date | Sep 26, 2023 |
Official abstract text for this publication.
Generation of synthetic speech from an input text sequence may be difficult when durations of individual phonemes forming the input text sequence are unknown. A predominantly parallel process may model speech rhythm as a separate generative distribution such that phoneme duration may be sampled at inference. Additional information such as pitch or energy may also be sampled to provide improved diversity for synthetic speech generation.
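
The idea of sampling phoneme durations at inference, rather than predicting one fixed value per phoneme, can be sketched as follows. This is only an illustration: the abstract does not say which distribution family is used, so the sketch assumes a Gaussian over log-duration, and the function name `sample_durations` and the parameter values are hypothetical.

```python
import math
import random

def sample_durations(log_means, log_stds, rng):
    """Draw one duration per phoneme, in mel frames (at least 1 frame each).

    Assumption: log-duration is Gaussian. The source only says that rhythm is
    modelled as a separate generative distribution that is sampled at inference.
    """
    durations = []
    for mu, sigma in zip(log_means, log_stds):
        frames = math.exp(rng.gauss(mu, sigma))
        durations.append(max(1, round(frames)))
    return durations

# Hypothetical parameters for a three-phoneme input sequence.
log_means = [math.log(5.0), math.log(8.0), math.log(3.0)]
log_stds = [0.3, 0.3, 0.3]

# Two draws for the same text use different random streams, so the
# resulting speech rhythm can differ from one synthesis to the next.
first = sample_durations(log_means, log_stds, random.Random(0))
second = sample_durations(log_means, log_stds, random.Random(1))

# With zero spread the sampler collapses to a deterministic predictor.
fixed = sample_durations([math.log(5.0)], [0.0], random.Random(0))
```

Pitch and energy could be sampled the same way from their own distributions, which is where the diversity described in the abstract would come from.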
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: determining, from a plurality of audio segments, respective phoneme durations, phoneme pitches, and phoneme energies; determining a first alignment between a sequence of text and a total speech duration corresponding to probable locations for the respective phoneme durations; determining a second alignment for an audio segment of synthesized speech based, at least in part, on a first distribution corresponding to the phoneme durations and the first alignment; and generating, for the sequence of text, an audio segment comprising a synthesized recitation of the sequence of text based, at least in part, on the second alignment and at least one of a second distribution corresponding to the phoneme pitches, or a third distribution corresponding to the phoneme energies.
2. The computer-implemented method of claim 1, further comprising: generating a fourth distribution corresponding to one or more properties associated with the synthesized recitation based, at least in part, on the fourth distribution.
3. The computer-implemented method of claim 1, further comprising: applying, to the second alignment, a prior distribution to exclude pairs of phonemes and durations from the plurality of audio segments that are outside of a specified range.
4. The computer-implemented method of claim 3, wherein the prior distribution is cigar-shaped.
5. The computer-implemented method of claim 3, wherein the prior distribution is constructed from a beta-binomial distribution.
6. The computer-implemented method of claim 1, further comprising: determining, from the sequence of text, a plurality of text tokens; and aligning each of the plurality of text tokens to a respective mel frame based, at least in part, on the second alignment.
7. The computer-implemented method of claim 6, wherein the second alignment is based, at least in part, on an L2 distance between the mel frame at a first time and a text phoneme in the sequence of text.
8. The computer-implemented method of claim 1, wherein the synthesized recitation is generative such that a first synthesized recitation is different from a second synthesized recitation, each of the first synthesized recitation and the second synthesized recitation based on the sequence of text.
9. A method, comprising: determining, from a plurality of audio samples including human speech, alignments between text of the plurality of audio samples, a duration of the plurality of audio samples, and at least one of a pitch of the audio samples or an energy of the audio samples; generating an alignment distribution based, at least in part, on the alignments; determining a soft alignment between a first text sequence from the text and mel-frames of the alignment distribution, the soft alignment normalizing probability distributions for the alignments across the duration of the plurality of audio samples; and determining a hard alignment between the first text sequence and the mel-frames of the alignment distribution, the hard alignment concentrating the probability distributions from the soft alignment to a symbol for the alignments across the duration of the plurality of audio samples; determining one or more vectors, based on the hard alignment, corresponding to one or more speaker characteristics; receiving a second text sequence; and generating, based at least in part on the second text sequence and the one or more vectors, a synthetic audio clip corresponding to the second text sequence.
10. The method of claim 9, wherein the alignment is based, at least in part, on an alignment matrix with a beta-binomial distribution.
11. The method of claim 9, wherein an encoder and a decoder for generating the synthetic audio clip operate in parallel.
12. The method of claim 9, further comprising: generating a second synthetic audio clip from the second text sequence, the second synthetic audio clip being different from the first synthetic audio clip.
13. The method of claim 9, further comprising: generating at least one of a phoneme distribution, a pitch distribution, or an energy distribution; and sampling from at least one of the phoneme distribution, the pitch distribution, or the energy distribution.
14. A processor, comprising: one or more processing units to: receive an audio clip of human speech represented as a mel-spectrogram; determine an alignment matrix for the audio clip normalized to a probability distribution; apply, to the alignment matrix, a prior distribution to exclude pairs of phonemes and mel-frames in the audio clip from the alignment matrix; determine, from the alignment matrix, an alignment for a first text sequence within the audio clip and a plurality of mel-frames representing a duration of the audio clip; determine one or more distributions for a pitch and an energy associated with the audio clip; receive a second text sequence; and generate a second audio clip of the second text sequence based, at least in part, on the alignment and the one or more distributions.
15. The processor of claim 14, wherein the one or more processing units are further to implement an encoder and decoder for generating the second audio clip, wherein the encoder and decoder operate in parallel.
16. The processor of claim 14, wherein the one or more processing units are further to: fit, onto the alignment matrix, a beta-binomial distribution.
17. The processor of claim 16, wherein the beta-binomial distribution excludes pairs of phonemes and mel-frames outside of a specified range.
18. The processor of claim 14, wherein the one or more processing units are further to generate a third audio clip from the second text sequence, the second audio clip being different from the third audio clip.
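
The alignment steps named in the claims (a beta-binomial prior, an alignment matrix built from L2 distances and normalized to a probability distribution, and a hard alignment that assigns each mel frame to one symbol) can be sketched as below. This is a minimal illustration, not the claimed implementation: the claims do not give formulas, so the prior parameterization (`omega`, shape parameters tied to the frame index), the use of squared distance inside a softmax, and the Viterbi-style monotonic search are all assumptions, and every function name here is hypothetical.

```python
import math

def beta_binomial_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_beta = lambda x, y: math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def diagonal_prior(n_tokens, n_frames, omega=1.0):
    """prior[t][i] weights the pair (frame t, token i); mass sits near the diagonal.

    Assumed parameterization: a grows and b shrinks as the frame index advances.
    """
    prior = []
    for t in range(1, n_frames + 1):
        a, b = omega * t, omega * (n_frames - t + 1)
        prior.append([beta_binomial_pmf(i, n_tokens - 1, a, b) for i in range(n_tokens)])
    return prior

def soft_alignment(token_vecs, frame_vecs, prior):
    """Per-frame distribution over tokens: exp(-squared L2 distance) times prior."""
    soft = []
    for t, frame in enumerate(frame_vecs):
        scores = [math.exp(-sum((f - x) ** 2 for f, x in zip(frame, tok))) * prior[t][i]
                  for i, tok in enumerate(token_vecs)]
        total = sum(scores)
        soft.append([s / total for s in scores])
    return soft

def hard_alignment(soft):
    """Most likely monotonic path: starts at the first token, ends at the last,
    and advances by at most one token per frame (needs n_frames >= n_tokens)."""
    n_frames, n_tokens = len(soft), len(soft[0])
    neg_inf = float("-inf")
    logp = [[math.log(p) if p > 0 else neg_inf for p in row] for row in soft]
    best = [[neg_inf] * n_tokens for _ in range(n_frames)]
    back = [[0] * n_tokens for _ in range(n_frames)]
    best[0][0] = logp[0][0]
    for t in range(1, n_frames):
        for i in range(n_tokens):
            stay = best[t - 1][i]
            move = best[t - 1][i - 1] if i > 0 else neg_inf
            back[t][i] = i if stay >= move else i - 1
            best[t][i] = max(stay, move) + logp[t][i]
    path = [n_tokens - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy 1-D embeddings: three tokens and six mel frames that drift from 0 to 2.
token_vecs = [[0.0], [1.0], [2.0]]
frame_vecs = [[0.0], [0.1], [1.0], [1.1], [1.9], [2.0]]

prior = diagonal_prior(len(token_vecs), len(frame_vecs))
soft = soft_alignment(token_vecs, frame_vecs, prior)
hard = hard_alignment(soft)                       # -> [0, 0, 1, 1, 2, 2]
durations = [hard.count(i) for i in range(len(token_vecs))]  # -> [2, 2, 2]
```

Counting how many frames the hard alignment assigns to each token gives per-token durations, which is the kind of quantity the duration distribution in the claims would be fitted to. The prior down-weights pairings far from the diagonal, such as the first token with the last frame, so they contribute little to the normalized matrix.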
Architecture of speech synthesisers · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
Pitch control · CPC title
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title