Text-to-speech processing
US-2021097976-A1 · Apr 1, 2021 · US
US12417762B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12417762-B2 |
| Application number | US-202217719543-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 13, 2022 |
| Priority date | Apr 13, 2022 |
| Publication date | Sep 16, 2025 |
| Grant date | Sep 16, 2025 |
Official abstract text for this publication.
A computer-implemented method for generating personalized audio data is disclosed. The computer-implemented method includes receiving user input data, wherein the user input data is at least one of text or audio. The computer-implemented method further includes segmenting the user input data into a set of sentences. The computer-implemented method further includes generating, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence. The computer-implemented method further includes modifying the user input data based, at least in part on, the wave line and pronunciation tag of the voice image.
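The segmentation and voice-image steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The `VoiceImage` class, the function names, the punctuation-based splitting rule, and the `"stress"`/`"plain"` tags are all hypothetical. Per-word volumes are assumed to be measured elsewhere. Following the claims, which state that each word's volume is scaled relative to the other words in the sentence, the sketch normalizes by the loudest word. Inflection points are omitted.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceImage:
    """Hypothetical container for one sentence's voice image."""
    sentence: str
    pronunciation_tags: List[str] = field(default_factory=list)
    wave_line: List[float] = field(default_factory=list)  # one height per word


def segment_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def scale_volumes(volumes: List[float]) -> List[float]:
    """Scale each word's volume relative to the loudest word in the sentence."""
    peak = max(volumes, default=0.0)
    if peak <= 0:
        return [0.0 for _ in volumes]
    return [v / peak for v in volumes]


def build_voice_image(sentence: str, volumes: List[float]) -> VoiceImage:
    """Build a voice image from a sentence and one measured volume per word."""
    words = sentence.split()
    if len(words) != len(volumes):
        raise ValueError("expected one volume value per word")
    heights = scale_volumes(volumes)
    # Assumption: the loudest word is tagged as stressed, all others as plain.
    tags = ["stress" if h == 1.0 else "plain" for h in heights]
    return VoiceImage(sentence, tags, heights)


if __name__ == "__main__":
    for s in segment_sentences("Hello there. How are you?"):
        print(s)
    print(build_voice_image("How are you?", [2.0, 4.0, 1.0]))
```

In this sketch the wave line is simply the list of scaled heights, one per word; a real system would presumably derive it from the audio signal rather than from hand-supplied numbers.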
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for generating personalized audio data, the computer-implemented method comprising: receiving user input data, wherein the user input data is audio; segmenting the user input data into a set of sentences, wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generating, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and a wave line associated with a sentence, and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; storing the voice image in a temporary table; mapping, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modifying the user input data based, at least in part on, the wave line and the pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and training a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.
2. The computer-implemented method of claim 1, wherein modifying the user input data is further based, at least in part, on: mapping the user input data to an emotion; and modifying the user input data based on the mapped emotion.
3. The computer-implemented method of claim 1, further comprising: matching the generated voice image to a previously generated voice image based, at least in part, on the previously generated voice image having a highest degree of similarity to the generated voice image.
4. The computer-implemented method of claim 3, wherein matching the generated voice image to a previously generated voice image is further based, at least in part, on comparing the pronunciation tag and the wave line between the sentences associated with the generated voice image and the sentence associated with the previously generated voice image.
5. The computer-implemented method of claim 1, further comprising: displaying the generated voice image to a user as the modified user input data is played back to the user.
6. The computer-implemented method of claim 1, wherein matching the generated voice image to a previously generated voice image is further based, at least in part, on comparing similar words, sentence content, and emotions between the generated voice image and the previously generated voice image.
7. A computer program product for generating personalized audio data, the computer program product comprising one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the program instructions including instructions to: receive user input data, wherein the user input data is audio; segment the user input data into a set of sentences wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generate, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; store the voice image in a temporary table; map, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modify the user input data based, at least in part on, the wave line and pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and train a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.
8. The computer program product of claim 7, wherein instructions to modify the user input data is further based, at least in part, on instructions to: map the user input data to an emotion; and modify the user input data based on the mapped emotion.
9. The computer program product of claim 7, further comprising instructions to: match the generated voice image to a previously generated voice image based, at least in part, on the previously generated voice image having a highest degree of similarity to the generated voice image.
10. The computer program product of claim 9, wherein the instructions to match the generated voice image to a previously generated voice image is further based, at least in part, on comparing.
11. The computer program product of claim 7, further comprising instructions to: display the generated voice image to a user as the modified user input data is played back to the user.
12. The computer program product of claim 11, wherein the instructions to match the generated voice image to a previously generated voice image is further based, at least in part, on instructions to compare similar words, sentence content, and emotions between the generated voice image and the previously generated voice image.
13. A computer system for generating personalized audio data, comprising: one or more computer processors; one or more computer readable storage media; and computer program instructions, the computer program instructions being stored on the one or more computer readable storage media for execution by the one or more computer processors, the computer program instructions including instructions to: receive user input data, wherein the user input data is audio; segment the user input data into a set of sentences, wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generate, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; store the voice image in a temporary table; map, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modify the user input data based, at least in part on, the wave line and pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and train a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.
14. The computer system of claim 13, wherein the instructions to modify the user input data is further based, at least in part, on instructions to: map the user input data to an emotion; a
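Claims 3 and 4 describe matching a generated voice image to the previously generated voice image with the highest degree of similarity, comparing pronunciation tags and wave lines. The claims do not define a similarity measure, so the Python sketch below uses an assumed one: the average of the fraction of matching tags and one minus the mean absolute wave-line difference, reduced when sentence lengths differ. The tuple representation and both function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation: (pronunciation tags, wave-line heights in 0..1),
# with one tag and one height per word.
VoiceImage = Tuple[Sequence[str], Sequence[float]]


def similarity(a: VoiceImage, b: VoiceImage) -> float:
    """Assumed similarity score in [0, 1]; not defined by the patent text."""
    tags_a, wave_a = a
    tags_b, wave_b = b
    n = min(len(wave_a), len(wave_b))
    if n == 0:
        return 0.0
    # Fraction of word positions whose pronunciation tags agree.
    tag_score = sum(x == y for x, y in zip(tags_a, tags_b)) / n
    # One minus the mean absolute difference between wave-line heights.
    wave_score = 1.0 - sum(abs(x - y) for x, y in zip(wave_a, wave_b)) / n
    # Reduce the score when the two sentences have different word counts.
    length_ratio = n / max(len(wave_a), len(wave_b))
    return length_ratio * (tag_score + wave_score) / 2


def best_match(new: VoiceImage, stored: List[VoiceImage]) -> Optional[int]:
    """Return the index of the most similar stored voice image, or None."""
    if not stored:
        return None
    return max(range(len(stored)), key=lambda i: similarity(new, stored[i]))


if __name__ == "__main__":
    new = (["plain", "stress", "plain"], [0.5, 1.0, 0.25])
    stored = [
        (["stress", "plain", "plain"], [1.0, 0.5, 0.25]),
        (["plain", "stress", "plain"], [0.5, 1.0, 0.5]),
    ]
    print(best_match(new, stored))
```

Claims 6 and 12 additionally mention comparing similar words, sentence content, and emotions; those comparisons are left out of this sketch.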
using straight lines or curves · CPC title
Segmentation; Word boundary detection · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
for estimating an emotional state · CPC title
by displaying time domain information · CPC title