What technology area does this patent fall under?

Primary CPC classification G06T11/23. Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 16, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Speech-to-text voice visualization

US12417762B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12417762-B2
Application number	US-202217719543-A
Country	US
Kind code	B2
Filing date	Apr 13, 2022
Priority date	Apr 13, 2022
Publication date	Sep 16, 2025
Grant date	Sep 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for generating personalized audio data is disclosed. The computer-implemented method includes receiving user input data, wherein the user input data is at least one of text or audio. The computer-implemented method further includes segmenting the user input data into a set of sentences. The computer-implemented method further includes generating, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence. The computer-implemented method further includes modifying the user input data based, at least in part on, the wave line and pronunciation tag of the voice image.

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating personalized audio data, the computer-implemented method comprising: receiving user input data, wherein the user input data is audio; segmenting the user input data into a set of sentences, wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generating, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and a wave line associated with a sentence, and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; storing the voice image in a temporary table; mapping, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modifying the user input data based, at least in part on, the wave line and the pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and training a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.

2. The computer-implemented method of claim 1, wherein modifying the user input data is further based, at least in part, on: mapping the user input data to an emotion; and modifying the user input data based on the mapped emotion.

3. The computer-implemented method of claim 1, further comprising: matching the generated voice image to a previously generated voice image based, at least in part, on the previously generated voice image having a highest degree of similarity to the generated voice image.

4. The computer-implemented method of claim 3, wherein matching the generated voice image to a previously generated voice image is further based, at least in part, on comparing the pronunciation tag and the wave line between the sentences associated with the generated voice image and the sentence associated with the previously generated voice image.

5. The computer-implemented method of claim 1, further comprising: displaying the generated voice image to a user as the modified user input data is played back to the user.

6. The computer-implemented method of claim 1, wherein matching the generated voice image to a previously generated voice image is further based, at least in part, on comparing similar words, sentence content, and emotions between the generated voice image and the previously generated voice image.

7. A computer program product for generating personalized audio data, the computer program product comprising one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the program instructions including instructions to: receive user input data, wherein the user input data is audio; segment the user input data into a set of sentences wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generate, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; store the voice image in a temporary table; map, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modify the user input data based, at least in part on, the wave line and pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and train a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.

8. The computer program product of claim 7, wherein instructions to modify the user input data is further based, at least in part, on instructions to: map the user input data to an emotion; and modify the user input data based on the mapped emotion.

9. The computer program product of claim 7, further comprising instructions to: match the generated voice image to a previously generated voice image based, at least in part, on the previously generated voice image having a highest degree of similarity to the generated voice image.

10. The computer program product of claim 9, wherein the instructions to match the generated voice image to a previously generated voice image is further based, at least in part, on comparing.

11. The computer program product of claim 7, further comprising instructions to: display the generated voice image to a user as the modified user input data is played back to the user.

12. The computer program product of claim 11, wherein the instructions to match the generated voice image to a previously generated voice image is further based, at least in part, on instructions to compare similar words, sentence content, and emotions between the generated voice image and the previously generated voice image.

13. A computer system for generating personalized audio data, comprising: one or more computer processors; one or more computer readable storage media; and computer program instructions, the computer program instructions being stored on the one or more computer readable storage media for execution by the one or more computer processors, the computer program instructions including instructions to: receive user input data, wherein the user input data is audio; segment the user input data into a set of sentences, wherein segmenting the user input data into a set of sentences further includes converting the audio to text; generate, for each sentence in the set of sentences, a voice image, wherein the voice image includes at least one pronunciation tag and wave line associated with a sentence and the wave line is a wavy line used to mark a height of intonation of each word of each sentence in the set of sentences, wherein the voice image is generated based on a volume and inflection point of each word in the sentence, and the volume is scaled to each other word in the sentence; store the voice image in a temporary table; map, based on the wave line and the pronunciation tag, each sentence of the voice image to an emotion; modify the user input data based, at least in part on, the wave line and pronunciation tag of the voice image, wherein the modifying includes generating an audio output based on the mapping, playing the output, and tagging each sentence of the input with the mapped emotion; and train a learning model with the voice image, wherein the training is configured to cause the learning model to generate speech based on the user input data, wherein the audio output includes the mapped emotion in text form.

14. The computer system of claim 13, wherein the instructions to modify the user input data is further based, at least in part, on instructions to: map the user input data to an emotion; a

Assignees

IBM

Inventors

Classifications

G06T11/23 · Primary
using straight lines or curves · CPC title
G10L15/04
Segmentation; Word boundary detection · CPC title
G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
G10L25/63
for estimating an emotional state · CPC title
G10L21/12
by displaying time domain information · CPC title

Patent family

Related publications grouped by family.

View patent family 88308241

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12417762B2 cover?: A computer-implemented method for generating personalized audio data is disclosed. The computer-implemented method includes receiving user input data, wherein the user input data is at least one of text or audio. The computer-implemented method further includes segmenting the user input data into a set of sentences. The computer-implemented method further includes generating, for each sentence …
Who is the assignee on this patent?: IBM