What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date: Aug 1, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and system for creating a prosodic script

US2024257801A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2024257801-A1
Application number	US-202318393575-A
Country	US
Kind code	A1
Filing date	Dec 21, 2023
Priority date	Feb 1, 2023
Publication date	Aug 1, 2024
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, apparatus, and system for creating a script for rendering audio and/or video streams include identifying at least one prosodic speech feature in a received audio stream and/or a received language model, creating a respective prosodic speech symbol for each of the at least one identified prosodic speech features, converting the received audio stream and/or the received language model into a text stream, temporally inserting the created at least one prosodic speech symbol into the text stream, identifying in a received video stream at least one prosodic gesture of at least a portion of a body of a speaker of the received audio stream, creating at least one respective gesture symbol for each of the at least one identified prosodic gestures, and temporally inserting the created at least one gesture symbol into the text stream along with the at least one prosodic speech symbol to create a prosodic script.
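The "temporally inserting" step in the abstract can be pictured as a merge of time-stamped items. The following is a minimal sketch under stated assumptions, not the patent's implementation: the `TimedToken` type, the `build_prosodic_script` function, the `<emph>` and `<nod>` symbol spellings, and the rule that symbols precede a word sharing their timestamp are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedToken:
    """A piece of the script with its start time (hypothetical type)."""
    start: float  # seconds from the start of the stream
    text: str     # a word, or a symbol such as "<emph>" or "<nod>"


def build_prosodic_script(words, speech_symbols, gesture_symbols):
    """Merge timed words and timed symbols into one time-ordered string.

    Assumption: when a symbol and a word share a timestamp, the symbol
    is placed before the word it annotates.
    """
    tagged = (
        [(t.start, 0, t.text) for t in speech_symbols]
        + [(t.start, 1, t.text) for t in gesture_symbols]
        + [(t.start, 2, t.text) for t in words]
    )
    # Sort by time first, then by the priority tag assigned above.
    tagged.sort(key=lambda item: (item[0], item[1]))
    return " ".join(text for _, _, text in tagged)


words = [
    TimedToken(0.0, "I"),
    TimedToken(0.3, "really"),
    TimedToken(0.8, "mean"),
    TimedToken(1.1, "it"),
]
speech = [TimedToken(0.3, "<emph>")]   # emphasis detected on "really"
gestures = [TimedToken(0.8, "<nod>")]  # head nod detected at "mean"

script = build_prosodic_script(words, speech, gestures)
print(script)  # I <emph> really <nod> mean it
```

Keeping words, speech symbols, and gesture symbols as separate timed lists until the final merge mirrors the abstract's order of operations: features are identified and turned into symbols first, and only then aligned with the text stream.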

First claim

Opening claim text (preview).

1. A method for creating a script for rendering audio and/or video streams, comprising: identifying at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotating an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.

2. The method of claim 1, further comprising: converting a received audio stream and/or language model into a text stream to create the associated text stream; and creating the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.

3. The method of claim 1, further comprising rendering the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and comparing prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.

4. The method of claim 1, wherein the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.

5. The method of claim 4, wherein the portion of a body of a speaker comprises a face of the speaker of the audio stream and that at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.

6. The method of claim 1, further comprising; creating a spectrogram of the received audio stream; rendering the spectrogram from the prosodic script to create a predicted spectrogram; and comparing the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.

7. The method of claim 1, wherein the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.

8. An apparatus for creating a script for rendering audio and/or video streams, comprising: a processor; and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: identify at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotate an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.

9. The apparatus of claim 8, wherein the apparatus is further configured to: convert a received audio stream and/or language model into a text stream to create the associated text stream; and create the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.

10. The apparatus of claim 8, wherein the apparatus is further configured to: render the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and compare prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.

11. The apparatus of claim 8, wherein the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.

12. The apparatus of claim 11, wherein the portion of a body of a speaker comprises a face of the speaker of the audio stream and that at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.

13. The apparatus of claim 8, wherein the apparatus is further configured to: create a spectrogram of the received audio stream; render the spectrogram from the prosodic script to create a predicted spectrogram; and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.

14. The apparatus of claim 8, wherein the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.

15. A system for creating a script for rendering audio and/or video streams, comprising: a spectral features module; a gesture features module; a streams to script module; and an apparatus comprising a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: identify, using the spectral features module and/or the gesture features module, at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotate, using the streams to script module, an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.

16. The system of claim 15, further comprising a speech to text module and wherein the apparatus is further configured to: convert, using the speech to text module, a received audio stream and/or a received language model into a text stream to create the associated text stream; and create, using the streams to script module, the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.

17. The system of claim 15, further comprising a rendering module and wherein the apparatus is further configured to: render, using the rendering module, the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and compare, using the rendering module, prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at
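
Claims 6 and 13 describe comparing a predicted spectrogram with the spectrogram of the received audio to determine a loss function for training. The claim text does not say which loss is used, so the sketch below assumes a plain mean squared error purely for illustration. The `spectrogram_mse` function and the representation of a spectrogram as a list of frames of magnitudes are hypothetical.

```python
def spectrogram_mse(predicted, target):
    """Mean squared error between two spectrograms of identical shape.

    Each spectrogram is a list of frames and each frame is a list of
    magnitudes. MSE is an assumed choice; the claims only require that
    the comparison yields a loss.
    """
    if len(predicted) != len(target):
        raise ValueError("spectrograms must have the same number of frames")
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frames must have the same number of bins")
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("spectrograms must not be empty")
    return total / count


# Toy 2-frame, 2-bin spectrograms (made-up numbers, not real audio).
target = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 2.0], [3.0, 2.0]]

loss = spectrogram_mse(predicted, target)
print(loss)  # 1.0
```

A loss of zero would mean the rendered script reproduces the original spectrogram exactly; in a training loop this value would be what the script-creation system is optimized to reduce.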

Assignees

Stanford Res Inst Int

Inventors

Classifications

G10L13/08
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
G10L2021/105
Synthesis of the lips movements from speech, e.g. for talking heads · CPC title
G10L15/26 · Primary
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L15/25
using position of the lips, movement of the lips or face analysis · CPC title
G10L15/063
Training · CPC title

Patent family

Related publications grouped by family.

View patent family 91963785

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024257801A1 cover?: A method, apparatus, and system for creating a script for rendering audio and/or video streams include identifying at least one prosodic speech feature in a received audio stream and/or a received language model, creating a respective prosodic speech symbol for each of the at least one identified prosodic speech features, converting the received audio stream and/or the received language model i…
Who is the assignee on this patent?: Stanford Res Inst Int