Server side hotwording
US2024257801A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024257801-A1 |
| Application number | US-202318393575-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 21, 2023 |
| Priority date | Feb 1, 2023 |
| Publication date | Aug 1, 2024 |
| Grant date | — |
Official abstract text for this publication.
A method, apparatus, and system for creating a script for rendering audio and/or video streams include identifying at least one prosodic speech feature in a received audio stream and/or a received language model, creating a respective prosodic speech symbol for each of the at least one identified prosodic speech features, converting the received audio stream and/or the received language model into a text stream, temporally inserting the created at least one prosodic speech symbol into the text stream, identifying in a received video stream at least one prosodic gesture of at least a portion of a body of a speaker of the received audio stream, creating at least one respective gesture symbol for each of the at least one identified prosodic gestures, and temporally inserting the created at least one gesture symbol into the text stream along with the at least one prosodic speech symbol to create a prosodic script.
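The temporal insertion step described in the abstract can be illustrated with a minimal sketch. It assumes that words, prosodic speech symbols, and gesture symbols have already been extracted with timestamps. The `TimedToken` type, the `build_prosodic_script` function, and the `<emph>` and `<nod>` symbols are hypothetical names chosen for illustration; the publication does not specify a symbol set or data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedToken:
    start: float  # seconds from the start of the stream
    text: str     # a word, or a symbol such as "<emph>" (hypothetical notation)


def build_prosodic_script(words, speech_symbols, gesture_symbols):
    """Merge words and symbols into one time-ordered script string."""
    # Rank each token so that, at equal timestamps, speech symbols come
    # first, gesture symbols second, and the annotated word last.
    tagged = (
        [(t.start, 0, t.text) for t in speech_symbols]
        + [(t.start, 1, t.text) for t in gesture_symbols]
        + [(t.start, 2, t.text) for t in words]
    )
    tagged.sort(key=lambda item: (item[0], item[1]))
    return " ".join(text for _, _, text in tagged)


# Illustrative input only; timestamps and symbols are made up.
words = [
    TimedToken(0.0, "I"),
    TimedToken(0.3, "really"),
    TimedToken(0.8, "mean"),
    TimedToken(1.1, "it"),
]
speech = [TimedToken(0.3, "<emph>")]
gestures = [TimedToken(0.8, "<nod>")]

script = build_prosodic_script(words, speech, gestures)
print(script)
```

Placing a symbol immediately before the word that shares its timestamp is a design choice of this sketch, not something the abstract prescribes.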
Opening claim text (preview).
1. A method for creating a script for rendering audio and/or video streams, comprising: identifying at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotating an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
2. The method of claim 1, further comprising: converting a received audio stream and/or language model into a text stream to create the associated text stream; and creating the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
3. The method of claim 1, further comprising rendering the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and comparing prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
4. The method of claim 1, wherein the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
5. The method of claim 4, wherein the portion of a body of a speaker comprises a face of the speaker of the audio stream and that at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
6. The method of claim 1, further comprising; creating a spectrogram of the received audio stream; rendering the spectrogram from the prosodic script to create a predicted spectrogram; and comparing the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
7. The method of claim 1, wherein the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
8. An apparatus for creating a script for rendering audio and/or video streams, comprising: a processor; and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: identify at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotate an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
9. The apparatus of claim 8, wherein the apparatus is further configured to: convert a received audio stream and/or language model into a text stream to create the associated text stream; and create the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
10. The apparatus of claim 8, wherein the apparatus is further configured to: render the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and compare prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at least one predicted video stream to prosodic speech features of a ground truth audio stream and/or prosodic gestures of a ground truth video stream to determine respective loss functions for training a system to create the prosodic script.
11. The apparatus of claim 8, wherein the prosodic gestures are identified from movement of at least a portion of a body of a speaker of the received audio stream.
12. The apparatus of claim 11, wherein the portion of a body of a speaker comprises a face of the speaker of the audio stream and that at least one prosodic gesture comprises a change in at least a portion of the face of the speaker, including at least one of a head, mouth, forehead, ears, chin, or eyes of the speaker of the received audio stream.
13. The apparatus of claim 8, wherein the apparatus is further configured to: create a spectrogram of the received audio stream; render the spectrogram from the prosodic script to create a predicted spectrogram; and compare the predicted spectrogram to the created spectrogram to determine a loss function for training a system to create the prosodic script.
14. The apparatus of claim 8, wherein the at least one prosodic speech feature comprises at least one of an emphasis, a duration, or a pitch of a temporal portion of the received audio stream.
15. A system for creating a script for rendering audio and/or video streams, comprising: a spectral features module; a gesture features module; a streams to script module; and an apparatus comprising a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: identify, using the spectral features module and/or the gesture features module, at least one prosodic speech feature, in at least one of a received audio stream or a received language model, and/or at least one prosodic gesture in a received video stream; and automatically temporally annotate, using the streams to script module, an associated text stream with at least one prosodic speech symbol created from the identified at least one prosodic speech feature and/or at least one prosodic gesture symbol created from the identified at least prosodic gesture to create a prosodic script, that when rendered, provides an audio stream and/or a video stream comprising the at least one prosodic speech feature and/or the at least one prosodic gesture that are temporally aligned.
16. The system of claim 15, further comprising a speech to text module and wherein the apparatus is further configured to: convert, using the speech to text module, a received audio stream and/or a received language model into a text stream to create the associated text stream; and create, using the streams to script module, the at least one prosodic speech symbol and/or the at least one prosodic gesture symbol using stored, pre-determined symbols.
17. The system of claim 15, further comprising a rendering module and wherein the apparatus is further configured to: render, using the rendering module, the prosodic script to create at least one predicted audio stream and/or at least one predicted video stream; and compare, using the rendering module, prosodic speech features of the at least one predicted audio stream and/or prosodic gestures of the at
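
Claims 6 and 13 compare a predicted spectrogram with the spectrogram of the received audio to determine a loss function, but the claim text does not name a specific loss. The sketch below assumes mean squared error over equally shaped spectrograms purely as an illustration; `spectrogram_mse` is a hypothetical helper, and the spectrograms are plain nested lists of magnitudes.

```python
def spectrogram_mse(predicted, reference):
    """Mean squared error between two spectrograms of identical shape.

    Each spectrogram is a list of frames; each frame is a list of
    per-frequency-bin magnitudes. MSE is an assumed choice of loss.
    """
    if len(predicted) != len(reference) or any(
        len(p_row) != len(r_row) for p_row, r_row in zip(predicted, reference)
    ):
        raise ValueError("spectrograms must have the same shape")

    total = 0.0
    count = 0
    for p_row, r_row in zip(predicted, reference):
        for p, r in zip(p_row, r_row):
            total += (p - r) ** 2
            count += 1

    if count == 0:
        raise ValueError("spectrograms must not be empty")
    return total / count


# Illustrative values only: two frames with two frequency bins each.
reference = [[0.0, 1.0], [2.0, 3.0]]
predicted = [[0.0, 1.0], [2.0, 5.0]]

loss = spectrogram_mse(predicted, reference)
print(loss)
```

In a training loop, a value like this would be minimized so that rendering the prosodic script reproduces the original audio more closely.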
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
Synthesis of the lips movements from speech, e.g. for talking heads · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
using position of the lips, movement of the lips or face analysis · CPC title
Training · CPC title