Speaker conversion for video games
US-11605388-B1 · Mar 14, 2023 · US
US12296265B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12296265-B1 |
| Application number | US-202418407686-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 9, 2024 |
| Priority date | Nov 20, 2020 |
| Publication date | May 13, 2025 |
| Grant date | May 13, 2025 |
Official abstract text for this publication.
This specification describes a computer-implemented method of generating context-dependent speech audio in a video game. The method comprises obtaining contextual information relating to a state of the video game. The contextual information is inputted into a prosody prediction module. The prosody prediction module comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information. Input data comprising the predicted prosodic features and speech content data associated with the state of the video game is inputted into a speech audio generation module. An encoded representation of the speech content data dependent on the predicted prosodic features is generated using one or more encoders of the speech audio generation module. Context-dependent speech audio is generated, based on the encoded representation, using a decoder of the speech audio generation module.
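The data flow in the abstract (contextual information, then predicted prosodic features, then an encoded representation, then decoded audio) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `GameContext` fields, the three prosody values, and the hand-written rules stand in for the trained machine learning model, encoders, and decoder, which the publication does not specify at this level of detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameContext:
    # Hypothetical contextual features; the abstract does not enumerate them.
    score_margin: int   # goals ahead (+) or behind (-)
    seconds_left: int   # time remaining in the match


def predict_prosody(ctx: GameContext) -> List[float]:
    """Stand-in for the trained prosody prediction model.

    Returns [pitch_scale, energy_scale, rate_scale] from a hand-written
    rule, not from a learned model.
    """
    tension = 1.0 / (1.0 + abs(ctx.score_margin))  # close games are tense
    urgency = 1.0 if ctx.seconds_left < 60 else 0.5
    excitement = tension * urgency
    return [1.0 + 0.3 * excitement,
            1.0 + 0.5 * excitement,
            1.0 + 0.2 * excitement]


def encode(speech_content: str, prosody: List[float]) -> List[List[float]]:
    """Stand-in encoder: one vector per character, conditioned on prosody."""
    return [[ord(ch) / 128.0] + prosody for ch in speech_content]


def decode(encoded: List[List[float]]) -> List[float]:
    """Stand-in decoder: collapses each encoded step to one toy 'audio' value."""
    return [content * energy for content, _pitch, energy, _rate in encoded]


def generate_speech(ctx: GameContext, speech_content: str) -> List[float]:
    """Context -> prosody -> encoded representation -> decoded output."""
    prosody = predict_prosody(ctx)
    return decode(encode(speech_content, prosody))


# Same line of commentary, two different game states.
tense = generate_speech(GameContext(score_margin=0, seconds_left=30), "Goal!")
calm = generate_speech(GameContext(score_margin=3, seconds_left=600), "Goal!")
```

The only point of the sketch is that identical speech content yields different output when the game state changes, because the prosodic features are predicted from context before encoding.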
Opening claim text (preview).
The invention claimed is:

1. A computer-implemented method of generating context-dependent speech audio in a video game, the method comprising: enabling, by at least one processor of a computing device, gameplay of the video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game, wherein the in-game event includes an action performed by a character of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio by: inputting the contextual information relating to the current state of the gameplay into a prosody prediction model, wherein the prosody prediction model comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information; generating, by the prosody prediction model, predicted prosodic features from the input contextual information; inputting, into a speech audio generation model, input data comprising: at least the predicted prosodic features; and the speech content data relating to the current state of the gameplay; generating, using one or more encoders of the speech audio generation model, an encoded representation of the speech content data dependent on the predicted prosodic features; decoding, using a decoder of the speech audio generation model, the encoded representation to generate the context-dependent speech audio; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.

2. The computer-implemented method of claim 1, wherein the one or more encoders comprise a prosody encoder configured to generate an encoded representation of the predicted prosodic features, and a speech content encoder configured to generate the encoded representation of the speech content data based on the encoded representation of the predicted prosodic features.

3. The computer-implemented method of claim 1, wherein the video game is a sports video game, wherein obtaining the contextual information relating to the current state of the video game comprises determining contextual information relating to an in-progress match of the sports video game.

4. The computer-implemented method of claim 3, wherein the contextual information relating to the in-progress match of the sports video game comprises determining one or more of: statistics relating to one or more teams playing in the match; statistics relating to one or more players playing in the match; statistics relating to a current status of the match; and the type of sport being played in the match.

5. The computer-implemented method of claim 1, wherein the contextual information includes the speech content data associated with the current state of the video game.

6. The computer-implemented method of claim 1, wherein the input data further comprises speaker identifier data for a speaker of the generated speech audio.

7. A non-transitory computer-readable medium containing instructions, which when executed by one or more processors, causes the one or more processors to perform a method comprising: enabling, by at least one processor of a computing device, gameplay of a video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game, wherein the in-game event includes an action performed by a character of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio by: inputting the contextual information relating to the current state of the gameplay into a prosody prediction model, wherein the prosody prediction model comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information; generating, by the prosody prediction model, predicted prosodic features from the input contextual information; inputting, into a speech audio generation model, input data comprising: at least the predicted prosodic features; and the speech content data relating to the current state of the gameplay; generating, using one or more encoders of the speech audio generation model, an encoded representation of the speech content data dependent on the predicted prosodic features; decoding, using a decoder of the speech audio generation model, the encoded representation to generate the context-dependent speech audio; and causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.

8. The non-transitory computer-readable medium of claim 7, wherein the speech audio generation model includes a synthesizer.

9. The non-transitory computer-readable medium of claim 8, wherein the speech content data comprises a plurality of speech content segments at a plurality of respective time steps and wherein inputting, into the speech audio generation model, the input data comprising the predicted prosodic features and the speech content data comprises generating, as output of a speech content encoder of the synthesizer, a speech content encoding for each time step of one or more time steps of the speech content data.

10. The non-transitory computer-readable medium of claim 9, wherein generating predicted prosodic features comprises generating predicted prosodic features for each time step of the one or more time steps of the speech content data.

11. The non-transitory computer-readable medium of claim 10, wherein inputting, into the speech audio generator, the input data comprising the predicted prosodic features and the speech content data comprises combining, for each time step of the one or more time steps, the speech content encoding and the predicted prosodic features of the time step.

12. A computer-implemented method of generating context-dependent speech audio in a video game, the method comprising: enabling, by at least one processor of a computing device, gameplay of the video game comprising requesting, by the at least one processor of the computing device, video game content from a video game server while a user is playing the video game; determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game; obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay; requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game; generating, by the speech audio generator responsive to the request, the context-dependent speech audio based upon processing
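Claims 9 to 11 describe a per-time-step arrangement: a speech content encoding for each time step, predicted prosodic features for each time step, and a combination of the two at each step. The Python sketch below shows one way this could look. It is illustrative only: the claims say "combining" without naming an operation, so the concatenation used here is an assumption, and `encode_content`, `combine_per_step`, and the toy feature values are hypothetical names and data, not taken from the patent.

```python
from typing import List, Sequence

Vector = List[float]


def encode_content(segments: Sequence[str]) -> List[Vector]:
    """Toy speech content encoder: one encoding per time step.

    A real encoder would be a learned network; this hashes each segment
    into two numbers so the example stays self-contained.
    """
    return [[float(len(seg)), float(sum(map(ord, seg)) % 97)]
            for seg in segments]


def combine_per_step(content: List[Vector],
                     prosody: List[Vector]) -> List[Vector]:
    """Combine content encoding and prosodic features at each time step.

    Concatenation is an assumed choice; addition or a learned projection
    would fit the claim wording equally well.
    """
    if len(content) != len(prosody):
        raise ValueError("need one prosody vector per time step")
    return [c + p for c, p in zip(content, prosody)]


# Three speech content segments and one (made-up) prosody vector for each.
segments = ["g", "oa", "l"]
prosody = [[1.1, 0.9], [1.4, 1.2], [1.0, 1.0]]
combined = combine_per_step(encode_content(segments), prosody)
```

Each entry of `combined` holds the content encoding followed by the prosodic features for the same time step, which is the input a downstream decoder would consume in this sketch.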
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
Sound input; Sound output (speech processing G10L) · CPC title
involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall · CPC title
generating an output signal, e.g. under timing constraints, for spatialization · CPC title
Learning methods · CPC title