Midstream processing of streaming input to generate streaming output

US2025298981A1 · US · A1

Patent metadata
Publication number: US-2025298981-A1
Application number: US-202418614354-A
Country: US
Kind code: A1
Filing date: Mar 22, 2024
Priority date: Mar 22, 2024
Publication date: Sep 25, 2025
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations are described herein for processing a stream of time-varying input data to generate/predict a stream of time-varying output data in real-time or near-real time. In various implementations, while a stream of input frames, such as a stream of audio input frames, is received, audio input frames received up to a current time step may be tokenized (e.g., midstream) to generate a stream of audio input tokens. A Transformer-based causal attention model may be used to predict a stream of audio output tokens, e.g., by iteratively applying the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step. The stream of audio output tokens may be detokenized to generate a stream of audio output frames.
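The loop the abstract describes can be sketched as follows. This is a minimal illustration, not the patented implementation: `tokenize`, `detokenize`, `toy_causal_model`, and the 16-level codebook are hypothetical stand-ins (a real system would use a neural audio codec and a trained Transformer). The sketch only shows the data flow: each input frame is tokenized as it arrives, the model sees the input and output tokens produced so far, and each predicted token is detokenized immediately.

```python
from typing import Iterable, Iterator, List

LEVELS = 16  # hypothetical codebook size, chosen only for illustration


def tokenize(frame: float) -> int:
    """Toy tokenizer: quantize a sample in [-1, 1] to an integer token."""
    clipped = max(-1.0, min(1.0, frame))
    return round((clipped + 1.0) / 2.0 * (LEVELS - 1))


def detokenize(token: int) -> float:
    """Toy detokenizer: map a token back to a sample in [-1, 1]."""
    return token / (LEVELS - 1) * 2.0 - 1.0


def toy_causal_model(context: List[int]) -> int:
    """Stand-in for the causal model.

    It receives every input and output token produced so far and never
    sees future tokens. This toy version only inverts the most recent
    input token, which is the last element of the context.
    """
    return (LEVELS - 1) - context[-1]


def stream(frames: Iterable[float]) -> Iterator[float]:
    """Consume input frames one at a time and yield output frames midstream."""
    mixed: List[int] = []  # interleaved input and output tokens so far
    for frame in frames:
        mixed.append(tokenize(frame))   # tokenize up to the current time step
        out_token = toy_causal_model(mixed)
        mixed.append(out_token)         # output tokens feed later iterations
        yield detokenize(out_token)     # emit before the input stream ends


if __name__ == "__main__":
    for sample in stream([-1.0, 1.0]):
        print(sample)
```

Because `stream` is a generator, an output frame is available after each input frame rather than after the whole input has been received, which is the midstream behaviour the abstract refers to.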

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented using one or more processors and comprising: receiving a stream of audio input frames; while the stream of audio input frames is received, tokenizing audio input frames received up to a current time step to generate a stream of audio input tokens; using a Transformer-based causal attention model to predict a stream of audio output tokens, wherein using the Transformer-based causal attention model comprises iteratively applying the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenizing the stream of audio output tokens to generate a stream of audio output frames.

2. The method of claim 1, further comprising mixing audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens.

3. The method of claim 2, wherein the mixed stream of audio tokens is iteratively processed using the Transformer-based causal attention model.

4. The method of claim 3, wherein the mixing comprises interleaving audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step.

5. The method of claim 3, wherein the Transformer-based causal attention model comprises a decoder-only transformer.

6. The method of claim 1, wherein the Transformer-based causal attention model comprises an encoder transformer and a decoder transformer operably coupled using cross attention.

7. The method of claim 6, wherein the decoder transformer attends to audio output tokens of the stream of audio output tokens.

8. The method of claim 7, wherein the encoder transformer attends to audio input tokens of the stream of audio input tokens.

9. The method of claim 1, wherein the Transformer-based causal attention model uses local attention.

10. The method of claim 9, further comprising adjusting a future context length of the local attention to add a controllable lookahead.

11. The method of claim 1, wherein the stream of audio input tokens includes at least some acoustic input tokens generated using a neural audio codec.

12. The method of claim 11, wherein the stream of audio input tokens further includes at least some semantic input tokens generated using a semantic tokenizer that is trained to capture, in the audio input tokens, semantic features of the audio input frames.

13. The method of claim 12, wherein each of the audio input tokens comprises both an acoustic input token and a semantic input token.

14. The method of claim 13, wherein each of the audio output tokens comprises both a predicted acoustic output token and a predicted semantic output token, and wherein the detokenizing comprises decoding the predicted acoustic output token, without decoding the predicted semantic output token.

15. The method of claim 1, wherein the Transformer-based causal attention model comprises a first model used to process the at least some of the audio input tokens tokenized up to the current time step at least some of the audio output tokens predicted up to the current time step to generate coarse acoustic tokens, and a second model to process the coarse acoustic tokens to generate fine acoustic tokens.

16. The method of claim 12, wherein the semantic features include one or more of: phonetic features of the audio input frames; prosodic features of the audio input frames; melodic features of the audio input frames; or rhythmic features of the audio input frames.

17. The method of claim 1, wherein during each iteration of the Transformer-based causal attention model, the Transformer-based causal attention model is applied to: a current audio state, wherein the current audio state was generated autoregressively based on one or more prior iterations of the Transformer-based causal attention model to prior audio input tokens; and one or more next audio input tokens.

18. A system comprising one or more processors and memory storing instructions that, in response to execution by the one or more processors, cause the one or more processors to: receive a stream of audio input frames; while the stream of audio input frames is received, tokenize audio input frames received up to a current time step to generate a stream of audio input tokens; use a Transformer-based causal attention model to predict a stream of audio output tokens, wherein the instructions to use the Transformer-based causal attention model include instructions to iteratively apply the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenize the stream of audio output tokens to generate a stream of audio output frames.

19. The system of claim 18, further comprising instructions to mix audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens.

20. At least one non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause the one or more processors to: receive a stream of audio input frames; while the stream of audio input frames is received, tokenize audio input frames received up to a current time step to generate a stream of audio input tokens; use a Transformer-based causal attention model to predict a stream of audio output tokens, wherein the instructions to use the Transformer-based causal attention model include instructions to iteratively apply the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenize the stream of audio output tokens to generate a stream of audio output frames.
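Claims 9 and 10 refer to local attention with an adjustable future context length. One common way to express such a constraint is a boolean attention mask, sketched below. The function name `local_attention_mask` and its parameters `past` and `lookahead` are hypothetical and do not come from the patent text; the sketch only illustrates how a window that extends a fixed number of positions backward and a configurable number of positions forward could be built.

```python
from typing import List


def local_attention_mask(seq_len: int, past: int, lookahead: int = 0) -> List[List[bool]]:
    """Build a local attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    that is, when i - past <= j <= i + lookahead. With lookahead == 0 the
    mask is strictly causal: no position can see a later one.
    """
    return [
        [i - past <= j <= i + lookahead for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in local_attention_mask(4, past=1, lookahead=1):
        # Print each row as a string of 1s and 0s for readability.
        print("".join("1" if allowed else "0" for allowed in row))
```

In a streaming setting, a lookahead of k positions generally means the output for position i cannot be computed until k further tokens have arrived, so a larger lookahead trades latency for additional context.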

Assignees

Deepmind Tech Ltd

Inventors

Classifications

  • based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title

  • Generative networks · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025298981A1 cover?
Implementations are described herein for processing a stream of time-varying input data to generate/predict a stream of time-varying output data in real-time or near-real time. In various implementations, while a stream of input frames, such as a stream of audio input frames, is received, audio input frames received up to a current time step may be tokenized (e.g., midstream) to generate a stre…
Who is the assignee on this patent?
Deepmind Tech Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 25, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).