US2025298981A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025298981-A1 |
| Application number | US-202418614354-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 22, 2024 |
| Priority date | Mar 22, 2024 |
| Publication date | Sep 25, 2025 |
| Grant date | — |
Official abstract text for this publication.
Implementations are described herein for processing a stream of time-varying input data to generate/predict a stream of time-varying output data in real-time or near-real time. In various implementations, while a stream of input frames, such as a stream of audio input frames, is received, audio input frames received up to a current time step may be tokenized (e.g., midstream) to generate a stream of audio input tokens. A Transformer-based causal attention model may be used to predict a stream of audio output tokens, e.g., by iteratively applying the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step. The stream of audio output tokens may be detokenized to generate a stream of audio output frames.
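The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: `tokenize`, `detokenize`, `toy_causal_model`, `CODEBOOK_SIZE`, and the `context` parameter are all hypothetical stand-ins, and the "model" is a trivial averaging rule used only to show the data flow of a causal, iterative predictor that sees input tokens and previously predicted output tokens up to the current time step.

```python
from typing import Iterable, Iterator, List

CODEBOOK_SIZE = 16  # hypothetical vocabulary size for the toy tokenizer


def tokenize(frame: float) -> int:
    """Toy tokenizer (assumption): quantize a sample in [-1, 1] to an integer token."""
    clipped = max(-1.0, min(1.0, frame))
    return round((clipped + 1.0) / 2.0 * (CODEBOOK_SIZE - 1))


def detokenize(token: int) -> float:
    """Toy detokenizer (assumption): map a token back to a sample in [-1, 1]."""
    return token / (CODEBOOK_SIZE - 1) * 2.0 - 1.0


def toy_causal_model(input_tokens: List[int], output_tokens: List[int]) -> int:
    """Stand-in for the Transformer-based causal attention model.

    It only ever receives tokens produced up to the current time step,
    which is the property this sketch is meant to illustrate.
    """
    previous = output_tokens[-1] if output_tokens else input_tokens[-1]
    return (input_tokens[-1] + previous) // 2


def stream(frames: Iterable[float], context: int = 8) -> Iterator[float]:
    """Tokenize midstream, predict one output token per step, detokenize."""
    input_tokens: List[int] = []
    output_tokens: List[int] = []
    for frame in frames:
        input_tokens.append(tokenize(frame))
        # Apply the model to "at least some" of the tokens seen so far.
        next_token = toy_causal_model(input_tokens[-context:], output_tokens[-context:])
        output_tokens.append(next_token)
        yield detokenize(next_token)
```

Because `stream` is a generator, each output frame becomes available as soon as its input frame has been consumed, and earlier outputs never change when later input arrives.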
Opening claim text (preview).
What is claimed is:
1. A method implemented using one or more processors and comprising: receiving a stream of audio input frames; while the stream of audio input frames is received, tokenizing audio input frames received up to a current time step to generate a stream of audio input tokens; using a Transformer-based causal attention model to predict a stream of audio output tokens, wherein using the Transformer-based causal attention model comprises iteratively applying the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenizing the stream of audio output tokens to generate a stream of audio output frames.
2. The method of claim 1, further comprising mixing audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens.
3. The method of claim 2, wherein the mixed stream of audio tokens is iteratively processed using the Transformer-based causal attention model.
4. The method of claim 3, wherein the mixing comprises interleaving audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step.
5. The method of claim 3, wherein the Transformer-based causal attention model comprises a decoder-only transformer.
6. The method of claim 1, wherein the Transformer-based causal attention model comprises an encoder transformer and a decoder transformer operably coupled using cross attention.
7. The method of claim 6, wherein the decoder transformer attends to audio output tokens of the stream of audio output tokens.
8. The method of claim 7, wherein the encoder transformer attends to audio input tokens of the stream of audio input tokens.
9. The method of claim 1, wherein the Transformer-based causal attention model uses local attention.
10. The method of claim 9, further comprising adjusting a future context length of the local attention to add a controllable lookahead.
11. The method of claim 1, wherein the stream of audio input tokens includes at least some acoustic input tokens generated using a neural audio codec.
12. The method of claim 11, wherein the stream of audio input tokens further includes at least some semantic input tokens generated using a semantic tokenizer that is trained to capture, in the audio input tokens, semantic features of the audio input frames.
13. The method of claim 12, wherein each of the audio input tokens comprises both an acoustic input token and a semantic input token.
14. The method of claim 13, wherein each of the audio output tokens comprises both a predicted acoustic output token and a predicted semantic output token, and wherein the detokenizing comprises decoding the predicted acoustic output token, without decoding the predicted semantic output token.
15. The method of claim 1, wherein the Transformer-based causal attention model comprises a first model used to process the at least some of the audio input tokens tokenized up to the current time step at least some of the audio output tokens predicted up to the current time step to generate coarse acoustic tokens, and a second model to process the coarse acoustic tokens to generate fine acoustic tokens.
16. The method of claim 12, wherein the semantic features include one or more of: phonetic features of the audio input frames; prosodic features of the audio input frames; melodic features of the audio input frames; or rhythmic features of the audio input frames.
17. The method of claim 1, wherein during each iteration of the Transformer-based causal attention model, the Transformer-based causal attention model is applied to: a current audio state, wherein the current audio state was generated autoregressively based on one or more prior iterations of the Transformer-based causal attention model to prior audio input tokens; and one or more next audio input tokens.
18. A system comprising one or more processors and memory storing instructions that, in response to execution by the one or more processors, cause the one or more processors to: receive a stream of audio input frames; while the stream of audio input frames is received, tokenize audio input frames received up to a current time step to generate a stream of audio input tokens; use a Transformer-based causal attention model to predict a stream of audio output tokens, wherein the instructions to use the Transformer-based causal attention model include instructions to iteratively apply the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenize the stream of audio output tokens to generate a stream of audio output frames.
19. The system of claim 18, further comprising instructions to mix audio input tokens of the stream of audio input tokens with at least some of the audio output tokens predicted up to the current time step to generate a mixed stream of audio tokens.
20. At least one non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause the one or more processors to: receive a stream of audio input frames; while the stream of audio input frames is received, tokenize audio input frames received up to a current time step to generate a stream of audio input tokens; use a Transformer-based causal attention model to predict a stream of audio output tokens, wherein the instructions to use the Transformer-based causal attention model include instructions to iteratively apply the Transformer-based causal attention model to: at least some of the audio input tokens tokenized up to the current time step, and at least some of the audio output tokens predicted up to the current time step; and detokenize the stream of audio output tokens to generate a stream of audio output frames.
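Two of the dependent claims lend themselves to a small illustration: the interleaving of input and output tokens (claim 4) and local attention with an adjustable future context length (claims 9 and 10). The sketch below is one possible reading under stated assumptions, not the claimed implementation. The function names, the `in_0, out_0, in_1, out_1, ...` ordering, and the exact window definition `i - past <= j <= i + lookahead` are all hypothetical choices; the claims do not fix any of them.

```python
from typing import List, Tuple


def interleave(inputs: List[int], outputs: List[int]) -> List[Tuple[str, int]]:
    """Mix two token streams as in_0, out_0, in_1, out_1, ... (assumed ordering).

    The input stream may run ahead of the output stream, so a trailing
    input token without a matching output token is allowed.
    """
    mixed: List[Tuple[str, int]] = []
    for t, token in enumerate(inputs):
        mixed.append(("in", token))
        if t < len(outputs):
            mixed.append(("out", outputs[t]))
    return mixed


def local_attention_mask(n: int, past: int, lookahead: int = 0) -> List[List[bool]]:
    """Build an n x n mask where mask[i][j] is True if position i may attend to j.

    Position i sees at most `past` earlier positions, itself, and at most
    `lookahead` later positions. With lookahead=0 the mask is strictly causal.
    """
    return [
        [(i - past) <= j <= (i + lookahead) for j in range(n)]
        for i in range(n)
    ]
```

For example, `interleave([1, 2, 3], [10, 20])` yields `[("in", 1), ("out", 10), ("in", 2), ("out", 20), ("in", 3)]`. Raising `lookahead` from 0 to 1 lets each position see one future token, which trades one extra step of latency for more context.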
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
Generative networks · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title