Fast Emit Low-latency Streaming ASR with Sequence-level Emission Regularization
US-2022122586-A1 · Apr 21, 2022 · US
US12488798B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-12488798-B1 |
| Application number | US-202217852552-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 29, 2022 |
| Priority date | Jun 8, 2022 |
| Publication date | Dec 2, 2025 |
| Grant date | Dec 2, 2025 |
Official abstract text for this publication.
A first neural network (NN) model may generate labels for training a second NN model. The second NN model may represent instances of a NN model operating on multiple different devices (e.g., decentralized user and/or edge devices). The system may include using a “teacher” model to process data received by one or more of the devices to generate a labeled dataset. The system may use the labeled dataset and a “student” model to calculate gradient data for updating the student model. The student model may be the same or similar to NN model instances operating on the devices. The system may validate the updated student model to determine, for example, whether it exhibits improved performance when processing the newly received data and/or historical data. The system may distribute the validated update to the devices.
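The teacher/student flow in the abstract can be sketched in a few lines. This is a minimal illustration only, not the patented implementation: a one-weight linear regressor stands in for the ASR student model, and the names `PseudoLabel`, `build_labeled_dataset`, `student_gradient`, `aggregate`, and `apply_update` are hypothetical, as are the confidence threshold and learning rate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Features = List[float]


@dataclass
class PseudoLabel:
    features: Features
    label: float        # teacher output used as the training target
    confidence: float   # teacher's confidence in that output


def build_labeled_dataset(
    inputs: Sequence[Features],
    teacher: Callable[[Features], Tuple[float, float]],
    min_confidence: float,
) -> List[PseudoLabel]:
    """Label inputs with the teacher; keep only sufficiently confident outputs."""
    dataset = []
    for x in inputs:
        label, confidence = teacher(x)
        if confidence >= min_confidence:
            dataset.append(PseudoLabel(x, label, confidence))
    return dataset


def student_gradient(weights: Sequence[float], dataset: Sequence[PseudoLabel]) -> List[float]:
    """Mean squared-error gradient of a linear student on the pseudo-labels."""
    grad = [0.0] * len(weights)
    for ex in dataset:
        pred = sum(w * f for w, f in zip(weights, ex.features))
        err = pred - ex.label
        for i, f in enumerate(ex.features):
            grad[i] += 2.0 * err * f / len(dataset)
    return grad


def aggregate(gradients: Sequence[Sequence[float]]) -> List[float]:
    """Average gradients computed from several labeled datasets into one update."""
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]


def apply_update(weights: Sequence[float], update: Sequence[float], lr: float) -> List[float]:
    """Gradient-descent step, as a device might apply a distributed update."""
    return [w - lr * u for w, u in zip(weights, update)]


def toy_teacher(x: Features) -> Tuple[float, float]:
    """Assumed stand-in teacher: target is 2*x[0]; low confidence on large inputs."""
    return 2.0 * x[0], (0.9 if abs(x[0]) < 5.0 else 0.1)


dataset = build_labeled_dataset([[1.0], [10.0]], toy_teacher, min_confidence=0.5)
grad = student_gradient([0.0], dataset)
update = aggregate([grad, [-2.0]])   # second gradient is from another (made-up) dataset
new_weights = apply_update([0.0], update, lr=0.1)
```

In this sketch the low-confidence input is dropped before training, and the update sent out is an average over gradients from more than one labeled dataset. Validation of the updated student is left out here.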
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: receiving, by a system component from a user device separate from the system component, first audio data representing an utterance, wherein the user device receives the first audio data and processes the first audio data using a first neural network model to generate first automatic speech recognition (ASR) data representing a first transcript of the utterance; processing the first audio data to determine first feature data, the first feature data representing normalized log-filterbank energies of frames of the first audio data; processing, by the system component using a second neural network model different from the first neural network model, the first audio data to determine second ASR data representing a second transcript of the utterance; determining, based on at least the second ASR data, to include the first feature data and the second ASR data in a first labeled dataset for updating the first neural network model, the first labeled dataset additionally including second feature data and third ASR data determined using second audio data; determining, by the system component using the first labeled dataset and a third neural network model different from the second neural network model, first gradient data representing gradients calculated for updating the third neural network model using the first labeled dataset, the third neural network model representing a duplicate of the first neural network model; determining, using the first gradient data, first model update data, the first model update data additionally representing second gradient data determined using a second labeled dataset; sending, from the system component to the user device, the first model update data; causing the user device to generate an updated first neural network model using the first model update data; and causing the user device to process third audio data, received by the user device, using the updated first neural network model to generate fourth ASR data.

2. The computer-implemented method of claim 1, further comprising: receiving third data representing a confidence that the second ASR data represents an accurate transcript of the utterance; and determining that the third data satisfies a condition, wherein determining to include the first feature data and the second ASR data in the first labeled dataset is additionally based on determining that the third data satisfies the condition.

3. The computer-implemented method of claim 1, further comprising: processing a third labeled dataset using the third neural network model to determine a first word error rate; determining a fourth neural network model using the third neural network model and the first model update data, the fourth neural network model representing an update of the third neural network model based on the first model update data; processing the third labeled dataset using the fourth neural network model to determine a second word error rate; and determining that the second word error rate is less than the first word error rate, wherein causing the user device to generate the updated first neural network model is based at least in part on determining that the second word error rate is less than the first word error rate.

4. The computer-implemented method of claim 1, wherein the first gradient data is determined using first training parameters, the method further comprising: determining a fourth neural network model using the third neural network model and the first model update data, the fourth neural network model representing an update of the third neural network model based on the first model update data; processing a third labeled dataset using the fourth neural network model to determine a first word error rate; determining, using the first labeled dataset and second training parameters different from the first training parameters, second model update data; determining a fifth neural network model using the third neural network model and the second model update data, the fifth neural network model representing an update of the third neural network model based on the second model update data; processing the third labeled dataset using the fifth neural network model to determine a second word error rate; and determining that the first word error rate is less than the second word error rate, wherein causing the user device to generate an updated first neural network model using the first model update data is based on determining that the first word error rate is less than the second word error rate.

5. A computer-implemented method comprising: receiving, by one or more system components from a user device, first audio data representing an utterance captured using a microphone of the user device, wherein the user device processes the first audio data using a first machine learning model to generate first output data representing a first transcript of the utterance; processing the first audio data using a second machine learning model different from the first machine learning model to determine second output data representing a second transcript of the utterance; determining, based on at least the second output data, to include the second output data in first data representing a portion of a first labeled dataset for updating the first machine learning model; determining, by the one or more system components using the first data and a third machine learning model different from the second machine learning model, second data representing first gradients calculated for updating the third machine learning model using the first labeled dataset; determining, using the second data, first model update data, the first model update data additionally representing second gradients determined using a second labeled dataset; sending, from the one or more system components to the user device, the first model update data; causing the user device to generate an updated first machine learning model using the first model update data; and causing the user device to process second audio data, received by the microphone, using the updated first machine learning model to generate third output data.

6. The computer-implemented method of claim 5, further comprising: receiving third data representing a confidence that the second output data represents an accurate transcript of the utterance; and determining that the third data satisfies a condition, wherein determining to include the second output data in the first data is additionally based on determining that the third data satisfies the condition.

7. The computer-implemented method of claim 5, further comprising: processing a third labeled dataset using the third machine learning model to determine a first performance metric; determining a fourth machine learning model using the third machine learning model and the first model update data, a fourth machine learning model representing an update of the third machine learning model; processing the third labeled dataset using the fourth machine learning model to determine a second performance metric; and determining, using the first performance metric and the second performance metric, to cause the user device to generate an updated first machine learning model using the first model update data.

8. The computer-implemented method of claim 5, wherein the second data is determined using first training parameters, the method further comprising: determining a fourth machine learning model using the third machine learning model and the second data; processing a third labeled dataset using the fourth machine learning model to determine a first performance metric; determining, using the first data a
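Claims 3 and 4 condition deployment on a word error rate (WER) comparison over a held-out labeled dataset. The sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and a simple gate on it. The function names `edit_distance`, `corpus_wer`, and `should_deploy`, and the sample transcripts, are illustrative assumptions and do not come from the patent.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs: Sequence[Tuple[str, str]]) -> float:
    """Total edits divided by total reference words over (reference, hypothesis) pairs."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / max(words, 1)


def should_deploy(baseline_wer: float, candidate_wer: float) -> bool:
    """Gate in the spirit of claim 3: deploy only if the update lowers WER."""
    return candidate_wer < baseline_wer


# Made-up validation transcripts: (reference, model output)
baseline_pairs: List[Tuple[str, str]] = [("turn on the light", "turn on light")]
candidate_pairs: List[Tuple[str, str]] = [("turn on the light", "turn on the light")]

baseline = corpus_wer(baseline_pairs)    # current student model
candidate = corpus_wer(candidate_pairs)  # student model after the update
deploy = should_deploy(baseline, candidate)
```

The same comparison could be reused to choose between updates produced with different training parameters, which is the situation claim 4 describes: compute a WER for each candidate and keep the one with the lower value.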
Non-supervised learning, e.g. competitive learning · CPC title
Learning methods · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title