Server side hotwording
US-2024412734-A1 · Dec 12, 2024 · US
US2025149027A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025149027-A1 |
| Application number | US-202418903676-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 1, 2024 |
| Priority date | Nov 7, 2023 |
| Publication date | May 8, 2025 |
| Grant date | — |
Official abstract text for this publication.
A method of processing speech includes: obtaining a speech input; obtaining an instruction related to the speech input; obtaining a speech representation corresponding to the speech input; obtaining an adapter that includes speech information by fusing a pre-trained adapter with the speech representation; and obtaining a response corresponding to the instruction by inputting both the adapter that includes the speech information and the instruction to a language model, the language model generating the response based on the adapter that includes the speech model and the speech information.
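The order of operations in the abstract can be sketched as a small pipeline. This is a minimal illustration only: `process_speech`, `toy_encoder`, `toy_fuse`, and `toy_lm` are hypothetical names that do not appear in the publication, and the toy stand-ins only mimic the shapes of the data, not any real speech encoder, fusion model, or language model.

```python
from typing import Callable, List

Frames = List[List[float]]  # one feature vector per position


def process_speech(
    speech_input: List[float],
    instruction: str,
    pretrained_adapter: Frames,
    encode_speech: Callable[[List[float]], Frames],
    fuse: Callable[[Frames, Frames], Frames],
    language_model: Callable[[Frames, str], str],
) -> str:
    """Run the abstract's steps in order; every callable is a hypothetical stand-in."""
    # Speech input -> speech representation (length depends on the input).
    speech_representation = encode_speech(speech_input)
    # Pre-trained adapter + speech representation -> adapter with speech information.
    speech_adapter = fuse(pretrained_adapter, speech_representation)
    # Adapter with speech information + instruction -> response.
    return language_model(speech_adapter, instruction)


def toy_encoder(samples: List[float]) -> Frames:
    # Assumption: one 1-dimensional frame per pair of samples, so length varies.
    return [[sum(samples[i:i + 2])] for i in range(0, len(samples), 2)]


def toy_fuse(adapter: Frames, speech: Frames) -> Frames:
    # Assumption: add the mean speech feature to every adapter slot.
    # The output keeps the adapter's length regardless of the speech length.
    mean = sum(frame[0] for frame in speech) / len(speech)
    return [[slot[0] + mean] for slot in adapter]


def toy_lm(adapter: Frames, instruction: str) -> str:
    # Assumption: a placeholder that only reports what it received.
    return f"{instruction} -> {len(adapter)} adapter slots"


print(process_speech(
    [0.1, 0.2, 0.3, 0.4, 0.5],
    "Transcribe the audio",
    [[0.0], [0.0], [0.0]],
    toy_encoder,
    toy_fuse,
    toy_lm,
))
```

Passing the three stages in as callables keeps the sketch neutral about which encoder, fusion model, or language model is used; the abstract does not name specific models.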
Opening claim text (preview).
What is claimed is:

1. A method of processing speech, the method comprising: obtaining a speech input; obtaining an instruction related to the speech input; obtaining a speech representation corresponding to the speech input; obtaining an adapter that includes speech information by fusing a pre-trained adapter with the speech representation; and obtaining a response corresponding to the instruction by inputting both the adapter that includes the speech information and the instruction to a language model.
2. The method of claim 1, wherein the adapter that includes the speech information and the pre-trained adapter have a same length.
3. The method of claim 1, wherein the obtaining of the adapter that includes the speech information comprises: inputting the pre-trained adapter as a query of a multi-head attention; inputting the speech representation as a key-value to the multi-head attention; and determining an output of the multi-head attention to be the adapter that includes the speech information.
4. The method of claim 1, wherein the pre-trained adapter has a fixed-length, and the speech representation has a variable length.
5. The method of claim 1, wherein the speech representation is obtained by inputting the speech input to a speech encoder.
6. The method of claim 1, wherein the response comprises, with respect to the speech input, a speech recognition, a speech emotion recognition, a speaker recognition, a speech translation, or colloquial language understanding related to the speech input.
7. A training method comprising: obtaining a speech input; obtaining an instruction generated based on the speech input and a labeled response; obtaining a speech representation corresponding to the speech input; obtaining an adapter that includes speech information by fusing an adapter and the speech representation; obtaining a response corresponding to the instruction by inputting the adapter that includes the speech information and the instruction to a language model that generates the response corresponding to the instruction; and training the adapter based on the labeled response and the response corresponding to the instruction.
8. The training method of claim 7, wherein the speech representation is obtained by inputting the speech input to a speech encoder that generates the speech representation, the speech representation representing features of the speech input.
9. The training method of claim 8, wherein the language model and the speech encoder are pre-trained models.
10. The training method of claim 7, wherein the adapter that includes the speech information and the adapter fused with the speech representation have a same length.
11. The training method of claim 7, wherein the obtaining of the adapter that includes the speech information comprises: inputting the adapter as a query of multi-head attention; inputting the speech representation as a key-value of the multi-head attention; and determining an output of the multi-head attention to serve as the adapter that includes the speech information.
12. The training method of claim 7, wherein the adapter has a fixed-length, and the speech representation has a variable length.
13. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of claim 1.
14. A device for processing speech, the device comprising: a speech encoder configured to receive a speech input and configured to encode the speech input as a speech representation corresponding to the speech input; a fusion model configured to output an adapter that includes speech information by fusing a pre-trained adapter and the speech representation; and a language model configured to receive an instruction related to the speech input and the adapter including the speech information, the language model configured to output a response corresponding to the instruction.
15. The device of claim 14, wherein the adapter including the speech information and the pre-trained adapter have a same length.
16. The device of claim 14, wherein the fusion model is configured to: receive the pre-trained adapter as a query of a multi-head attention; receive the speech representation as a key-value of the multi-head attention; and output an output of the multi-head attention to the adapter comprising the speech information.
17. The device of claim 14, wherein the pre-trained adapter has a fixed-length, and the speech representation has a variable length.
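Claims 2 to 4 (and their counterparts 10 to 12 and 15 to 17) describe a fusion in which the adapter supplies the queries and the speech representation supplies the keys and values of a multi-head attention, so the output has the adapter's fixed length whatever the speech length. The following is a rough sketch of that idea in dependency-free Python so the shapes are easy to trace. The projection matrices, the random initialisation, the dimensions, and the absence of biases, residual connections, and normalisation are all assumptions made for illustration; the claims do not specify them, and the function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    # Assumption: small Gaussian values stand in for learned parameters.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse(adapter, speech, weights, num_heads):
    """Cross-attention: adapter rows are queries, speech frames are keys and values."""
    w_q, w_k, w_v, w_o = weights
    q = matmul(adapter, w_q)  # (adapter_len x d_model)
    k = matmul(speech, w_k)   # (speech_len  x d_model)
    v = matmul(speech, w_v)   # (speech_len  x d_model)
    d_head = len(w_q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q_h = [row[cols] for row in q]
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        # Scaled dot-product scores: (adapter_len x speech_len).
        scores = matmul(q_h, list(zip(*k_h)))
        attn = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        heads.append(matmul(attn, v_h))  # (adapter_len x d_head)
    # Concatenate the heads per adapter position, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(adapter))]
    return matmul(concat, w_o)


rng = random.Random(0)
d_model, num_heads, adapter_len = 8, 2, 4  # illustrative sizes only
weights = [random_matrix(d_model, d_model, rng) for _ in range(4)]
adapter = random_matrix(adapter_len, d_model, rng)

for speech_len in (5, 13):
    speech = random_matrix(speech_len, d_model, rng)
    fused = fuse(adapter, speech, weights, num_heads)
    print(speech_len, "frames ->", len(fused), "x", len(fused[0]))
```

Both speech lengths produce a 4 x 8 result here, because each output row is a weighted average over the speech frames computed for one adapter position. That is the property claims 2 and 4 rely on: a variable-length speech representation is compressed into an adapter of fixed length.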
Feedback of the input speech · CPC title
Execution procedure of a spoken command · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
for estimating an emotional state · CPC title