Knowledge enhanced spoken dialog system
US-2021343288-A1 · Nov 4, 2021 · US
US12119008B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12119008-B2 |
| Application number | US-202217655441-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 18, 2022 |
| Priority date | Mar 18, 2022 |
| Publication date | Oct 15, 2024 |
| Grant date | Oct 15, 2024 |
Official abstract text for this publication.
Systems, computer-implemented methods, and computer program products to facilitate end to end integration of dialogue history for spoken language understanding are provided. According to an embodiment, a system can comprise a processor that executes components stored in memory. The computer executable components comprise a conversation component that encodes speech-based content of an utterance and text-based content of the utterance into a uniform representation.
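As a rough illustration of what a "uniform representation" could look like in practice, the sketch below treats text and speech embeddings as plain vectors in one shared space and computes the two kinds of cross-modal loss named in claims 7 and 14 (Euclidean and contrastive). The function names, the softmax-style form of the contrastive loss, the `temperature` parameter, and the toy vectors are all assumptions made for illustration; the patent text shown here does not specify any of them.

```python
import math

def euclidean_loss(text_emb, speech_emb):
    """Mean squared Euclidean distance between paired embeddings (assumed form)."""
    assert len(text_emb) == len(speech_emb)
    total = 0.0
    for t, s in zip(text_emb, speech_emb):
        total += sum((ti - si) ** 2 for ti, si in zip(t, s))
    return total / len(text_emb)

def contrastive_loss(text_emb, speech_emb, temperature=1.0):
    """Softmax-style contrastive loss (assumed form): each speech embedding
    should score highest against the text embedding of the same utterance."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    loss = 0.0
    for i, s in enumerate(speech_emb):
        logits = [dot(s, t) / temperature for t in text_emb]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-probability of the true pair
    return loss / len(speech_emb)

# Hypothetical toy embeddings for two utterances.
text = [[1.0, 0.0], [0.0, 1.0]]     # text-encoder (teacher) outputs
aligned = [[0.9, 0.1], [0.1, 0.9]]  # speech outputs close to their paired text
swapped = [[0.1, 0.9], [0.9, 0.1]]  # speech outputs close to the wrong text
```

With these toy values, both losses come out lower for `aligned` than for `swapped`, which is the behavior a training signal that pulls the speech encoder toward the text encoder would need.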
Opening claim text (preview).
What is claimed is:
1. A system comprising: a memory that stores computer executable components; a processor that executes at least one of the computer executable components that: trains a hierarchical conversational neural network model to generate spoken language understandings directly of speech dialogs in an audio modality without converting the speech dialogs to a text modality, wherein the training comprises: encoding, using a text encoder of the hierarchical conversational neural network model, utterances of a training speech dialog in the audio modality, converted into the text modality, into first embeddings in a uniform embedding representation; encoding, using a speech encoder of the hierarchical conversational neural network model, the utterances of the training speech dialog in the audio modality, without being converted into the text modality, into second embeddings in the uniform embedding representation; and training, using the first embeddings of the utterances in the text modality and the second embeddings of the utterances in the audio modality, a conversation encoder of the hierarchical conversational neural network model to generate a spoken language understanding of the training speech dialog in the audio modality without converting the training speech dialog to the text modality.
2. The system of claim 1, wherein the training of the hierarchical conversational neural network model further comprises: training, using the first embeddings of the utterances in the text modality, the speech encoder, to encode utterances of the speech dialogs in the audio modality into the second embeddings in the uniform embedding representation.
3. The system of claim 1, wherein the training of the hierarchical conversational neural network model further comprises: using one or more cross-model loss functions to transfer semantic knowledge from the text encoder to at least one of the speech encoder or the conversation encoder.
4. The system of claim 1, wherein the at least one of the computer executable components further: generates, using the hierarchical conversational neural network model, a spoken language understanding of at least a portion of a speech dialog in the audio modality without converting the speech dialog to the text modality.
5. The system of claim 1, wherein the speech encoder is a student in a student-teacher joint training framework; and wherein the text encoder is a teacher in the student-teacher joint training framework.
6. The system of claim 1, wherein the training of the hierarchical conversational neural network model further comprises dropping one or more speech frames of the training speech dialog during the training based on a hyperparameter.
7. The system of claim 3, wherein the one or more cross-model loss functions comprise at least one of a Euclidean loss function or a Contrastive loss function.
8. A computer-implemented method comprising: training, by a system operatively coupled to a processor, a hierarchical conversational neural network model to generate spoken language understandings directly of speech dialogs in an audio modality without converting the speech dialogs to a text modality, wherein the training comprises: encoding, using a text encoder of the hierarchical conversational neural network model, utterances of a training speech dialog in the audio modality, converted into the text modality, into first embeddings in a uniform embedding representation; encoding, using a speech encoder of the hierarchical conversational neural network model, the utterances of the training speech dialog in the audio modality, without being converted into the text modality, into second embeddings in the uniform embedding representation; and training, using the first embeddings of the utterances in the text modality and the second embeddings of the utterances in the audio modality, a conversation encoder of the hierarchical conversational neural network model to generate a spoken language understanding of the training speech dialog in the audio modality without converting the training speech dialog to the text modality.
9. The computer-implemented method of claim 8, wherein the training of the hierarchical conversational neural network model further comprises: training, using the first embeddings of the utterances in the text modality, the speech encoder, to encode utterances of the speech dialogs in the audio modality into the second embeddings in the uniform embedding representation.
10. The computer-implemented method of claim 8, wherein the training of the hierarchical conversational neural network model further comprises: using one or more cross-model loss functions to transfer semantic knowledge from the text encoder to at least one of the speech encoder or the conversation encoder.
11. The computer-implemented method of claim 8, further comprising: generating, by the system, using the hierarchical conversational neural network model, a spoken language understanding of at least a portion of a speech dialog in the audio modality without converting the speech dialog to the text modality.
12. The computer-implemented method of claim 8, wherein the speech encoder is a student in a student-teacher joint training framework, and the text encoder is a teacher in the student-teacher joint training framework.
13. The computer-implemented method of claim 8, wherein the training of the hierarchical conversational neural network model further comprises: dropping one or more speech frames of the training speech dialog during the training based on a hyperparameter.
14. The computer-implemented method of claim 10, wherein the one or more cross-model loss functions comprise at least one of a Euclidean loss function or a Contrastive loss function.
15. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: train a hierarchical conversational neural network model to generate spoken language understandings directly of speech dialogs in an audio modality without converting the speech dialogs to a text modality, wherein the training comprises: encoding, using a text encoder of the hierarchical conversational neural network model, utterances of a training speech dialog in the audio modality, converted into the text modality, into first embeddings in a uniform embedding representation; encoding, using a speech encoder of the hierarchical conversational neural network model, the utterances of the training speech dialog in the audio modality, without being converted into the text modality, into second embeddings in the uniform embedding representation; and training, using the first embeddings of the utterances in the text modality and the second embeddings of the utterances in the audio modality, a conversation encoder of the hierarchical conversational neural network model to generate a spoken language understanding of the training speech dialog in the audio modality without converting the training speech dialog to the text modality.
16. The computer program product of claim 15, wherein the training of the hierarchical conversational neural network model further comprises: training, using the first embeddings of the utterances in the text modality, the speech encoder, to encode utterances of the speech dialogs in the audio modality into the second embeddings in the uniform embedding representation.
17. The computer program product of claim 15, wherein the training of the hierarchica
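Claims 6 and 13 recite dropping speech frames during training based on a hyperparameter, without saying how the hyperparameter is used. The sketch below is one possible reading, assuming the hyperparameter is a per-frame drop probability applied independently to each frame. The function name `drop_frames`, the `drop_prob` parameter, and the rule of always keeping at least one frame are hypothetical choices for illustration, not details from the patent.

```python
import random

def drop_frames(frames, drop_prob, rng=None):
    """Drop each frame independently with probability drop_prob (assumed scheme).

    Frame order is preserved. At least one frame is kept so that the
    speech encoder never receives an empty utterance.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    kept = [f for f in frames if rng.random() >= drop_prob]
    if not kept and frames:
        # Fall back to a single randomly chosen frame.
        kept = [frames[rng.randrange(len(frames))]]
    return kept

# Hypothetical usage: 100 placeholder frames, half dropped on average.
frames = list(range(100))
unchanged = drop_frames(frames, 0.0, random.Random(0))
thinned = drop_frames(frames, 0.5, random.Random(0))
```

Passing a seeded `random.Random` makes the dropped set reproducible, which can help when comparing training runs that differ only in the drop probability.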
Combinations of networks · CPC title
Speech recognition (G10L17/00 takes precedence) · CPC title
Character encoding · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Transfer learning · CPC title