Automatic speech-based longitudinal emotion and mood recognition for mental health treatment
US-11545173-B2 · Jan 3, 2023 · US
US11854540B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11854540-B2 |
| Application number | US-202117301489-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 5, 2021 |
| Priority date | Jan 8, 2021 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
A device may receive text data, audio data, and video data associated with a user, and may process the received data, with a first model, to determine a stress level of the user. The device may process the received data, with second models, to determine depression levels of the user, and may combine the depression levels to identify an overall depression level. The device may process the received data, with a third model, to determine a continuous affect prediction, and may process the received data, with a fourth model, to determine an emotion of the user. The device may process the received data, with a fifth model, to determine a response to the user, and may utilize a sixth model to determine a context for the response. The device may utilize seventh models to generate contextual conversation data, and may perform actions based on the contextual conversational data.
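The data flow in the abstract can be sketched as a single orchestration function. This is only an illustrative sketch: every name below (`UserInput`, `Assessment`, `run_pipeline`, and all the model parameters) is hypothetical, each model is represented by a plain callable, and the `(arousal, valence)` encoding of the continuous affect prediction is an assumption borrowed from claim 5 rather than something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass(frozen=True)
class UserInput:
    """Hypothetical container for the three received modalities."""
    text: str
    audio: bytes
    video: bytes


@dataclass(frozen=True)
class Assessment:
    """Hypothetical bundle of the signals produced by the first four model stages."""
    stress: float
    depression: float
    affect: Tuple[float, float]  # assumed encoding: (arousal, valence)
    emotion: str


def run_pipeline(
    inp: UserInput,
    *,
    stress_model: Callable[[UserInput], float],
    depression_models: Sequence[Callable[[UserInput], float]],
    combine: Callable[[Sequence[float]], float],
    affect_model: Callable[[UserInput], Tuple[float, float]],
    emotion_model: Callable[[UserInput], str],
    response_model: Callable[[UserInput], str],
    context_model: Callable[[str, Assessment], Any],
    dialog_manager: Callable[[UserInput, str, Any], Any],
) -> Any:
    # Second models: one depression level per model, then combined into one value.
    levels = [model(inp) for model in depression_models]
    assessment = Assessment(
        stress=stress_model(inp),        # first model
        depression=combine(levels),
        affect=affect_model(inp),        # third model
        emotion=emotion_model(inp),      # fourth model
    )
    response = response_model(inp)                    # fifth model
    context = context_model(response, assessment)     # sixth model
    # Seventh models: produce the contextual conversation data.
    return dialog_manager(inp, response, context)
```

Passing the models in as callables keeps the sketch independent of any particular library; the claims name specific model families (for example a support vector machine for stress), which would be plugged in behind these parameters.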
Opening claim text (preview).
What is claimed is:
1. A method, comprising: receiving, by a device and from a user device, text data identifying text input by a user of the user device, audio data identifying audio associated with the user, and video data identifying a video associated with the user; processing, by the device, the text data, the audio data, and the video data, with a support vector machine model, to determine a stress level of the user; processing, by the device, the text data, the audio data, and the video data, with different regression models, to determine a first depression level of the user based on the text data, a second depression level of the user based on the audio data, and a third depression level of the user based on the video data; combining, by the device, the first depression level, the second depression level, and the third depression level to identify an overall depression level of the user; processing, by the device, the text data, the audio data, and the video data, with a deep learning convolutional neural network model, to determine a continuous affect prediction for the user; processing, by the device, the text data, the audio data, and the video data, with a classifier model, to determine an emotion of the user; processing, by the device, the text data, the audio data, and the video data, with a generative pretrained transformer language model, to determine a response to the user; utilizing, by the device, a plug and play language model to determine a context for the response, based on the response, the stress level, the overall depression level, the continuous affect prediction, and the emotion; utilizing, by the device, one or more dialog manager models to generate contextual conversation data, based on the text data, the audio data, the video data, the response, and the context; and performing, by the device, one or more actions based on the contextual conversational data.
2. The method of claim 1, wherein processing the text data, the audio data, and the video data, with the support vector machine model, to determine the stress level of the user comprises: determining a first stress level of the user based on the text input by the user, as provided in the text data; determining a second stress level of the user based on an intonation of a voice of the user, a rhythm of the voice, a pitch of the voice, an intensity of the voice, a loudness of the voice, and a jitter of the voice, as provided in the audio data; determining a third stress level of the user based on a head pose of the user, an eye gaze of the user, and an intensity of a facial muscle contraction of the user, as provided in the video data; and combining the first stress level, the second stress level, and the third stress level to determine the stress level of the user.
3. The method of claim 1, wherein processing the text data, the audio data, and the video data, with the different regression models, to determine the first depression level of the user based on the text data, the second depression level of the user based on the audio data, and the third depression level of the user based on the video data comprises: processing the text data, with a first regression model, to determine the first depression level of the user; processing the audio data, with a second regression model, to determine the second depression level of the user; and processing the video data, with a third regression model, to determine the third depression level of the user.
4. The method of claim 1, wherein combining the first depression level, the second depression level, and the third depression level to identify the overall depression level of the user comprises: assigning a first weight to the first depression level to generate a first weighted depression level; assigning a second weight to the second depression level to generate a second weighted depression level; assigning a third weight to the third depression level to generate a third weighted depression level; and aggregating the first weighted depression level, the second weighted depression level, and the third weighted depression level to identify the overall depression level of the user.
5. The method of claim 1, wherein the continuous affect prediction for the user includes an arousal prediction for the user and a valence prediction for the user.
6. The method of claim 1, wherein the deep learning convolutional neural network model includes a multi-modal sequence-to-sequence model.
7. The method of claim 1, wherein the classifier model includes a random forest classifier model, and wherein the emotion of the user includes one or more of happiness, sadness, anger, surprise, neutral, contempt, fear, or disgust.
8. A device, comprising: one or more memories; and one or more processors, communicatively coupled to the one or more memories, configured to: receive, from a user device, text data identifying text input by a user of the user device, audio data identifying audio associated with the user, and video data identifying a video associated with the user; process the text data, the audio data, and the video data, with a support vector machine model, to determine a stress level of the user; process the text data, the audio data, and the video data, with different regression models, to determine a first depression level of the user based on the text data, a second depression level of the user based on the audio data, and a third depression level of the user based on the video data; assign weights to the first depression level, the second depression level, and the third depression level to generate a first weighted depression level, a second weighted depression level, and a third weighted depression level; aggregate the first weighted depression level, the second weighted depression level, and the third weighted depression level to identify an overall depression level of the user; process the text data, the audio data, and the video data, with a deep learning convolutional neural network model, to determine a continuous affect prediction for the user; process the text data, the audio data, and the video data, with a classifier model, to determine an emotion of the user; process the text data, the audio data, and the video data, with a generative pretrained transformer language model, to determine a response to the user; utilize a plug and play language model to determine a context for the response, based on the response, the stress level, the overall depression level, the continuous affect prediction, and the emotion; utilize one or more dialog manager models to generate contextual conversation data, based on the text data, the audio data, the video data, the response, and the context; and perform one or more actions based on the contextual conversational data.
9. The device of claim 8, wherein the generative pretrained transformer language model includes a sentiment portion that is trained based on an emotion class and by applying a cross-entropy loss to the sentiment portion.
10. The device of claim 8, wherein the plug and play language model includes a language model and an attribute model, and wherein the one or more processors, when utilizing the plug and play language model to determine the context for the response, are configured to: process the response, the stress level, the overall depression level, the continuous affect prediction, and the emotion, with the attribute model, to determine attributes and gradients; perform a forward pass with the language model, of the plug and play language model, to compute a likelihood of the attribute; perform a backward pass with the language model, of the plug and play language model, to update internal lat
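Claims 4 and 8 recite assigning a weight to each per-modality depression level and aggregating the weighted levels. A minimal sketch of one way to do that follows, assuming a normalized weighted average. The claims do not specify the aggregation function, the weight values, or the scale of the levels, so `ModalityLevels`, `overall_depression_level`, and the default weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModalityLevels:
    """Hypothetical per-modality depression levels (claim 3), on any shared scale."""
    text: float
    audio: float
    video: float


def overall_depression_level(
    levels: ModalityLevels,
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # assumed values, not from the patent
) -> float:
    """Weight each modality's level, then aggregate into one overall level.

    Aggregation here is a weighted average (an assumption); dividing by the
    weight total keeps the result on the same scale as the inputs.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")

    w_text, w_audio, w_video = weights
    weighted = (
        w_text * levels.text,    # first weighted depression level
        w_audio * levels.audio,  # second weighted depression level
        w_video * levels.video,  # third weighted depression level
    )
    return sum(weighted) / total
```

The same pattern would apply to combining the three stress levels in claim 2, which likewise leaves the combination method unspecified.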
Supervised learning · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title