Machine learning based emotion prediction and forecasting in conversation

US12380915B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12380915-B2
Application number  US-202218071884-A
Country             US
Kind code           B2
Filing date         Nov 30, 2022
Priority date       Nov 30, 2022
Publication date    Aug 5, 2025
Grant date          Aug 5, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and system for emotion recognition and forecasting are disclosed. The method may include obtaining an audio data of a conversation involving a plurality of speakers and identifying a plurality of turns of the conversation from the plurality of utterances. The method may further include extracting audio embedding features from the plurality of turns, obtaining a plurality of text segments associated with the audio data, extracting text embedding features from the plurality of text segments, obtaining and concatenating speaker embedding features associated with the audio data, obtaining and concatenating a plurality of emotion features corresponding to the plurality of turns. The method further comprises executing a tree-based prediction model to predict emotion features of the plurality of speakers for a subsequent turn of the ongoing conversation based on the audio embedding features, text embedding features, the concatenated speaker embedding features, and the concatenated emotion features.
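The turn identification and feature concatenation steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Utterance` type, the `group_into_turns` and `build_model_input` helpers, the two-speaker turn rule, and the flat concatenation order are all assumptions made for the example, and the embedding values are placeholders rather than outputs of real audio or text models.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # speaker label, e.g. from diarization (hypothetical field)
    start: float   # start time in seconds
    text: str


def group_into_turns(utterances, speakers_per_turn=2):
    """Group time-ordered utterances so a turn spans consecutive speakers.

    Assumption: a turn closes once it already holds `speakers_per_turn`
    speaker segments and the speaker changes again. The last turn may
    be incomplete if the conversation stops mid-turn.
    """
    turns, current, segments, last = [], [], 0, None
    for utt in sorted(utterances, key=lambda u: u.start):
        if utt.speaker != last:
            if segments == speakers_per_turn:
                turns.append(current)
                current, segments = [], 0
            segments += 1
            last = utt.speaker
        current.append(utt)
    if current:
        turns.append(current)
    return turns


def build_model_input(audio_emb, text_emb, speaker_embs, emotion_history):
    """Flatten all feature groups into one vector for the prediction model."""
    speaker_part = [v for emb in speaker_embs for v in emb]          # concatenated speaker embeddings
    emotion_part = [v for feats in emotion_history for v in feats]   # concatenated per-turn emotion features
    return list(audio_emb) + list(text_emb) + speaker_part + emotion_part


conversation = [
    Utterance("A", 0.0, "Hi, I have a problem with my bill."),
    Utterance("A", 2.0, "It is much higher than last month."),
    Utterance("B", 4.0, "I am sorry to hear that, let me check."),
    Utterance("A", 7.0, "Thanks."),
    Utterance("B", 8.0, "I see an extra charge."),
    Utterance("B", 10.0, "I can remove it for you."),
]
turns = group_into_turns(conversation)
features = build_model_input(
    audio_emb=[0.1, 0.2],                        # placeholder audio embedding
    text_emb=[0.3],                              # placeholder text embedding
    speaker_embs=[[1.0], [2.0]],                 # one placeholder embedding per speaker
    emotion_history=[[0.5, 0.6], [0.7, 0.8]],    # placeholder emotion features for two past turns
)
print(len(turns), features)
```

Keeping the emotion features of earlier turns in the input vector is what lets a downstream model see how emotions have moved over the conversation, rather than only the most recent turn.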

First claim

Opening claim text (preview).

What is claimed is:

1. A method for emotion recognition and forecasting in conversations comprising:
    obtaining, with a processor circuitry, an audio data of a conversation involving a plurality of speakers, the audio data comprising a plurality of utterances of the speakers;
    identifying a plurality of turns of the conversation from the plurality of utterances, a turn representing a temporal speech window unit for analyzing emotion features of the speakers;
    extracting, with the processor circuitry, audio embedding features from the plurality of turns;
    obtaining, with the processor circuitry, a plurality of text segments associated with the audio data;
    extracting, with the processor circuitry, text embedding features from the plurality of text segments;
    obtaining, with the processor circuitry, speaker embedding features associated with the audio data;
    concatenating, with the processor circuitry, the speaker embedding features;
    obtaining, with the processor circuitry, a plurality of emotion features corresponding to the plurality of turns, the plurality of emotion features indicating temporal emotion dynamics of the speakers over the turns;
    concatenating, with the processor circuitry, the plurality of emotion features; and
    executing, with the processor circuitry, a tree-based prediction model to predict emotion features of the plurality of speakers for a subsequent turn of the conversation based on the audio embedding features, text embedding features, the concatenated speaker embedding features, and the concatenated emotion features, wherein the tree-based prediction model comprises multiple layers of stacked ensemble models, and wherein the multiple layers of stacked ensemble models comprise a first layer of ensemble models and a second layer of ensemble models, the executing the tree-based prediction model to predict the emotion features of the plurality of speakers for a subsequent turn of the conversation comprises:
        inputting the audio embedding features, the text embedding features, the concatenated speaker embedding features, and the concatenated emotion features to the first layer of ensemble models for each of the first layer of ensemble model to predict intermediate emotion features of the plurality of speakers for the subsequent turn respectively, to obtain first layer emotion prediction results;
        concatenating the first layer emotion prediction results with the audio embedding features, the text embedding features, the concatenated speaker embedding features, and the concatenated emotion features as intermediate concatenated embedding features;
        inputting the intermediated concatenated embedding features to the second layer of ensemble models for each of the second layer of ensemble models to predict intermediate emotion features of the plurality of speakers for the subsequent turn respectively, to obtain second layer emotion prediction results; and
        determining the emotion features of the plurality of speakers for the subsequent turn based on the second layer prediction results.

2. The method of claim 1, where the identifying the plurality of turns from the plurality of utterances comprises: identifying temporal consecutive utterances from the plurality of utterances as a turn such that the turn comprises utterances from at least two consecutive speakers.

3. The method of claim 1, where the extracting the audio embedding features from the plurality of turns comprises: executing an audio-based feature extraction model to extract the audio embedding features from the plurality of turns.

4. The method of claim 1, where the extracting the text embedding features from the plurality of text segments comprises: executing a text-based feature extraction model to extract the text embedding features from the plurality of text segments.

5. The method of claim 1, where the method further comprises: performing speaker diarization on the audio data to generate the plurality of utterances of the speakers and label speaker information to the plurality of utterances; the obtaining the speaker embedding features associated with the audio data comprises: generating speaker embedding features for each of the speakers based on the speaker information labelled to the plurality of utterances.

6. The method of claim 1, where the obtaining the plurality of text segments associated with the audio data comprises: performing speech recognition to convert the audio data into text data; and identifying the plurality of text segments from the text data by turn of the conversation.

7. The method of claim 1, where a layer of the stacked ensemble models comprises a lightweight ensemble model or a deep learning ensemble model.

8. The method of claim 1, the method further comprises: in response to an end of the conversation, generating a conversation analysis summary for the conversation based on the plurality of emotion features.

9. The method of claim 1, where the method further comprises: identifying a stage of the conversation based on the audio data, the stage indicating progress of the conversation; and generating and outputting a conversation recommendation to facilitate the conversation based on the predicted emotion features of the plurality of speakers and the stage of the conversation.

10. The method of claim 9, where the identifying the stage of the conversation based on the audio data comprises: identifying an intent of the conversation based on the audio data; and determine the stage of the conversation based on the intent of the conversation.

11. The method of claim 1, where an emotion feature comprises an emotion state and an emotion attribute, the emotion state comprises angry, frustrated, sad, neutral, happy, or excited, and the emotion attribute comprises valence, activation, or dominance.

12. The method of claim 11, where the method further comprises: performing acoustic analysis on the audio data to obtain emotion attributes for the plurality of turns.

13. A system for emotion recognition and forecasting in conversations, comprising:
    a memory having stored thereon executable instructions;
    a processor circuitry in communication with the memory, the processor circuitry when executing the instructions configured to:
        obtain an audio data of a conversation involving a plurality of speakers, the audio data comprises a plurality of utterances of the speakers;
        identify a plurality of turns of the conversation from the plurality of utterances, a turn representing a temporal speech window unit for analyzing emotion features of the speakers;
        extract audio embedding features from the plurality of turns;
        obtain a plurality of text segments associated with the audio data;
        extract text embedding features from the plurality of text segments;
        obtain speaker embedding features associated with the audio data;
        concatenating the speaker embedding features;
        obtain a plurality of emotion features corresponding to the plurality of turns, the plurality of emotion features indicating temporal emotion dynamics of the speakers over the turns;
        concatenate the plurality of emotion features; and
        execute a tree-based prediction model to predict emotion features of the plurality of speakers for a subsequent turn of the conversation based on the audio embedding features, text embedding features, the concatenated speaker embedding features, and the concatenated emotion features, wherein the tree-based prediction model comprises multiple layers of stacked ensemble models, and wherein the multiple layers of stacked ensemble models comprise a first layer of ensemble models and a second layer of ensemble models, the executing th
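The two-layer stacking recited in claim 1 can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: `StumpEnsemble` is a toy stand-in for a tree-based ensemble (bagged depth-1 trees), `stacked_forecast` is a hypothetical helper, the two input features and the labels are invented, and the final decision (averaging the second layer's scores and taking the top class) is one plausible reading of "based on the second layer prediction results". For brevity each layer is scored on its own training rows; a practical stacking setup would normally use out-of-fold predictions instead.

```python
import random
from collections import Counter


class StumpEnsemble:
    """Toy tree-based ensemble: bagged depth-1 decision trees (stumps)."""

    def __init__(self, n_stumps=15, seed=0):
        self.n_stumps = n_stumps
        self.rng = random.Random(seed)

    @staticmethod
    def _fit_stump(X, y):
        majority = Counter(y).most_common(1)[0][0]
        best = (-1, 0, float("inf"), majority, majority)  # fallback: no split
        for f in range(len(X[0])):
            for t in sorted({row[f] for row in X}):
                left = [lab for row, lab in zip(X, y) if row[f] <= t]
                right = [lab for row, lab in zip(X, y) if row[f] > t]
                if left and right:
                    l_lab, l_n = Counter(left).most_common(1)[0]
                    r_lab, r_n = Counter(right).most_common(1)[0]
                    if l_n + r_n > best[0]:  # rows classified correctly
                        best = (l_n + r_n, f, t, l_lab, r_lab)
        return best[1:]

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.stumps = []
        n = len(X)
        for _ in range(self.n_stumps):
            idx = [self.rng.randrange(n) for _ in range(n)]  # bootstrap sample
            self.stumps.append(
                self._fit_stump([X[i] for i in idx], [y[i] for i in idx]))
        return self

    def predict_scores(self, X):
        """Per row, the fraction of stumps voting for each class."""
        out = []
        for row in X:
            votes = Counter(l if row[f] <= t else r
                            for f, t, l, r in self.stumps)
            out.append([votes[c] / self.n_stumps for c in self.classes])
        return out


def stacked_forecast(layers, X_train, y_train, X_new):
    """Each layer sees the base features plus the previous layer's scores."""
    classes = sorted(set(y_train))
    train_in, new_in = X_train, X_new
    for layer in layers:
        train_scores = [m.fit(train_in, y_train).predict_scores(train_in)
                        for m in layer]
        new_scores = [m.predict_scores(new_in) for m in layer]
        # Concatenate this layer's predictions with the original features.
        train_in = [list(x) + [v for s in train_scores for v in s[i]]
                    for i, x in enumerate(X_train)]
        new_in = [list(x) + [v for s in new_scores for v in s[i]]
                  for i, x in enumerate(X_new)]
    # Final decision: average the last layer's scores, take the top class.
    result = []
    for i in range(len(X_new)):
        avg = [sum(s[i][k] for s in new_scores) / len(new_scores)
               for k in range(len(classes))]
        result.append(classes[avg.index(max(avg))])
    return result


# Invented toy data: [valence, activation] of the latest turn -> next-turn emotion.
X_train = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7],
           [0.9, 0.6], [0.8, 0.5], [0.85, 0.4]]
y_train = ["angry", "angry", "angry", "happy", "happy", "happy"]
layers = [
    [StumpEnsemble(seed=1), StumpEnsemble(seed=2)],  # first layer
    [StumpEnsemble(seed=3), StumpEnsemble(seed=4)],  # second layer
]
forecast = stacked_forecast(layers, X_train, y_train,
                            [[0.05, 0.95], [0.95, 0.45]])
print(forecast)
```

Passing the original features through to the second layer, alongside the first layer's predictions, lets the second layer correct first-layer errors with access to the same evidence instead of relying on the intermediate predictions alone.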

Assignees

Accenture Global Solutions Ltd

Inventors

Classifications

  • characterised by the analysis technique

  • Discourse or dialogue representation

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

  • Ensemble learning

  • Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12380915B2 cover?
A method and system for emotion recognition and forecasting are disclosed. The method may include obtaining an audio data of a conversation involving a plurality of speakers and identifying a plurality of turns of the conversation from the plurality of utterances. The method may further include extracting audio embedding features from the plurality of turns, obtaining a plurality of text segmen…
Who is the assignee on this patent?
Accenture Global Solutions Ltd
What technology area does this patent fall under?
Primary CPC classification G10L25/63. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 5, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).