System(s) and method(s) for causing contextually relevant emoji(s) to be visually rendered for presentation to user(s) in smart dictation

US12346652B2 · US · B2

Patent metadata

Field               Value
Publication number  US-12346652-B2
Application number  US-202217957489-A
Country             US
Kind code           B2
Filing date         Sep 30, 2022
Priority date       Sep 5, 2022
Publication date    Jul 1, 2025
Grant date          Jul 1, 2025

Abstract

Official abstract text for this publication.

Implementations described herein relate to causing emoji(s) that are associated with a given emotion class expressed by a spoken utterance to be visually rendered for presentation to a user at a display of a client device of the user. Processor(s) of the client device may receive audio data that captures the spoken utterance, process the audio data to generate textual data that is predicted to correspond to the spoken utterance, and cause a transcription of the textual data to be visually rendered for presentation to the user via the display. Further, the processor(s) may determine, based on processing the textual data, whether the spoken utterance expresses a given emotion class. In response to determining that the spoken utterance expresses the given emotion class, the processor(s) may cause emoji(s) that are stored in association with the given emotion class to be visually rendered for presentation to the user via the display.
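
The flow in the abstract (transcribed text, then an emotion check, then emojis looked up for that emotion class) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the emotion classes, the emoji tables, and the keyword-based `classify_emotion` stand-in are hypothetical and are not taken from the patent, which describes a classifier without fixing these details. The speech recognition step is omitted; the sketch starts from text that is assumed to have already been transcribed.

```python
from typing import List, Optional

# Hypothetical emotion classes and the emojis "stored in association" with each.
EMOJIS_BY_EMOTION = {
    "happy": ["😀", "🎉"],
    "sad": ["😢", "💔"],
}

# Toy stand-in for an emotion classifier: a few keyword cues per class.
# A real system would use a trained model; this is only for illustration.
CUES_BY_EMOTION = {
    "happy": {"congratulations", "great", "excited"},
    "sad": {"sorry", "miss", "unfortunately"},
}


def classify_emotion(text: str) -> Optional[str]:
    """Return the first emotion class whose cue words appear in the text, else None."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    for emotion, cues in CUES_BY_EMOTION.items():
        if words & cues:
            return emotion
    return None


def suggest_emojis(transcription: str) -> List[str]:
    """Return the emojis to render for the transcription, or an empty list."""
    emotion = classify_emotion(transcription)
    if emotion is None:
        return []
    return EMOJIS_BY_EMOTION.get(emotion, [])


print(suggest_emojis("Congratulations on the new job!"))
print(suggest_emojis("The meeting is at noon."))
```

In this sketch an utterance with no recognized emotion yields no emoji suggestions, mirroring the abstract's "in response to determining that the spoken utterance expresses the given emotion class" condition.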

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, the method comprising: during a dictation session between a user of a client device and an automated assistant executing at least in part at the client device: receiving audio data that captures a spoken utterance of the user of the client device, the audio data being generated by one or more microphones of the client device; processing, using an automatic speech recognition (ASR) model, the audio data that captures the spoken utterance of the user to generate textual data that is predicted to correspond to the spoken utterance; causing a transcription that includes the textual data that is predicted to correspond to the spoken utterance to be visually rendered for presentation to the user via a display of the client device; determining, based on processing the textual data that is predicted to correspond to the spoken utterance, whether the spoken utterance expresses a given emotion class from among a plurality of disparate emotion classes; and in response to determining that the spoken utterance expresses a given emotion class from among the plurality of disparate emotion classes: causing one or more emojis stored in association with the given emotion class to be visually rendered for presentation to the user via the display of the client device, wherein each of the one or more emojis is selectable, wherein a user selection of a given emoji, of the one or more emojis, causes the given emoji to be incorporated into the transcription that includes the textual data that is predicted to correspond to the spoken utterance, and wherein the one or more emojis are rendered in a portion of the display that is visually distinct from the transcription that includes the textual data that is predicted to correspond to the spoken utterance; determining whether the user selection of the given emoji is received; in response to determining the user selection of the given emoji is received: causing the given emoji to be incorporated into the transcription that includes the textual data that is predicted to correspond to the spoken utterance; and in response to determining that no user selection of the given emoji is received: causing one or more commands that are associated with the transcription to supplant the one or more emojis that are rendered in the portion of the display that is visually distinct from the transcription that includes the textual data that is predicted to correspond to the spoken utterance.

2. The method of claim 1, further comprising: in response to determining that the spoken utterance does not express a given emotion class from among the plurality of disparate emotion classes: determining, based on the textual data that is predicted to correspond to the spoken utterance, whether the spoken utterance includes one or more terms that are stored in association with one or more corresponding emojis; and in response to determining that the spoken utterance includes one or more terms that are stored in association with one or more corresponding emojis: causing one or more of the corresponding emojis that are stored in association with one or more of the terms included in the spoken utterance to be visually rendered for presentation to the user via the display of the client device.

3. The method of claim 2, further comprising: in response to determining that the spoken utterance does not include any terms that are stored in association with one or more corresponding emojis: causing one or more commands that are associated with the transcription to be visually rendered for presentation to the user via the display of the client device.

4. The method of claim 1, further comprising: in response to determining that the spoken utterance does not express a given emotion class from among the plurality of disparate emotion classes: causing one or more commands that are associated with the transcription to be visually rendered for presentation to the user via the display of the client device.

5. The method of claim 1, wherein determining whether the spoken utterance expresses a given emotion class, from among a plurality of disparate emotion classes based on processing the textual data that is predicted to correspond to the spoken utterance, comprises: processing, using an emotion classifier, the textual data that is predicted to correspond to the spoken utterance to generate emotion classifier output; determining, based on the emotion classifier output, a confidence value for the given emotion class; and determining whether the spoken utterance expresses the given emotion class based on the confidence value for the given emotion class.

6. The method of claim 5, wherein determining that the spoken utterance expresses the given emotion class is based on the confidence value for the given emotion class satisfying a first threshold confidence value.

7. The method of claim 6, wherein causing the one or more emojis stored in association with the given emotion class to be visually rendered for presentation to the user is based on the confidence value associated with the given emotion class satisfying the first threshold confidence value.

8. The method of claim 7, causing the one or more emojis stored in association with the given emotion class to be visually rendered for presentation to the user is further based on a threshold duration of time lapsing with respect to the user providing the spoken utterance.

9. The method of claim 5, further comprising: processing, using the emotion classifier, the audio data that captures the spoken utterance to generate the emotion classifier output.

10. The method of claim 5, further comprising: in response to determining that the spoken utterance expresses multiple emotion classes from among the plurality of emotion classes: determining, based on the textual data that is predicted to correspond to the spoken utterance, whether the spoken utterance includes one or more terms that are associated with one or more corresponding emojis; and in response to determining that the spoken utterance includes one or more terms that are associated with one or more corresponding emojis: causing one or more of the corresponding emojis that are associated with one or more of the terms included in the spoken utterance to be visually rendered for presentation to the user via the display of the client device.

11. The method of claim 10, wherein the multiple emotion classes include the given emotion class and a given additional emotion class, and wherein determining that the spoken utterance expresses the multiple emotion classes is based on both the confidence value for the given emotion class and an additional confidence value for the given additional emotion class, that is determined based on the emotion classifier output, failing to satisfy a first threshold confidence value, but satisfying a second threshold confidence value.

12. The method of claim 1, wherein the user selection of the given emoji comprises touch input directed to the given emoji.

13. The method of claim 1, wherein the user selection of the given emoji comprises an additional spoken utterance directed to a given emoji reference for the given emoji.

14. The method of claim 13, further comprising: receiving additional audio data that captures the additional spoken utterance of the user, the additional audio data being generated by the one or more microphones of the client device; and processing, using the ASR model, the additional audio data that captures the additional spoken utterance of the user to identify the given emoji refere
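
One way to picture how claims 1 to 3, 6, and 10 fit together is as a fallback cascade: emotion-based emojis when a class is confident enough, term-based emojis otherwise, and transcription commands when neither applies. The Python sketch below is an illustration under stated assumptions, not the patented implementation. The threshold value, the emoji and term tables, the command names, and the function names are all hypothetical. For brevity it treats the "multiple emotion classes" case of claims 10 and 11 the same as the "no emotion class" case of claim 2, so the second threshold is not modeled, and the classifier output is passed in as a plain dictionary of confidence values.

```python
from typing import Dict, List, Optional, Tuple

FIRST_THRESHOLD = 0.8  # assumed value; the claims do not specify one

# Hypothetical lookup tables, not taken from the patent.
EMOJIS_BY_EMOTION = {"happy": ["😀", "🎉"], "sad": ["😢"]}
EMOJIS_BY_TERM = {"pizza": ["🍕"], "beach": ["🌴"]}
COMMANDS = ["send", "clear", "undo"]


def term_emojis(text: str) -> List[str]:
    """Collect emojis stored against individual terms found in the text."""
    found: List[str] = []
    for word in text.lower().split():
        found.extend(EMOJIS_BY_TERM.get(word.strip(".,!?"), []))
    return found


def choose_suggestions(text: str, confidences: Dict[str, float]) -> Tuple[str, List[str]]:
    """Pick what to render in the suggestion area next to the transcription."""
    confident = [c for c, v in confidences.items() if v >= FIRST_THRESHOLD]
    if confident:
        # A class satisfies the first threshold: show its emojis (claims 1, 6, 7).
        best = max(confident, key=confidences.get)
        return "emotion_emojis", EMOJIS_BY_EMOTION.get(best, [])
    # No sufficiently confident class: fall back to term lookup (claims 2, 10).
    emojis = term_emojis(text)
    if emojis:
        return "term_emojis", emojis
    # Nothing matched: show commands for the transcription (claim 3).
    return "commands", list(COMMANDS)


def resolve_selection(transcription: str, selected: Optional[str]) -> Tuple[str, List[str]]:
    """Apply the end of claim 1: insert a selected emoji, else swap in commands."""
    if selected is not None:
        return transcription + " " + selected, []
    return transcription, list(COMMANDS)


print(choose_suggestions("I got the job!", {"happy": 0.93, "sad": 0.02}))
print(choose_suggestions("Pizza at the beach?", {"happy": 0.5, "sad": 0.45}))
print(choose_suggestions("See you at noon.", {"happy": 0.1, "sad": 0.1}))
```

Claim 4 describes a shorter alternative in which commands are shown directly when no emotion class is expressed; that variant would skip the `term_emojis` step in this sketch.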

Assignees

Google LLC

Inventors

Classifications

  • Procedures used during a speech recognition process, e.g. man-machine dialogue

  • Probabilistic grammars, e.g. word n-grams

  • using a touch-screen or digitiser, e.g. input of commands through traced gestures

  • Recognition of textual entities

  • Execution procedure of a spoken command

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12346652B2 cover?
Implementations described herein relate to causing emoji(s) that are associated with a given emotion class expressed by a spoken utterance to be visually rendered for presentation to a user at a display of a client device of the user. Processor(s) of the client device may receive audio data that captures the spoken utterance, process the audio data to generate textual data that is predicted to …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/166. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 1, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).