Automatically generating speech markup language tags for text

US11380300B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11380300-B2
Application number: US-202016777360-A
Country: US
Kind code: B2
Filing date: Jan 30, 2020
Priority date: Oct 11, 2019
Publication date: Jul 5, 2022
Grant date: Jul 5, 2022

Abstract

Official abstract text for this publication.

In particular embodiments, an apparatus comprises a non-transitory computer-readable storage media and a processor coupled to the media executes instructions to: access a plurality of text, generate, using one or more natural language understanding (NLU) models, one or more scores for at least a portion of the plurality of text. The apparatus determines, based on the scores, one or more prosodic values corresponding to the portion of the plurality of text. The apparatus determines, based on the one or more prosodic values, one or more speech synthesis markup language (SSML) tags. The apparatus then generates, based on the prosodic values, SSML-tagged data comprising each determined SSML tag and that tag's location in the plurality of text.
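The flow in the abstract (text, then scores, then prosodic values, then SSML tags with their locations) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy lexicon stands in for the NLU models, the linear score-to-prosody mapping is an assumption, and the names `TOY_LEXICON`, `TaggedSpan`, `score_sentence`, `prosody_from_score`, and `tag_text` are hypothetical. Only the `<speak>` and `<prosody>` elements with their `pitch` and `rate` attributes come from the SSML standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from xml.sax.saxutils import escape

# Hypothetical toy lexicon standing in for a trained NLU model.
TOY_LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "sad": -1.0}


@dataclass
class TaggedSpan:
    start: int       # character offset where the tagged sentence begins
    end: int         # character offset just past the tagged sentence
    open_tag: str    # SSML opening tag chosen for this span
    close_tag: str   # matching SSML closing tag


def score_sentence(sentence: str) -> float:
    """Placeholder score in [-1, 1]; a real system would call an NLU model."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def prosody_from_score(score: float) -> dict[str, int]:
    """Assumed linear mapping: pitch shifts by up to 20%, rate by up to 10%."""
    return {"pitch": round(20 * score), "rate": 100 + round(10 * score)}


def tag_text(text: str, sentences: list[str]) -> tuple[str, list[TaggedSpan]]:
    """Return an SSML string plus each tag's location in the original text."""
    spans: list[TaggedSpan] = []
    parts: list[str] = []
    cursor = 0
    for sentence in sentences:
        start = text.index(sentence, cursor)
        end = start + len(sentence)
        values = prosody_from_score(score_sentence(sentence))
        open_tag = f'<prosody pitch="{values["pitch"]:+d}%" rate="{values["rate"]}%">'
        spans.append(TaggedSpan(start, end, open_tag, "</prosody>"))
        # Escape the sentence so characters such as "&" stay valid XML.
        parts.append(f"{open_tag}{escape(sentence)}</prosody>")
        cursor = end
    return "<speak>" + " ".join(parts) + "</speak>", spans
```

Calling `tag_text` with the full text and a list of its sentences returns both the SSML string and the per-sentence offsets, which mirrors the abstract's pairing of each tag with its location. Sentence splitting is left to the caller here to keep the sketch short.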

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus, comprising: one or more non-transitory computer-readable storage media embodying instructions; and one or more processors coupled to the storage media and configured to execute the instructions to: access a plurality of text; generate, using one or more natural language understanding (NLU) models, a sentiment class score indicative of one or more emotions for at least a portion of the plurality of text and a subjectivity score indicative of subjectivity for at least the portion of the plurality of text; determine, based on the subjectivity score, a rate of change in pitch or rate values for the portion of the plurality of text; determine, based on the sentiment class score and the subjectivity score, one or more prosodic values corresponding to the portion of the plurality of text; determine, based on the one or more prosodic values, one or more speech synthesis markup language (SSML) tags corresponding to the one or more emotions indicated by the sentiment class score; and generate, based on the prosodic values, SSML-tagged data comprising the determined one or more SSML tags and respective tag location in the portion of the plurality of text.

2. The apparatus of claim 1, wherein: the apparatus further comprises a client computing device comprising a speaker; and the one or more processors are further configured to execute the instructions to: access the plurality of text based on a user input received at the client computing device; and initiate transmission of speech output to the speaker, wherein the speech output comprises the plurality of text with instructions to verbalize the portion of the plurality of text according to the SSML-tagged data.

3. The apparatus of claim 1, wherein: the apparatus further comprises a server computing device; and the one or more processors are further configured to execute the instructions to: receive an identification of the portion of the plurality of text based on an input of a user of a client computing device; and transmit the SSML-tagged data to the client computing device.

4. The apparatus of claim 1, wherein: the prosodic values comprise a pitch value and a rate value; and the one or more processors are further configured to execute the instructions to dynamically set minimum and maximum ranges for the pitch value and the rate value based on the subjectivity score.

5. The apparatus of claim 1, wherein the one or more processors are further configured to execute the instructions to: identify in the portion of the plurality of text a plurality of sentences and words; and generate a set of scores including one or more of: the subjectivity score for each sentence of the portion of the plurality of text; a polarity score for each sentence of the portion of the plurality of text; or an importance score for each sentence or each word of the portion of the plurality of text.

6. The apparatus of claim 1, wherein the one or more NLU models comprise a first NLU model configured to: categorize the portion of the plurality of text according to a set of topics; and generate a polarity score and the subjectivity score for each sentence of the portion of the plurality of text.

7. The apparatus of claim 6, wherein the one or more NLU models further comprise a second NLU model configured to generate an importance score for each of a plurality of portions of the plurality of text.

8. The apparatus of claim 7, wherein the plurality of portions of the plurality of text comprise one or more of a sentence, a phrase, or a word in the plurality of text.

9. The apparatus of claim 7, wherein the one or more NLU models further comprise a third NLU model configured to identify as a trending topic one or more words or phrases in the portions of the plurality of text.

10. The apparatus of claim 9, wherein the inflection characteristics comprise at least one of: an upward inflection, a downward inflection, or a circumflex inflection.

11. The apparatus of claim 1, wherein the one or more processors are further configured to execute the instructions to: generate word-level importance scores for words or phrases in the portion of the plurality of text; and determine, based on the word-level importance scores, inflection characteristics for the portion of the plurality of text.

12. The apparatus of claim 1, wherein the one or more prosodic values correspond to one or more of a pitch, a rate of speech, a volume of speech, an amount of emphasis, or a length of a pause.

13. The apparatus of claim 1, wherein to determine the one or more prosodic values based on the sentiment class score, the one or more processors are further configured to execute the instructions to: provide, to a neural network, the portion of the plurality of text and the sentiment class score from the one or more NLU models; and receive, from the neural network, the one or more prosodic values corresponding to the portion of the plurality of text.

14. One or more non-transitory computer-readable storage media embodying instructions that, when executed by one or more processors, cause the one or more processors to: access a plurality of text; generate, using one or more natural language understanding (NLU) models, a sentiment class score indicative of one or more emotions for at least a portion of the plurality of text and a subjectivity score indicative of subjectivity for at least the portion of the plurality of text; determine, based on the subjectivity score, a rate of change in pitch or rate values for the portion of the plurality of text; determine, based on the sentiment class score and the subjectivity score, one or more prosodic values corresponding to the portion of the plurality of text; determine, based on the one or more prosodic values, one or more speech synthesis markup language (SSML) tags corresponding to the one or more emotions indicated by the sentiment class score; and generate, based on the prosodic values, SSML-tagged data comprising the determined one or more SSML tags and respective tag location in the portion of the plurality of text.

15. The non-transitory computer-readable storage media of claim 14, wherein the instructions further comprise instructions to: access the plurality of text based on a user input received at the client computing device; and initiate transmission of speech output to the speaker, wherein the speech output comprises the plurality of text with instructions to verbalize the portion of the plurality of text according to the SSML-tagged data.

16. A method performed by one or more processors of a computing system, comprising: accessing a plurality of text; generating, using one or more natural language understanding (NLU) models a sentiment class score indicative of one or more emotions for at least a portion of the plurality of text and a subjectivity score indicative of subjectivity for at least the portion of the plurality of text; determine, based on the subjectivity score, a rate of change in pitch or rate values for the portion of the plurality of text; determining, based on the sentiment class score and the subjectivity score, one or more prosodic values corresponding to the portion of the plurality of text; determining, based on the one or more prosodic values, one or more speech synthesis markup language (SSML) tags corresponding to the one or more emotions indicated by the sentiment class score; and generating, based on the prosodic values, SSML-tagged data comprising the determined one or more SSML tags and respective tag in the portion of the plur
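Claim 4 describes setting minimum and maximum ranges for the pitch and rate values from the subjectivity score. One way this might look is sketched below, assuming a subjectivity score in [0, 1] and a range that widens linearly with it. The claims do not state any formula, so the function names `prosody_range` and `clamp_to_range`, the `max_spread` parameter, and its default of 30 percent are all hypothetical.

```python
from __future__ import annotations


def prosody_range(subjectivity: float, max_spread: float = 30.0) -> tuple[float, float]:
    """Return (minimum, maximum) percent offsets allowed for pitch or rate.

    Assumption: more subjective text gets a wider expressive range, and the
    width grows linearly with the subjectivity score.
    """
    clipped = min(max(subjectivity, 0.0), 1.0)  # keep the score inside [0, 1]
    spread = max_spread * clipped
    return (-spread, spread)


def clamp_to_range(value: float, bounds: tuple[float, float]) -> float:
    """Limit a proposed pitch or rate offset to the allowed range."""
    low, high = bounds
    return max(low, min(high, value))
```

Under these assumptions a fully objective sentence (score 0) keeps neutral prosody, while a score of 0.5 permits offsets of up to 15 percent in either direction, so a proposed offset of 25 percent would be clamped to 15.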

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

  • Architecture of speech synthesisers · CPC title

  • Pitch control · CPC title

  • G10L13/10 · Primary

    Prosody rules derived from text; Stress or intonation · CPC title

  • Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD] · CPC title

  • G10L13/08 · Primary

    Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11380300B2 cover?
In particular embodiments, an apparatus comprises a non-transitory computer-readable storage media and a processor coupled to the media executes instructions to: access a plurality of text, generate, using one or more natural language understanding (NLU) models, one or more scores for at least a portion of the plurality of text. The apparatus determines, based on the scores, one or more prosodi…
Who is the assignee on this patent?
Samsung Electronics Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 5, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).