Automatically associating context-based sounds with text

US11727913B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11727913-B2
Application number  US-201916725716-A
Country             US
Kind code           B2
Filing date         Dec 23, 2019
Priority date       Dec 23, 2019
Publication date    Aug 15, 2023
Grant date          Aug 15, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A sound association system identifies one or more aurally active words in digital text. Aurally active words refer to words that denote particular sounds. Context-based sounds corresponding to the one or more aurally active words are also identified. Each context-based sound is anchored to or associated with the corresponding one or more aurally active words and is played back when the digital text is played back or read, providing context-based background sounds associated with the one or more aurally active words. For example, a context-based sound can be played back at a higher volume when the one or more aurally active words are played back or read, and at a lower volume when other words of the digital text are played back or read.
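
The volume behavior described in the abstract can be illustrated with a minimal sketch. The function name `gain_per_word`, the gain values 1.0 and 0.3, and the sample sentence are all assumptions made for illustration; the abstract only states that the sound is louder while an aurally active word is read and quieter otherwise.

```python
def gain_per_word(words, active_indices, high_gain=1.0, low_gain=0.3):
    """Return one background-sound gain per word of the text.

    Words whose index is in active_indices (the aurally active words) get
    high_gain; every other word gets low_gain. Both gain values are
    illustrative assumptions, not figures from the patent.
    """
    active = set(active_indices)
    return [high_gain if i in active else low_gain for i in range(len(words))]


# Hypothetical example: "thunder" (index 1) is treated as the aurally active word.
words = "The thunder rolled over the hills".split()
gains = gain_per_word(words, active_indices=[1])
for word, gain in zip(words, gains):
    print(f"{word:8s} {gain:.1f}")
```

A player could apply each gain to the context-based sound for as long as the matching word is being read or spoken.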

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by at least one processing device, the method comprising: receiving digital text; automatically identifying, using a text classification module of a multimodal classification module trained based on texts and sounds, an aurally active word in the digital text; automatically identifying multiple context-based sounds corresponding to the aurally active word in the digital text using a sound classification module implemented in a deep neural network trained to identify one or more sound tags; identifying multiple context-based sound identifiers based on the one or more sound tags, each context-based sound identifier being associated with one of the multiple context-based sounds; displaying the digital text and the multiple context-based sound identifiers; receiving user selection of a context-based sound of the multiple context-based sounds; and presenting the digital text concurrently with the context-based sound including audibly outputting the context-based sound at a higher volume during a time that the aurally active word is determined to be read than during times that the aurally active word is not determined to be read, wherein the higher volume of the context-based sound is based on an attention weight generated by the text classification module for the aurally active word to which the context-based sound is anchored.

2. The method of claim 1, wherein presenting the digital text concurrently with the context-based sound comprises audibly outputting the digital text concurrently with the context-based sound.

3. The method of claim 2, wherein audibly outputting the digital text concurrently with the context-based sound comprises audibly outputting the context-based sound at a higher volume during a time that the aurally active word is determined to be read than during times that the aurally active word is not determined to be read.

4. The method of claim 1, each of the multiple context-based sounds corresponding to one of multiple sound tags, the automatically identifying the aurally active words comprising: identifying attention weights generated by an attention layer of a text classification module of the multimodal classification module, the attention weights indicating how much each of the words in the digital text contributes to generation of a sound tag of the multiple sound tags by the text classification module; and identifying, as the aurally active word, a word in the digital text having a highest attention weight.

5. The method of claim 1, further comprising automatically identifying the multiple context-based sounds corresponding to the aurally active word by: generating a first set of tags by identifying a particular number of tags having a highest probability identified by an additional classification module of being ones of the multiple context-based sounds; generating, for each of multiple sounds, a second set of tags by identifying a particular number of tags having the highest probability identified by a sound classification module of being ones of the multiple context-based sounds; generating, for each of the second set of tags, a similarity score between the first set of tags and the second set of tags, the similarity score indicating the similarity of the first set of tags to the second set of tags; and selecting a particular number of sounds associated with the first set of tags having the second sets of tags with highest similarity scores with the first set of tags.

6. The method of claim 1, further comprising automatically identifying the multiple context-based sounds corresponding to the aurally active word by: generating, by a text classification module of the multimodal classification module, a first embedding in an embedding space corresponding to the digital text; obtaining, for each of multiple sounds, a second embedding in the embedding space having been generated by a sound classification module of the multimodal classification module; determining, for each of the multiple sounds, a divergence score indicating a divergence of the first embedding and the second embedding; and selecting a particular number of sounds having a smallest divergence score.

7. The method of claim 6, further comprising concatenating the first embedding and the second embedding.

8. The method of claim 1, wherein the text classification module generates a first embedding for the digital text in an embedding space and the sound classification module generates a second embedding for each of multiple sounds in a sound database, and wherein an additional classification module generates a probability of the digital text corresponding to each of the multiple sounds in the sound database.

9. The method of claim 1, further comprising clipping a duration of the context-based sound if the duration of the context-based sound is greater than a threshold duration.

10. The method of claim 1, further comprising repeating an audible output of the context-based sound if a duration of the context-based sound is below a threshold duration.

11. A method implemented by at least one processing device, the method comprising: training a text classification module to identify a probability of a text input corresponding to each of multiple sound tags by minimizing a first loss function between sound tags identified by the text classification module for training data texts and training labels for the training data texts, each sound tag corresponding to a context-based sound associated with an aurally active word or phrase; training a sound classification module implemented in a deep neural network to identify a probability of each of multiple context-based sounds corresponding to each of the multiple sound tags by minimizing the first loss function between sound tags identified by the sound classification module for training data sounds and training labels for the training data sounds; providing an output of the text classification module and an output of the sound classification module to an additional classification module, the output of the text classification module comprising a first embedding for the text input in an embedding space rather than the probability of the text input corresponding to each of the multiple sound tags, the output of the sound classification module comprising a second embedding for a sound input in the embedding space rather than the probability of each of multiple context-based sounds corresponding to each of the multiple sound tags; and training the additional classification module, with the first embedding and the second embedding being the inputs to the additional classification module, to identify a probability of the text input corresponding to each of the multiple sound tags by minimizing a combination of a first loss to classify the text input correctly and a second loss to quantify a difference between the first embedding and the second embedding.

12. The method of claim 11, further comprising initially training the text classification module and the sound classification module and then training the additional classification module.

13. The method of claim 12, further comprising: training, concurrently with training the additional classification module, the text classification module and the sound classification module by minimizing the combination of the first loss and the second loss for sound tags generated by the text classification module and by minimizing the combination of the first loss and the second loss for sound tags generated by the sound classification module.

14. The method of claim 11, wherein the text input comprises a sentence.
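
Three of the claimed steps are simple enough to sketch in a few lines: choosing the aurally active word by highest attention weight (claim 4), ranking sounds by embedding divergence (claim 6), and clipping or repeating a sound against a threshold duration (claims 9 and 10). This is an illustrative sketch only, not the patented implementation. All function names and sample values are hypothetical, Euclidean distance stands in for the divergence score because the claims do not name a measure, and repeating a short sound until it reaches the threshold is likewise an assumption.

```python
import math


def most_active_word(words, attention_weights):
    """Claim 4 sketch: return the word with the highest attention weight."""
    if len(words) != len(attention_weights):
        raise ValueError("one attention weight is needed per word")
    best = max(range(len(words)), key=lambda i: attention_weights[i])
    return words[best], attention_weights[best]


def closest_sounds(text_embedding, sound_embeddings, k):
    """Claim 6 sketch: keep the k sounds with the smallest divergence score.

    Euclidean distance is an assumed divergence measure.
    sound_embeddings maps a sound name to its embedding vector.
    """
    scored = sorted(
        (math.dist(text_embedding, embedding), name)
        for name, embedding in sound_embeddings.items()
    )
    return [name for _, name in scored[:k]]


def fit_duration(samples, threshold):
    """Claims 9 and 10 sketch: clip a long sound, repeat a short one.

    Durations are measured in samples. A sound longer than the threshold is
    clipped to it; a shorter sound is looped until it reaches the threshold
    (looping up to the threshold is an assumption).
    """
    if not samples:
        return []
    if len(samples) > threshold:
        return samples[:threshold]
    repeats = -(-threshold // len(samples))  # ceiling division
    return (samples * repeats)[:threshold]


# Hypothetical inputs, invented for illustration.
word, weight = most_active_word(
    ["rain", "fell", "on", "the", "roof"], [0.55, 0.10, 0.05, 0.05, 0.25]
)
candidates = closest_sounds(
    (0.0, 1.0),
    {"rain": (0.1, 0.9), "thunder": (0.9, 0.2), "drizzle": (0.0, 0.7)},
    k=2,
)
print(word, weight, candidates)
```

In claim 1 the attention weight of the selected word also sets how much louder the anchored sound is played, so the weight returned by `most_active_word` could feed directly into a gain calculation.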

Assignees

Adobe Inc

Inventors

Classifications

  • G10L13/00 (Primary)

    Speech synthesis; Text to speech systems · CPC title

  • G06F3/0482 (Primary)

    Interaction with lists of selectable items, e.g. menus · CPC title

  • Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title

  • Browsing; Visualisation therefor · CPC title

  • Architecture of speech synthesisers · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11727913B2 cover?
A sound association system identifies one or more aurally active words in digital text. Aurally active words refer to words that denote particular sounds. Context-based sounds corresponding to the one or more aurally active words are also identified. Each context-based sound is anchored to or associated with the corresponding one or more aurally active words and is played back when the digital …
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G10L13/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 15, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).