System and method for low-latency web-based text-to-speech without plugins

US9240180B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9240180-B2
Application number  US-201113308860-A
Country             US
Kind code           B2
Filing date         Dec 1, 2011
Priority date       Dec 1, 2011
Publication date    Jan 19, 2016
Grant date          Jan 19, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
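
The caching behavior described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the hash-based identifier, the `AudioCache` class, and the `fake_tts` placeholder synthesizer are all assumptions introduced for illustration, since the abstract says only that cached files are indexed by a unique identifier.

```python
import hashlib
from typing import Callable, Dict


def phrase_key(text: str, voice: str) -> str:
    """Derive a cache identifier from voice and text (assumed scheme, not from the patent)."""
    digest = hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()
    return digest[:16]


class AudioCache:
    """In-memory stand-in for the web server's audio cache (hypothetical)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._files: Dict[str, bytes] = {}
        self.synth_calls = 0  # counts how often the TTS backend is actually invoked

    def get_audio(self, text: str, voice: str = "default") -> bytes:
        key = phrase_key(text, voice)
        if key not in self._files:
            # Cache miss: synthesize once and store under the identifier.
            self.synth_calls += 1
            self._files[key] = self._synthesize(text, voice)
        # Cache hit (or freshly stored): no re-synthesis needed.
        return self._files[key]


def fake_tts(text: str, voice: str) -> bytes:
    """Placeholder synthesizer: returns tagged bytes instead of real audio."""
    return f"<{voice}:{text}>".encode("utf-8")


cache = AudioCache(fake_tts)
first = cache.get_audio("Good morning,")
second = cache.get_audio("Good morning,")  # identical text: served from the cache
```

In this sketch the second request for identical text returns the stored bytes without calling the synthesizer again, which is the latency saving the abstract describes.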

First claim

Opening claim text (preview).

We claim:

1. A method comprising:
    receiving, from a client, text associated with a request for text-to-speech synthesis;
    performing, via a processor of a computing device, an analysis of the text to identify a plurality of intonational phrases in the text, wherein a size of the text being analyzed is based on a network latency;
    generating, via the processor, a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice, wherein the first text-to-speech voice is selected based on user preferences, and wherein the first intonational phrase is indexed by a first unique identifier;
    generating, via the processor, a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice, wherein the second text-to-speech voice is selected based on the user preferences, and wherein the second intonational phrase is indexed by a second unique identifier;
    storing the first file and the second file in a cache on a web-server;
    transmitting the first file to the client in response to the request; and
    while the client plays the first file, generating additional files containing additional text-to-speech data for remaining intonational phrases of the plurality of intonational phrases, wherein the remaining intonational phrases comprise the second intonational phrase, and wherein each of the additional files is indexed by the first unique identifier plus a respective offset.

2. The method of claim 1, wherein an intonational phrase is a phrase in which intonation within the phrase only depends on text inside the phrase.

3. The method of claim 1, wherein the first file is indexed by a unique identifier.

4. The method of claim 1, wherein the first file contains notification information.

5. The method of claim 1, wherein the unique identifier comprises a text identifier and an offset index.

6. The method of claim 1, wherein the additional files contain additional notification information.

7. The method of claim 1, wherein generating the additional files occurs while the web browser plays the text-to-speech data in the first file.

8. The method of claim 1, wherein the receiving and the transmitting occur on the web server, wherein the web server deletes items saved in the cache within an expiration threshold.

9. The method of claim 1, further comprising transmitting one of the first file and a supplemental file of the additional files to the web browser in response to an additional request.

10. The method of claim 4, wherein the notification information comprises synchronization data.

11. The method of claim 1, wherein boundaries between intonational phrases comprise silence.

12. The method of claim 1, further comprising: receiving text-to-speech settings from the client; and generating the first file and the additional files based on the text-to-speech settings.

13. The method of claim 1, further comprising: generating parallel versions of the first file and the additional files using different text-to-speech voices.

14. A system comprising: a processor; a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising:
    receiving, from a client, text associated with a request for text-to-speech synthesis;
    performing, via a processor of a computing device, an analysis of the text to identify a plurality of intonational phrases in the text, wherein a size of the text being analyzed is based on a network latency;
    generating, via the processor, a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice, wherein the first text-to-speech voice is selected based on user preferences, and wherein the first intonational phrase is indexed by a first unique identifier;
    generating, via the processor, a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice, wherein the second text-to-speech voice is selected based on the user preferences, and wherein the second intonational phrase is indexed by a second unique identifier;
    storing the first file and the second file in a cache on a web-server;
    transmitting the first file to the client in response to the request; and
    while the client plays the first file, generating additional files containing additional text-to-speech data for remaining intonational phrases of the plurality of intonational phrases, wherein the remaining intonational phrases comprise the second intonational phrase, and wherein each of the additional files is indexed by the first unique identifier plus a respective offset.

15. The system of claim 14, wherein the operations are associated with a web browser.

16. The system of claim 15, wherein no browser plugin is required for the operations.

17. The system of claim 14, wherein the computer-readable storage medium has additional instructions stored which, when executed by the processor, result in operations comprising: receiving user input navigating to a different position within the text; identifying a new offset for the different position; and fetching a corresponding file from the server for playback based on the unique identifier and the new offset.

18. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising:
    receiving, from a client, text associated with a request for text-to-speech synthesis;
    performing, via a processor of a computing device, an analysis of the text to identify a plurality of intonational phrases in the text, wherein a size of the text being analyzed is based on a network latency;
    generating, via the processor, a first file containing text-to-speech data for a first intonational phrase of the plurality of intonational phrases using a first text-to-speech voice, wherein the first text-to-speech voice is selected based on user preferences, and wherein the first intonational phrase is indexed by a first unique identifier;
    generating, via the processor, a second file containing the text-to-speech data for a second intonational phrase of the plurality of intonational phrases using a second text-to-speech voice, wherein the second text-to-speech voice is selected based on the user preferences, and wherein the second intonational phrase is indexed by a second unique identifier;
    storing the first file and the second file in a cache on a web-server;
    transmitting the first file to the client in response to the request; and
    while the client plays the first file, generating additional files containing additional text-to-speech data for remaining intonational phrases of the plurality of intonational phrases, wherein the remaining intonational phrases comprise the second intonational phrase, and wherein each of the additional files is indexed by the first unique identifier plus a respective offset.

19. The computer-readable storage device of claim 18, having additional instructions stored which, when executed by the computing device, cause the computing device to perform operations comprising: generating parallel versions of the first file and the additional files using different text-to-speech voices.
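
The indexing and background-generation steps recited in the claims can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: splitting on punctuation is only a crude stand-in for intonational-phrase analysis, and `TtsSession`, `split_phrases`, `fake_tts`, and the `<text id>-<offset>` identifier format are hypothetical names chosen for this example.

```python
import re
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List


def split_phrases(text: str) -> List[str]:
    """Crude stand-in for intonational-phrase analysis: split after punctuation."""
    parts = re.split(r"(?<=[.,;:!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_tts(phrase: str) -> bytes:
    """Placeholder synthesizer; a real system would return audio data."""
    return phrase.upper().encode("utf-8")


class TtsSession:
    """One text submission; files are indexed by a text identifier plus an offset."""

    def __init__(self, text: str) -> None:
        self.text_id = uuid.uuid4().hex  # assumed form of the unique identifier
        self.phrases = split_phrases(text)
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._files: Dict[str, Future] = {}

    def file_id(self, offset: int) -> str:
        return f"{self.text_id}-{offset}"

    def start(self) -> bytes:
        """Return audio for phrase 0 and queue the remaining phrases in the background."""
        for offset, phrase in enumerate(self.phrases):
            self._files[self.file_id(offset)] = self._pool.submit(fake_tts, phrase)
        # Only the first file is awaited here; the others are generated
        # while the client would be playing it.
        return self._files[self.file_id(0)].result()

    def fetch(self, offset: int) -> bytes:
        """Return the file at an offset, e.g. after the user navigates in the text."""
        return self._files[self.file_id(offset)].result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


session = TtsSession("Hello there, how are you? I am fine.")
first_audio = session.start()   # audio for "Hello there,"
later_audio = session.fetch(2)  # jump ahead to the third phrase
session.close()
```

Because every file shares the text identifier and differs only in its offset, a client in this sketch can request the next phrase, or jump to any other position, by computing the identifier locally rather than resubmitting the text.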

Classifications

  • G10L13/04 (Primary)

    Details of speech synthesis systems, e.g. synthesiser structure or memory management

  • G10L13/10 (Primary)

    Prosody rules derived from text; Stress or intonation

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9240180B2 cover?
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio …
Who is the assignee on this patent?
Conkie Alistair D, Beutnagel Mark Charles, Mishra Taniya, and 1 more
What technology area does this patent fall under?
Primary CPC classification G10L13/04. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 19, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).