Text-to-speech processing using previously speech processed data

US10140973B1 · US · B1

Patent metadata
Field               Value
Publication number  US-10140973-B1
Application number  US-201615266116-A
Country             US
Kind code           B1
Filing date         Sep 15, 2016
Priority date       Sep 15, 2016
Publication date    Nov 27, 2018
Grant date          Nov 27, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and devices for generating text-to-speech output using previously captured speech are described. Spoken audio is obtained and undergoes speech processing to create text. The resulting text is stored with the spoken audio, with both the text and the spoken audio being associated with the individual that spoke the audio. Various spoken audio and corresponding text are stored over time to create a library of speech units. When the individual sends a text message to a recipient, the text message is processed to determine portions of text, and the portions of text are compared to the library of text associated with the individual. When text in the library is identified, the system selects the spoken audio units associated with the identified stored text. The selected spoken audio units are then used to generate output audio data corresponding to the original text message, with the output audio data being sent to a device of the message recipient.
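The flow in the abstract (store recognized text with its audio per speaker, then match portions of a text message against that library and reuse the audio) can be sketched in a few lines. This is only an illustration, not the patented implementation: the `SpeechLibrary` class, its `add` and `cover` methods, the greedy longest-match strategy, and the use of raw `bytes` as audio are all assumptions made for the example. Speech recognition is assumed to have already produced the text passed to `add`, and punctuation is not handled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpeechLibrary:
    """Hypothetical per-speaker store: recognized word sequence -> spoken audio."""

    units: dict[tuple[str, ...], bytes] = field(default_factory=dict)

    def add(self, text: str, audio: bytes) -> None:
        # Key on the lowercased word sequence returned by speech recognition.
        self.units[tuple(text.lower().split())] = audio

    def cover(self, message: str) -> tuple[list[bytes], list[str]]:
        """Greedily match the longest stored word sequence at each position.

        Returns the selected audio units in order, plus the words for which
        no stored audio was found.
        """
        words = message.lower().split()
        selected: list[bytes] = []
        missing: list[str] = []
        i = 0
        while i < len(words):
            # Try the longest remaining span first, then shrink it.
            for j in range(len(words), i, -1):
                audio = self.units.get(tuple(words[i:j]))
                if audio is not None:
                    selected.append(audio)
                    i = j
                    break
            else:
                # No stored unit starts at this word.
                missing.append(words[i])
                i += 1
        return selected, missing


library = SpeechLibrary()
library.add("see you", b"<audio: see you>")
library.add("tomorrow", b"<audio: tomorrow>")

selected, missing = library.cover("See you tomorrow morning")
output_audio = b"".join(selected)  # naive concatenation of the selected units
```

Here `missing` collects the words with no stored audio ("morning" in this example), which is the case where a real system would need some fallback, such as requesting a new recording or using synthesized speech.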

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: receiving first input audio data corresponding to an utterance; performing automatic speech recognition processing on the first input audio data to create first reference text data; in a database associated with a first user profile, storing a first association between the first input audio data and the first reference text data; receiving, from a first device associated with the first user profile, a message intended for a second device, the message including first text data; determining the first text data corresponds to the first reference text data; identifying the first input audio data in the database based at least in part on the first association; causing the first device to output a visual indication representing that the first text data corresponds to the first reference text data; generating, after causing the first device to output the visual indication, output audio data including the first input audio data; and sending, to the second device, the output audio data.

2. The computer-implemented method of claim 1, further comprising: receiving second input audio data corresponding to a second utterance; performing automatic speech recognition processing on the second input audio data to create second reference text data; in the database, storing a second association between the second input audio data and the second reference text data; determining a pronunciation of the first text data; determining a first diphone identifier associated with the first reference text data; determining a second diphone identifier associated with the second reference text data; and determining the first diphone identifier and the second diphone identifier correspond to the pronunciation, wherein generating the output audio data comprises concatenating the first input audio data to the second input audio data.

3. The computer-implemented method of claim 2, further comprising: associating the first reference text data with first pronunciation data; associating the second reference text data with second pronunciation data; receiving, from the first device, a third message intended for the second device, the third message including second text data; determining the second text data corresponds to a first word of the first reference text data and a second word of the second reference text data, the first word being identical to the second word; performing prosodic analysis processing on the second text data to determine third pronunciation data; and identifying the first word for generating second output audio data based at least in part on the first pronunciation data being at least similar to the third pronunciation data.

4. A system, comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input audio data corresponding to at least one first utterance associated with user profile data; perform automatic speech recognition processing on the first input audio data to create first text data; associate the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data, the at least a portion of the first input audio data having a first prosodic characteristic; receive second text data; determine the second text data is associated with the user profile data; determine the first text data corresponds to at least a portion of the second text data; determine the at least a portion of the second text data is associated with third text data, the third text data being associated with second input audio data having a second prosodic characteristic; perform prosodic analysis processing on the second text data to determine a third prosodic characteristic; determine the third prosodic characteristic at least substantially matches the first prosodic characteristic; generate, after determining the third prosodic characteristic at least substantially matches the first prosodic characteristic, output audio data using the at least a portion of the first input audio data; and send the output audio data to a first device.

5. The system of claim 4, wherein a first portion of the first input audio data represents a diphone and wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: determine a pronunciation of the second text data; determine the diphone corresponds to the pronunciation; and generate the output audio data based at least in part on the diphone corresponding to the pronunciation.

6. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: cause a second device to output a first visual indication representing a first portion of the second text data corresponds to the first text data; and cause the second device to output a second visual indication representing a second portion of the second text data does not correspond to the first text data, the first visual indication and the second visual indication being different with respect to at least color.

7. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data; determine the third text data does not correspond to the first text data; cause a second device to request further input audio corresponding to at least one second utterance corresponding to the third text data; receive, from the second device, second input audio data; and associate the third text data with the second input audio data and the user profile data.

8. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data representing a first word; determine the first word does not correspond to the first text data; perform natural language understanding processing on the third text data; determine a second word having a similar meaning as the first word; determine the second word corresponds to the first text data; and send, to a second device, first data representing the second word.

9. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data including a first portion and a second portion; determine a first portion of the first text data corresponding to the first portion of the third text data; determine a second portion of the first text data corresponding to the second portion of the third text data; and generate second output audio data at least partially corresponding to the first portion of the first text data and the second portion of the first text data.

10. The system of claim 9, wherein the first portion of the first text data is a first sequence of words and the second portion of the first text data is a second sequence of words.

11. A computer-implemented method, comprising: receiving first input audio data corresponding to at least one first utterance associated with user profile data; performing automatic speech recognition processing on the first input audio data to create first text data; associating the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data; receiving second text data; determ
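Claims 3 and 4 depend on choosing, among several stored recordings of the same text, the one whose prosody "at least substantially matches" the prosody predicted for the new message. The claims do not say how prosody is represented or compared, so the sketch below is purely illustrative: the `Candidate` type, the two-number prosody vector (pitch and duration, assumed pre-normalized to comparable 0 to 1 scales), the Euclidean distance, and the `tolerance` threshold are all assumptions introduced for the example.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stored unit: recognized text, its audio, and a prosody vector."""

    text: str
    audio: bytes
    prosody: tuple[float, float]  # assumed (normalized pitch, normalized duration)


def pick_by_prosody(
    candidates: list[Candidate],
    target: tuple[float, float],
    tolerance: float,
) -> Candidate | None:
    """Return the candidate closest to the target prosody, or None when even
    the closest one is farther away than `tolerance`."""
    best = min(
        candidates,
        key=lambda c: math.dist(c.prosody, target),
        default=None,
    )
    if best is None or math.dist(best.prosody, target) > tolerance:
        return None
    return best


# Two recordings of the same word with different delivery.
recordings = [
    Candidate("really", b"<audio: really, rising>", (0.9, 0.6)),
    Candidate("really", b"<audio: really, flat>", (0.3, 0.4)),
]

# Prosody predicted for the word as used in the new text message (assumed input).
chosen = pick_by_prosody(recordings, target=(0.8, 0.5), tolerance=0.25)
```

Returning `None` when nothing falls within the tolerance leaves the caller free to decide what happens next, for example falling back to a different unit or requesting new audio as in claim 7.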

Assignees

  • Amazon Tech Inc

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Concatenation rules · CPC title

  • Thesauruses; Synonyms · CPC title

  • Semantic analysis · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L13/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 27, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).