Method and apparatus for recording and playing user voice in mobile terminal by synchronizing with text
US-9786267-B2 · Oct 10, 2017 · US
US10140973B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10140973-B1 |
| Application number | US-201615266116-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 15, 2016 |
| Priority date | Sep 15, 2016 |
| Publication date | Nov 27, 2018 |
| Grant date | Nov 27, 2018 |
Official abstract text for this publication.
Systems, methods, and devices for generating text-to-speech output using previously captured speech are described. Spoken audio is obtained and undergoes speech processing to create text. The resulting text is stored with the spoken audio, with both the text and the spoken audio being associated with the individual that spoke the audio. Various spoken audio and corresponding text are stored over time to create a library of speech units. When the individual sends a text message to a recipient, the text message is processed to determine portions of text, and the portions of text are compared to the library of text associated with the individual. When text in the library is identified, the system selects the spoken audio units associated with the identified stored text. The selected spoken audio units are then used to generate output audio data corresponding to the original text message, with the output audio data being sent to a device of the message recipient.
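The flow in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `SpeechLibrary` class and its method names are hypothetical, byte strings stand in for captured audio, and the speech recognition step is replaced by transcripts supplied directly. Matching here is a simple greedy longest-phrase lookup; the patent does not state that this particular strategy is used.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechLibrary:
    """Hypothetical per-user store linking recognized text to captured audio."""

    # Key: tuple of lowercase words; value: the audio captured for that phrase.
    units: dict[tuple[str, ...], bytes] = field(default_factory=dict)

    def add_unit(self, text: str, audio: bytes) -> None:
        """Store one speech unit (text as it would come from ASR, plus its audio)."""
        self.units[tuple(text.lower().split())] = audio

    def synthesize(self, message: str) -> tuple[bytes, list[str]]:
        """Build output audio for a text message from stored units.

        Returns the concatenated audio and the words with no stored unit.
        Tokenization is naive whitespace splitting (an assumption).
        """
        words = message.lower().split()
        audio = b""
        missing: list[str] = []
        i = 0
        while i < len(words):
            # Try the longest phrase starting at position i first.
            for j in range(len(words), i, -1):
                unit = self.units.get(tuple(words[i:j]))
                if unit is not None:
                    audio += unit
                    i = j
                    break
            else:
                # No stored unit covers this word.
                missing.append(words[i])
                i += 1
        return audio, missing


library = SpeechLibrary()
library.add_unit("see you", b"<see-you>")
library.add_unit("tomorrow", b"<tomorrow>")
output_audio, uncovered = library.synthesize("See you tomorrow morning")
```

In this sketch `uncovered` lists the words the library cannot voice. A real system would need a fallback for those, such as prompting the user for more recordings.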
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method, comprising: receiving first input audio data corresponding to an utterance; performing automatic speech recognition processing on the first input audio data to create first reference text data; in a database associated with a first user profile, storing a first association between the first input audio data and the first reference text data; receiving, from a first device associated with the first user profile, a message intended for a second device, the message including first text data; determining the first text data corresponds to the first reference text data; identifying the first input audio data in the database based at least in part on the first association; causing the first device to output a visual indication representing that the first text data corresponds to the first reference text data; generating, after causing the first device to output the visual indication, output audio data including the first input audio data; and sending, to the second device, the output audio data.

2. The computer-implemented method of claim 1, further comprising: receiving second input audio data corresponding to a second utterance; performing automatic speech recognition processing on the second input audio data to create second reference text data; in the database, storing a second association between the second input audio data and the second reference text data; determining a pronunciation of the first text data; determining a first diphone identifier associated with the first reference text data; determining a second diphone identifier associated with the second reference text data; and determining the first diphone identifier and the second diphone identifier correspond to the pronunciation, wherein generating the output audio data comprises concatenating the first input audio data to the second input audio data.

3. The computer-implemented method of claim 2, further comprising: associating the first reference text data with first pronunciation data; associating the second reference text data with second pronunciation data; receiving, from the first device, a third message intended for the second device, the third message including second text data; determining the second text data corresponds to a first word of the first reference text data and a second word of the second reference text data, the first word being identical to the second word; performing prosodic analysis processing on the second text data to determine third pronunciation data; and identifying the first word for generating second output audio data based at least in part on the first pronunciation data being at least similar to the third pronunciation data.

4. A system, comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input audio data corresponding to at least one first utterance associated with user profile data; perform automatic speech recognition processing on the first input audio data to create first text data; associate the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data, the at least a portion of the first input audio data having a first prosodic characteristic; receive second text data; determine the second text data is associated with the user profile data; determine the first text data corresponds to at least a portion of the second text data; determine the at least a portion of the second text data is associated with third text data, the third text data being associated with second input audio data having a second prosodic characteristic; perform prosodic analysis processing on the second text data to determine a third prosodic characteristic; determine the third prosodic characteristic at least substantially matches the first prosodic characteristic; generate, after determining the third prosodic characteristic at least substantially matches the first prosodic characteristic, output audio data using the at least a portion of the first input audio data; and send the output audio data to a first device.

5. The system of claim 4, wherein a first portion of the first input audio data represents a diphone and wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: determine a pronunciation of the second text data; determine the diphone corresponds to the pronunciation; and generate the output audio data based at least in part on the diphone corresponding to the pronunciation.

6. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: cause a second device to output a first visual indication representing a first portion of the second text data corresponds to the first text data; and cause the second device to output a second visual indication representing a second portion of the second text data does not correspond to the first text data, the first visual indication and the second visual indication being different with respect to at least color.

7. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data; determine the third text data does not correspond to the first text data; cause a second device to request further input audio corresponding to at least one second utterance corresponding to the third text data; receive, from the second device, second input audio data; and associate the third text data with the second input audio data and the user profile data.

8. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data representing a first word; determine the first word does not correspond to the first text data; perform natural language understanding processing on the third text data; determine a second word having a similar meaning as the first word; determine the second word corresponds to the first text data; and send, to a second device, first data representing the second word.

9. The system of claim 4, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to: receive third text data including a first portion and a second portion; determine a first portion of the first text data corresponding to the first portion of the third text data; determine a second portion of the first text data corresponding to the second portion of the third text data; and generate second output audio data at least partially corresponding to the first portion of the first text data and the second portion of the first text data.

10. The system of claim 9, wherein the first portion of the first text data is a first sequence of words and the second portion of the first text data is a second sequence of words.

11. A computer-implemented method, comprising: receiving first input audio data corresponding to at least one first utterance associated with user profile data; performing automatic speech recognition processing on the first input audio data to create first text data; associating the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data; receiving second text data; determ
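Claims 6 and 8 describe feedback to the sender: indicating which portions of a draft message match stored text, and offering a similar-meaning word that does match. The sketch below is one possible reading, not the claimed implementation. The function name `annotate_message`, the word-level library, and the hand-written synonym table (standing in for the natural language understanding step) are all assumptions made for illustration.

```python
from typing import Optional


def annotate_message(
    message: str,
    library_words: set[str],
    synonyms: dict[str, list[str]],
) -> list[tuple[str, str, Optional[str]]]:
    """Label each word and, where possible, suggest a stored substitute.

    Returns (word, status, suggestion) triples. Status is "matched" or
    "unmatched"; a UI could render the two statuses in different colors.
    """
    annotated = []
    for word in message.lower().split():
        if word in library_words:
            annotated.append((word, "matched", None))
            continue
        # Hypothetical synonym lookup: take the first synonym that has stored audio.
        suggestion = next(
            (s for s in synonyms.get(word, []) if s in library_words),
            None,
        )
        annotated.append((word, "unmatched", suggestion))
    return annotated


labels = annotate_message(
    "Hello buddy",
    library_words={"hello", "friend"},
    synonyms={"buddy": ["pal", "friend"]},
)
```

Here `labels` marks "hello" as matched and "buddy" as unmatched with "friend" as the suggested replacement, since "friend" is the first listed synonym that has stored audio.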
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Concatenation rules · CPC title
Thesauruses; Synonyms · CPC title
Semantic analysis · CPC title