Voice message capturing system
US-10891959-B1 · Jan 12, 2021 · US
| Field | Value |
|---|---|
| Publication number | US-11670285-B1 |
| Application number | US-202017102910-A |
| Country | US |
| Kind code | B1 |
| Filing date | Nov 24, 2020 |
| Priority date | Nov 24, 2020 |
| Publication date | Jun 6, 2023 |
| Grant date | Jun 6, 2023 |
Official abstract text for this publication.
Techniques for an interactive turn-based reading experience are described. A system may take turns reading content, such as a book, with a user. The system may process audio data representing a user reading a portion of the content, determine reading evaluation data, and determine how to proceed for the next turn based on the reading evaluation data. For example, based on the reading evaluation data, the system may read a portion of the content by outputting synthesized speech representing the content, may ask the user to re-read a portion of the content, or may ask the user to read a different, smaller portion of the content.
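The turn decision described in the abstract can be illustrated with a minimal Python sketch. The abstract names three possible outcomes but does not say how the reading evaluation data maps to them, so the single `score` field, the threshold values, and the mapping from score ranges to outcomes are all assumptions made for illustration, and every name below is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class NextTurn(Enum):
    """The three outcomes named in the abstract."""
    SYSTEM_READS_NEXT = "system_reads_next"      # system outputs synthesized speech
    USER_REREADS = "user_rereads"                # user is asked to re-read the portion
    USER_READS_SMALLER = "user_reads_smaller"    # user is asked to read a smaller portion


@dataclass
class ReadingEvaluation:
    # Hypothetical representation: one score from 0.0 (poor) to 1.0 (perfect).
    score: float


def choose_next_turn(evaluation: ReadingEvaluation,
                     good: float = 0.8,
                     fair: float = 0.5) -> NextTurn:
    """Pick the next turn from the evaluation.

    The thresholds and the score-to-outcome mapping are assumptions,
    not values taken from the patent.
    """
    if evaluation.score >= good:
        return NextTurn.SYSTEM_READS_NEXT
    if evaluation.score >= fair:
        return NextTurn.USER_REREADS
    return NextTurn.USER_READS_SMALLER
```

For example, under these assumed thresholds `choose_next_turn(ReadingEvaluation(0.3))` would select the smaller-portion request, which corresponds to moving from a page to a paragraph in claim 3.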
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: receiving, from a device, first input audio data representing a spoken natural language input including a request to read a book, the first input audio data associated with a session identifier; processing the first input audio data to determine a book identifier associated with the book; receiving first text data associated with the book identifier, the first text data representing a first portion of the book; receiving, from the device, second input audio data including speech corresponding to the first portion of the book, the second input audio data associated with the session identifier; determining that the second input audio data corresponds to an entirety of the first portion of the book; determining, using a trained machine learning (ML) model, first reading evaluation data based on the second input audio data and the first text data, the first reading evaluation data associated with the session identifier; based on the first reading evaluation data, determining to output a second portion of the book; receiving second text data associated with the book identifier, the second text data representing the second portion of the book; performing text-to-speech (TTS) processing on the second text data to generate first output audio data representing the second portion of the book; and sending the first output audio data to the device.
2. The computer-implemented method of claim 1, further comprising: after the first output audio data is sent, enabling a listening mode to capture speech; receiving, from the device, third input audio data corresponding to a third portion of the book, the third input audio data associated with the session identifier; determining that the third input audio data corresponds to an entirety of the third portion of the book; enabling the listening mode; determining second reading evaluation data based on the third input audio data and third text data representing the third portion of the book, the second reading evaluation data associated with the session identifier; based on the second reading evaluation data, sending, to the device, second output audio data representing a request to read the third portion of the book; enabling the listening mode; and receiving, from the device, fourth input audio data corresponding to the third portion of the book.
3. The computer-implemented method of claim 1, further comprising: after the first output audio data is presented, enabling a listening mode to capture speech; receiving, from the device, third input audio data corresponding to a third portion of the book, the third input audio data associated with the session identifier, wherein the third portion of the book corresponds to a page in the book; determining that the third input audio data corresponds to an entirety of the third portion of the book; disabling the listening mode; determining second reading evaluation data based on the third input audio data and third text data representing the third portion of the book, the second reading evaluation data associated with the session identifier; based on the second reading evaluation data, determining to decrease amount of book to be read; and sending, to the device, second output audio data representing a request to read a fourth portion of the book, wherein the fourth portion of the book corresponds to a paragraph on the page.
4. The computer-implemented method of claim 1, wherein determining the first reading evaluation data comprises: performing automatic speech recognition (ASR) processing using the second input audio data to determine ASR output data; processing the ASR output data with respect to the first text data to determine first data representing a reading accuracy, the first data based at least on one of: deletion of a word in the ASR output data with respect to the first text data, insertion of a word in the ASR output data with respect to the first text data, and substitution of a word in the ASR output data with respect to the first text data; processing the second input audio data using the trained ML model to determine second data representing a pronunciation accuracy, the trained ML model configured to perform phoneme alignment; and determining the first reading evaluation data using the first data and the second data.
5. A computer-implemented method comprising: receiving first input audio data corresponding to speech representing a first portion of content; determining that the first input audio data corresponds to an entirety of the first portion of the content; determining, using a first trained machine learning (ML) model, first reading evaluation data based on the first input audio data and the first portion of the content; based on the first reading evaluation data, determining to output a second portion of the content; performing text-to-speech (TTS) processing to generate first output audio data including synthesized speech corresponding to the second portion of the content; and outputting the first output audio data.
6. The computer-implemented method of claim 5, further comprising: prior to receiving the first input audio data, receiving second input audio data representing a request to read content; receiving data representing the content; enabling a listening mode to capture speech; and disabling the listening mode after determining that the first input audio data corresponds to the entirety of the first portion of the content.
7. The computer-implemented method of claim 5, further comprising: receiving second input audio data corresponding to reading of a third portion of content; determining that the second input audio data corresponds to an entirety of the third portion of the content; determining second reading evaluation data based on the second input audio data and the third portion of the content; based on the second reading evaluation data, outputting second output audio data representing a request to read the third portion of the content; and receiving third input audio data corresponding to the third portion of the content.
8. The computer-implemented method of claim 5, further comprising: receiving second input audio data corresponding to reading of a third portion of content including a first number of words; determining that the second input audio data corresponds to an entirety of the third portion of the content; determining second reading evaluation data based on the second input audio data and the third portion of the content; and based on the second reading evaluation data, outputting second output audio data representing a request to read a fourth portion of the content including a second plurality of words, wherein the second plurality of words is less than the first number of words.
9. The computer-implemented method of claim 5, wherein determining the first reading evaluation data comprises: performing automatic speech recognition (ASR) processing using the first input audio data to determine ASR output data; processing the ASR output data with respect to the first portion of the content to determine first data representing a reading accuracy; processing the first input audio data using the first trained ML model to determine second data representing a pronunciation accuracy, the first trained ML model configured to perform phoneme alignment; and determining the first reading evaluation data using the first data and the second data.
10. The computer-implemented method of claim 5, further comprising: prior to receiving the first input audio data, receiving second input audio data requesting to read a book; receiving image data representing a b
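The word-level comparison recited in claim 4 (deletion, insertion, and substitution of words in the ASR output with respect to the text) can be sketched as a standard edit-distance alignment. This is only an illustration under stated assumptions: the claims do not specify an alignment algorithm or a scoring formula, so the Levenshtein-style dynamic program, the whitespace tokenization, and the `reading_accuracy` formula below are hypothetical choices. The pronunciation-accuracy model with phoneme alignment is not covered here.

```python
def count_word_errors(reference: str, hypothesis: str) -> dict:
    """Count word substitutions, deletions, and insertions in the ASR
    hypothesis relative to the reference text (minimum-edit alignment)."""
    ref = reference.lower().split()   # assumed tokenization: whitespace only
    hyp = hypothesis.lower().split()

    # dp[i][j] = (total_edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)       # every reference word was skipped
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)       # every hypothesis word is extra

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)              # words match, no edit
            else:
                diagonal = (t + 1, s + 1, d, n)      # substitution
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)          # reference word missing
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)         # extra spoken word
            # Tuples compare by total edit count first.
            dp[i][j] = min(diagonal, deletion, insertion)

    _, subs, dels, ins = dp[-1][-1]
    return {"substitutions": subs, "deletions": dels, "insertions": ins}


def reading_accuracy(reference: str, hypothesis: str) -> float:
    """Hypothetical accuracy score: 1 minus the word error rate, floored at 0."""
    n_words = len(reference.split())
    if n_words == 0:
        return 1.0
    errors = sum(count_word_errors(reference, hypothesis).values())
    return max(0.0, 1.0 - errors / n_words)
```

With the reference "the cat sat on the mat" and the recognized text "the cat sit on mat", the alignment reports one substitution and one deletion. A real system would likely normalize punctuation and casing more carefully before comparing.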
Details of speech synthesis systems, e.g. synthesiser structure or memory management · CPC title
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title
Speech synthesis; Text to speech systems · CPC title