Synthesized speech audio data generated on behalf of human participant in conversation

US12190859B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12190859-B2
Application number  US-202017792012-A
Country             US
Kind code           B2
Filing date         Feb 10, 2020
Priority date       Feb 10, 2020
Publication date    Jan 7, 2025
Grant date          Jan 7, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Generating synthesized speech audio data on behalf of a given user in a conversation. The synthesized speech audio data includes synthesized speech that incorporates textual segment(s). The textual segment(s) can include recognized text that results from processing spoken input, of the given user, using a speech recognition model and/or can include a selection of a rendered suggestion that conveys the textual segment(s). Some implementations dynamically determine one or more prosodic properties for use in speech synthesis of the textual segment, and generate the synthesized speech with the one or more determined prosodic properties. The prosodic properties can be determined based on the textual segment(s) used in speech synthesis, textual segment(s) corresponding to recent spoken input of additional participant(s), attribute(s) of relationship(s) between the given user and additional participant(s) in the conversation, and/or feature(s) of a current location for the conversation.
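The abstract's idea of choosing prosodic properties from the relationship and the location can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the `Prosody` fields, the relationship labels ("boss", "friend"), the "workplace" location class, and the numeric values do not come from the patent, which does not prescribe a data structure or specific values.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Prosody:
    """Hypothetical set of prosodic properties (illustrative fields only)."""
    rate: float      # speaking-rate multiplier, 1.0 = neutral
    pitch: float     # pitch shift in semitones
    intonation: str  # coarse style label


# Assumed mapping from a relationship attribute to a prosody set.
PROSODY_BY_RELATIONSHIP = {
    "boss": Prosody(rate=0.95, pitch=0.0, intonation="formal"),
    "friend": Prosody(rate=1.10, pitch=1.0, intonation="casual"),
}
DEFAULT_PROSODY = Prosody(rate=1.0, pitch=0.0, intonation="neutral")


def select_prosody(relationship: str, location: Optional[str] = None) -> Prosody:
    """Pick a prosody set from the relationship, then adjust for location."""
    base = PROSODY_BY_RELATIONSHIP.get(relationship, DEFAULT_PROSODY)
    # Assumed rule: a workplace setting forces a formal intonation.
    if location == "workplace":
        return replace(base, intonation="formal")
    return base


if __name__ == "__main__":
    print(select_prosody("boss"))
    print(select_prosody("friend", location="workplace"))
```

A lookup table keeps the sketch easy to follow. A real system could equally derive these properties from a learned model, and the abstract also allows the textual segments themselves to influence the result.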

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, the method comprising: detecting, via one or more microphones of a client device of a given user, spoken input of the given user; determining, based on processing the spoken input of the given user, a textual segment for conveying in a conversation in which the given user is a participant; identifying an additional participant in the conversation, the additional participant being in addition to the given user, and the additional participant being physically located in an environment with the given user; determining at least one attribute of a relationship between the given user and the additional participant; determining, based on the at least one attribute of the relationship between the given user and the additional participant, a given set of one or more prosodic properties, wherein the given set of the one or more prosodic properties is a first set of the one or more prosodic properties in response to determining the at least one attribute of the relationship between the given user and the additional participant is a first attribute, and wherein the given set of the one or more prosodic properties is a second set of the one or more prosodic properties in response to determining the at least one attribute of the relationship between the given user and the additional participant is a second attribute; generating synthesized speech audio data that includes synthesized speech that incorporates the textual segment and that is synthesized with the given set of the one or more prosodic properties, wherein generating the synthesized speech audio data comprises synthesizing the synthesized speech with the given set of the one or more prosodic properties responsive to determining the given set of the one or more prosodic properties based on the attribute of the relationship between the given user and the additional participant; and causing the synthesized speech to be rendered via one or more speakers of the client device and/or an additional client device, wherein the rendered synthesized speech is audibly perceptible to the additional participant.

2. The method of claim 1, wherein determining, based on processing the spoken input of the given user, the textual segment, comprises: processing the spoken input using a speech recognition model to generate the textual segment.

3. The method of claim 2, wherein the speech recognition model is an on-device speech recognition model and/or is trained for recognizing speech of speech impaired users.

4. The method of claim 1, further comprising, subsequent to causing the synthesized speech to be rendered: detecting, via one or more of the microphones of the client device, an additional participant spoken input, of the additional participant; processing the additional participant spoken input using a speech recognition model to generate an additional participant textual segment that is a recognition of the additional participant spoken input; determining that an additional textual segment is a candidate response to the additional participant textual segment; and determining to display a graphical element that conveys the additional textual segment responsive to determining that the additional textual segment is the candidate response to the additional participant textual segment.

5. The method of claim 4, wherein identifying the additional participant in the conversation comprises: performing speaker identification using the additional participant spoken input; and identifying the additional participant based on the speaker identification.

6. The method of claim 5, wherein performing the speaker identification comprises: generating, at the client device, a spoken input embedding based on processing the additional participant spoken input using a speaker identification model; and comparing, at the client device, the spoken input embedding to a pre-stored embedding for the additional participant, the pre-stored embedding being previously stored locally at the client device responsive to authorization by the additional participant.

7. The method of claim 4, wherein determining that the additional textual segment is the candidate response to the additional participant textual segment is further based on the at least one attribute of the relationship between the given user and the additional participant.

8. The method of claim 7, wherein determining that the additional textual segment is the candidate response to the additional participant textual segment is further based on the at least one attribute of the relationship between the given user and the additional participant comprises: generating a superset of initial candidate responses based on the additional participant textual segment, the superset including the additional textual segment; and selecting, from the superset of initial candidate responses, the additional textual segment as the candidate response based on the at least one of the attributes of the relationship between the given user and the additional participant.

9. The method of claim 4, further comprising: determining at least one classification of a location of the client device; wherein determining that the additional textual segment is the candidate response to the additional participant textual segment is further based on the at least one classification of the location.

10. The method of claim 4, further comprising: in response to receiving a user selection, from the given user, of the graphical element that conveys the additional textual segment: generating additional synthesized speech audio data that includes additional synthesized speech that incorporates the additional textual segment and that is synthesized with the one or more prosodic properties, wherein generating the additional synthesized speech audio data comprises synthesizing the additional synthesized speech with the one or more prosodic properties; and causing the additional synthesized speech to be rendered via one or more of the speakers of the client device and/or the additional client device, wherein the rendered additional synthesized speech is audibly perceptible to the additional participant.

11. The method of claim 1, further comprising: identifying a further additional participant in the conversation, the further additional participant being in addition to the given user and being in addition to the additional participant; determining the given set of the one or more prosodic properties based on both: (a) the attribute of the relationship between the given user and the additional participant, and (b) one or more additional attributes of an additional relationship between the given user and the further additional participant.

12. The method of claim 1, further comprising: identifying a further additional participant in the conversation, the further additional participant being in addition to the given user and being in addition to the additional participant; determining the given set of the one or more prosodic properties based on the attribute of the relationship between the given user and the additional participant, in lieu of one or more additional attributes of an additional relationship between the given user and the further additional participant, responsive to: determining that the relationship between the given user and the additional participant is more formal than the additional relationship between the given user and the further additional participant.

13. The method of claim 1, further comprising: determining at least one classification of a location of the client device; wherein determining the
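Claim 6 describes comparing a spoken input embedding to a locally stored embedding for the additional participant. The sketch below shows one plausible way to make such a comparison. The use of cosine similarity, the 0.8 threshold, the function names, and the toy two-dimensional vectors are all assumptions for illustration; the claim does not specify a distance measure, and the embeddings in practice would come from a speaker identification model that is not modeled here.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff, not taken from the patent
) -> Optional[str]:
    """Return the enrolled name whose stored embedding best matches, or None."""
    best_name, best_score = None, threshold
    for name, stored in enrolled.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy embeddings stored on the device after each participant's authorization.
    enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
    print(identify_speaker([0.9, 0.1], enrolled))
    print(identify_speaker([1.0, 1.0], enrolled))
```

Returning `None` when no stored embedding clears the threshold leaves room for a caller to treat the speaker as unknown rather than forcing a match.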

Assignees

Google LLC

Inventors

Classifications

  • Prosody rules derived from text; Stress or intonation · CPC title

  • Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands · CPC title

  • Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

  • Detection of discrete points within a voice signal · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12190859B2 cover?
Generating synthesized speech audio data on behalf of a given user in a conversation. The synthesized speech audio data includes synthesized speech that incorporates textual segment(s). The textual segment(s) can include recognized text that results from processing spoken input, of the given user, using a speech recognition model and/or can include a selection of a rendered suggestion that conv…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L13/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 7, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).