Processing speech signals of a user to generate a visual representation of the user

US11568864B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11568864-B2
Application number | US-201916539701-A
Country | US
Kind code | B2
Filing date | Aug 13, 2019
Priority date | Aug 13, 2018
Publication date | Jan 31, 2023
Grant date | Jan 31, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing device executes a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal, maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
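
The three stages named in the abstract (voice embedding function, modality transfer function, visual representation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the two toy signal features (RMS energy and a zero-crossing frequency estimate), the linear form of the transfer function, the weight and bias values, and the `jaw_width` and `nose_length` parameter names are all hypothetical and carry no anatomical meaning.

```python
import math
from typing import Dict, List

# Hypothetical weights for a linear modality transfer function.
# These numbers are illustrative only and are not from the patent.
HYPOTHETICAL_WEIGHTS = {"rms": 0.0, "zero_cross_hz": -0.001}
HYPOTHETICAL_BIAS = 1.2


def voice_embedding(signal: List[float], sample_rate: int) -> Dict[str, float]:
    """Toy voice embedding: reduce a waveform to a small feature vector."""
    n = len(signal)
    # Root-mean-square energy of the waveform.
    rms = math.sqrt(sum(x * x for x in signal) / n)
    # Count sign changes; two crossings per cycle gives a crude frequency estimate.
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    zero_cross_hz = crossings * sample_rate / (2 * n)
    return {"rms": rms, "zero_cross_hz": zero_cross_hz}


def modality_transfer(features: Dict[str, float]) -> float:
    """Map signal features to one visual feature value (assumed linear)."""
    return HYPOTHETICAL_BIAS + sum(
        HYPOTHETICAL_WEIGHTS[name] * value for name, value in features.items()
    )


def generate_face_parameters(signal: List[float], sample_rate: int) -> Dict[str, float]:
    """Build a parametric 'visual representation' that includes the mapped feature."""
    # Start from neutral parameters and overwrite the one that was inferred.
    face = {"jaw_width": 1.0, "nose_length": 1.0}
    face["jaw_width"] = modality_transfer(voice_embedding(signal, sample_rate))
    return face
```

A caller would pass raw audio samples and a sample rate to `generate_face_parameters` and receive a dictionary of face parameters. A real system would replace each stage with a trained model; the sketch only shows how the stages connect.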

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system for generating image data representing a speaker's face, the computing system comprising: a detection device configured to route data representing a voice signal to one or more processors that generate a response to the voice signal; and a data processing device comprising the one or more processors, the data processing device configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal by performing operations comprising: executing a voice embedding function to generate a feature vector from the voice signal representing one or more signal features of the voice signal; mapping a signal feature of the feature vector to a visual feature representing at least a portion of a physical appearance of a given speaker, the mapping based on a modality transfer function specifying a relationship between the visual feature included in the physical appearance of the given speaker and the signal feature of the feature vector; reconstructing, based on the visual feature that is mapped to the signal feature in the feature vector, a rendered representation of the speaker that approximates a real physical appearance of the speaker by including the visual feature; wherein the reconstructing is performed independent of an image representing the physical appearance of the speaker being provided to the data processing device; and generating, independent of the image representing the physical appearance of the speaker being provided to the data processing device, a visual representation of at least a portion of the speaker based on the rendered representation of the speaker.

2. The computing system of claim 1, wherein parameters of the voice embedding function that specify which of the one or more signal features of the voice signal are included in the feature vector are trained with one or more covariate classifiers that receive image data and voice signals.

3. The computing system of claim 1, further comprising generating an inference of a value for the visual feature based on a known correlation of the one or more signal features of the voice signal to the visual feature of the speaker.

4. The computing system of claim 3, where value for the visual feature comprises a size or relative proportions of articulators and vocal chambers of the speaker.

5. The computing system of claim 1, wherein the visual representation comprises a reconstructed representation of a face of the speaker.

6. The computing system of claim 1, wherein at least one of the one or more signal features of the feature vector comprises a voice quality feature, wherein the voice quality feature is related deterministically to measurements of a vocal tract of the speaker, wherein the measurements of the vocal tract are related to measurements of a face of the speaker, and wherein the data processing device is configured to recreate a geometry and of the face of the speaker based on determining the voice quality feature.

7. The computing system of claim 1, the operations further comprising receiving, from the detection device, data comprising a template face, and modifying the data comprising the template face to incorporate the visual feature.

8. The computing system of claim 1, where the visual feature comprises one or more of a skull structure, a gender of the speaker, an ethnicity of the speaker, a facial landmark of the speaker, or a nose structure of the speaker.

9. The computing system of claim 1, wherein the operations further comprise: generating a facial image of the speaker in two or three dimensions independent of receiving data comprising a template image.

10. The computing system of claim 1, where the voice embedding function comprises a regression function configured to enable the data processing device to generate a statistically plausible face that incorporates the visual feature.

11. The computing system of claim 1, where the voice embedding function comprises a generative model configured to enable the data processing device to generate a statistically plausible face that incorporates the visual feature.

12. The computing system of claim 1, wherein the data processing device is configured to receive auxiliary data about the speaker comprising an age, a height, a gender, an ethnicity, or a body-mass index (BMI) value.

13. The computing system of claim 12, wherein the data processing device is configured to estimate one or more body indices of the speaker based on the auxiliary data, wherein the visual representation of the speaker comprises a full-body representation based on the one or more body indices.

14. The computing system of claim 13, where the body indices are represented by a vector that includes a number of linear and volumetric characterizations of a body of the speaker.

15. The computing system of claim 13, wherein a relation between visual features and the body indices is modelled by a neural network that is trained from training data comprising at least one of image data representing faces of speakers and voice signals.

16. A computing system for generating a voice signal, the computing system comprising: a detection device configured to route data comprising an image of a speaker to one or more processors that generate a response to the image data; and a data processing device comprising the one or more processors, the data processing device configured to, in response to receiving the image of the speaker, generate a simulation of a voice signal that approximates a real voice of the speaker comprising a vocal resonance value, a vocal anti-resonance value, a pitch value, or a glottal waveform, the generating being based on performing operations comprising: executing a face embedding function to generate a feature vector from the image data representing geometric facial features of the speaker represented in the image, the facial features comprising a portion of a facial geometry; mapping the portion of the facial geometry of the feature vector to a signal feature of the voice signal by a modality transfer function specifying a relationship between the geometric facial features of the image and the signal feature of the voice signal, the signal feature comprising at least one of a vocal resonance value, a vocal anti-resonance value, a pitch, or an estimated glottal flow waveform; and generating, based on the mapping, the voice signal to simulate the voice that approximates the real voice of the speaker in the image, the voice signal comprising the signal feature corresponding to the vocal resonance value, the vocal anti-resonance value, the pitch value, or the glottal flow waveform.

17. The computing system of claim 16, wherein mapping comprises: determining, by voice quality generation logic, a voice quality of the voice signal comprising one or more spectral features; and determining, by content generator logic, a style of the voice signal, a language of the voice signal, or an accent for the voice signal that includes the one or more spectral features.

18. The computing system of claim 17, where the voice quality generator logic is configured to map visual features derived from facial images to estimates of one or more subcomponents of voice quality.

19. The computing system of claim 17, wherein the voice quality generation logic determines the voice quality based on training data comprising facial image-voice quality pairs.

20. The computing system of claim 17, wherein the voice quality generation logic d
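
The reverse direction recited in claim 16 (face embedding, modality transfer to a signal feature, voice signal generation) can be sketched in the same spirit. This is a toy illustration under stated assumptions, not the claimed method: the landmark names, the single width-to-height ratio used as the feature vector, the linear mapping to pitch with its `slope` and `intercept` values, and the pure sine-wave output are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def face_embedding(landmarks: Dict[str, Point]) -> List[float]:
    """Toy face embedding: one geometric ratio computed from 2D landmarks."""
    # Landmark names are hypothetical placeholders for a real landmark detector.
    jaw_width = math.dist(landmarks["jaw_left"], landmarks["jaw_right"])
    face_height = math.dist(landmarks["chin"], landmarks["brow"])
    return [jaw_width / face_height]


def map_to_pitch(feature_vector: List[float],
                 slope: float = -100.0,
                 intercept: float = 250.0) -> float:
    """Assumed linear modality transfer from facial geometry to pitch in Hz."""
    return intercept + slope * feature_vector[0]


def synthesize(pitch_hz: float,
               sample_rate: int = 8000,
               duration_s: float = 0.5) -> List[float]:
    """Generate a sine wave at the mapped pitch as a stand-in voice signal."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * pitch_hz * i / sample_rate) for i in range(n)]
```

Chaining the calls as `synthesize(map_to_pitch(face_embedding(landmarks)))` yields a list of samples whose only voice-like property is its pitch. Resonance, anti-resonance, and glottal flow features named in the claim are left out of the sketch.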

Assignees

  • Univ Carnegie Mellon

Inventors

Classifications

  • Creating or editing images; Combining images with text · CPC title

  • Speech synthesis; Text to speech systems · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • G10L15/22 · Primary

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11568864B2 cover?
A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors and a data processing device comprising the one or more processors configured to generate a representation of a speaker that generated the voice signal in response to receiving the voice signal. The data processing …
Who is the assignee on this patent?
Univ Carnegie Mellon
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 31, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).