Automotive visual speech recognition
US-2021065712-A1 · Mar 4, 2021 · US
US11257493B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11257493-B2 |
| Application number | US-201916509029-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 11, 2019 |
| Priority date | Jul 11, 2019 |
| Publication date | Feb 22, 2022 |
| Grant date | Feb 22, 2022 |
Official abstract text for this publication.
Systems and methods for processing speech are described. In certain examples, image data is used to generate visual feature tensors and audio data is used to generate audio feature tensors. The visual feature tensors and the audio feature tensors are used by a linguistic model to determine linguistic features that are usable to parse an utterance of a user. The generation of the feature tensors may be jointly configured with the linguistic model. Systems may be provided in a client-server architecture.
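The data flow in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the pooling and energy functions below are hypothetical stand-ins for the visual and audio feature extractors (which the claims say may be neural networks), the JSON payload format is invented for the example, and the linguistic model itself is not modeled; the server step only assembles the joint input that such a model would receive.

```python
import json


def extract_visual_features(frame, grid=2):
    """Hypothetical stand-in for a visual feature extractor.

    Average-pools a grayscale frame (list of equal-length rows) into
    grid * grid values, i.e. a compressed representation of the frame.
    """
    block_h = len(frame) // grid
    block_w = len(frame[0]) // grid
    features = []
    for gy in range(grid):
        for gx in range(grid):
            block = [
                frame[y][x]
                for y in range(gy * block_h, (gy + 1) * block_h)
                for x in range(gx * block_w, (gx + 1) * block_w)
            ]
            features.append(sum(block) / len(block))
    return features


def extract_audio_features(samples, window=4):
    """Hypothetical stand-in for an audio feature extractor.

    Returns the mean energy of each non-overlapping window of samples.
    """
    features = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        features.append(sum(s * s for s in chunk) / window)
    return features


def client_payload(frame, samples):
    """Client side: send feature tensors rather than raw frames."""
    return json.dumps({
        "visual": extract_visual_features(frame),
        "audio": extract_audio_features(samples),
    })


def server_joint_input(payload):
    """Server side: rebuild the joint input for a linguistic model."""
    data = json.loads(payload)
    return data["visual"] + data["audio"]


if __name__ == "__main__":
    frame = [[0, 0, 4, 4], [0, 0, 4, 4], [8, 8, 12, 12], [8, 8, 12, 12]]
    samples = [1, 1, 1, 1, 2, 2, 2, 2]
    print(server_joint_input(client_payload(frame, samples)))
```

In this toy setup the 16-pixel frame is reduced to four numbers before transmission, which mirrors the idea of sending a compressed representation to the server instead of the image data itself.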
Opening claim text (preview).
What is claimed is:
1. A client device for processing speech comprising: an audio capture device to capture audio data associated with an utterance from a user; an image capture device to capture frames of image data, the image data featuring an environment of the user; a visual feature extractor to receive the frames of image data from the image capture device and to generate one or more visual feature tensors, the visual feature tensors providing a compressed representation of the frames of image data; an audio feature extractor to receive the audio data from the audio capture device and to generate one or more audio feature tensors; and a transmitter to transmit the visual feature tensors and the audio feature tensors to a server device, the server device being configured to supply at least the visual feature tensors and the audio feature tensors to a linguistic model, the linguistic model being configured to determine linguistic features that are usable to parse the utterance, wherein the visual feature extractor and the audio feature extractor are jointly configured with the linguistic model.
2. The client device of claim 1, wherein one or more of the visual feature extractor and the audio feature extractor comprise a neural network architecture.
3. The client device of claim 1, wherein the visual feature tensors comprise a numeric representation of a visual context for the environment, and wherein the transmitter is configured to transmit the audio data to the server device with the audio feature tensors, the linguistic model of the server device being configured, using the audio and visual feature tensors, to determine linguistic features based on the audio data.
4. The client device of claim 1, wherein the image data comprises video data, the audio data being temporally correlated with the video data, and wherein the visual feature extractor and the audio feature extractor are applied in parallel to the video data and the audio data.
5. The client device of claim 1, wherein the visual feature extractor comprises: a first convolutional neural network architecture comprising a plurality of layers including a first input layer to receive a frame of image data and a first output layer, wherein the convolutional neural network architecture is parameterized using a set of trained parameters for each of the plurality of layers, the set of trained parameters being derived from a training operation with one or more additional classification layers coupled to the first output layer; and a second neural network architecture comprising one or more layers including a second input layer and a second output layer, the second input layer being coupled to the first output layer of the convolutional neural network architecture, the second output layer having a dimensionality that is less than the dimensionality of the first output layer.
6. The client device of claim 5, wherein the second neural network architecture is jointly trained with the audio feature extractor and the linguistic model in a training operation, the set of trained parameters for the first convolutional neural network architecture being fixed during the training operation.
7. A server device for processing speech comprising: a receiver to receive one or more visual feature tensors and one or more audio feature tensors from a client device, the visual feature tensors being generated by a visual feature extractor of the client device based on frames of image data captured by the client device, the frames of image data featuring an environment of the client device and the visual feature tensors providing a compressed representation of the frames of image data, the audio feature tensors being generated by an audio feature extractor of the client device based on corresponding audio data captured by the client device in association with an utterance of a user; and a linguistic model to receive the visual feature tensors and the audio feature tensors and to determine linguistic features that are usable to parse the utterance, wherein the linguistic model is jointly configured with the visual feature extractor and the audio feature extractor of the client device.
8. The server device of claim 7, comprising: an attention pre-processor to apply a weighting to the audio and visual feature tensors prior to use by the linguistic model.
9. The server device of claim 7, wherein the linguistic model comprises a neural network architecture that receives the audio and visual feature tensors as an input and that outputs a text representation of the utterance.
10. The server device of claim 7, wherein: the audio feature tensors comprise a representation of an audio context for the environment and the visual feature tensors comprise a representation of a visual context for the environment, the receiver of the server device is configured to receive the audio data in addition to the audio feature tensors, and the linguistic model comprises an acoustic model to generate phoneme data for use in parsing the utterance from the audio data, the acoustic model being configured based on the audio and visual feature tensors.
11. The server device of claim 10, wherein the acoustic model comprises: a database of acoustic model configurations; an acoustic model selector to select an acoustic model configuration from the database based on a joint set of the audio and visual feature tensors; and an acoustic model instance to process the audio data, the acoustic model instance being instantiated based on the acoustic model configuration selected by the acoustic model selector, the acoustic model instance being configured to generate the phoneme data for use in parsing the utterance.
12. The server device of claim 10, wherein the linguistic model further comprises: a language model communicatively coupled to the acoustic model to receive the phoneme data and to generate text data representing the utterance, wherein the language model is configured to receive the audio feature tensors and the visual feature tensors as an input for use in generating the text data representing the utterance.
13. A method for processing speech at a client device, the method comprising: capturing, at the client device, audio data associated with an utterance from a user; capturing, at the client device, image data featuring an environment of the user; extracting, using a visual feature extractor at the client device, a set of visual feature tensors from one or more frames of the image data, the set of visual feature tensors providing a compressed representation of the frames of image data; extracting, using an audio feature extractor at the client device, a set of audio feature tensors from the audio data; and transmitting, at the client device, the set of audio and visual feature tensors to a server device, the server device being configured to supply at least the visual feature tensors and the audio feature tensors to a linguistic model, the linguistic model being configured to determine a set of linguistic features that are usable to parse the utterance, wherein the visual feature extractor and the audio feature extractor are jointly configured with the linguistic model.
14. The method of claim 13, comprising: receiving, at the client device, a response to the utterance from the server device; and providing, at the client device, a response to the user based on the response to the utterance received from the server device.
15. The method of claim 13, wherein extracting, using the visual feature extractor, comprises: providing data derived from the captured image data t
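As an illustration of the acoustic model selector described in claim 11, the sketch below picks a configuration from a small database based on the joint audio and visual feature vector. The claims do not say how the selection is made; nearest-centroid matching is an assumption chosen for simplicity, and the configuration names, centroid values, and settings are all invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class AcousticModelConfig:
    """Hypothetical database entry; every field value below is invented."""
    name: str
    centroid: list   # reference joint (audio + visual) feature vector
    settings: dict   # placeholder settings for an acoustic model instance


# Hypothetical "database of acoustic model configurations".
DATABASE = [
    AcousticModelConfig("quiet_cabin", [0.1, 0.1, 0.9],
                        {"noise_suppression": "low"}),
    AcousticModelConfig("windows_open", [0.9, 0.8, 0.2],
                        {"noise_suppression": "high"}),
]


def select_config(audio_features, visual_features, database=DATABASE):
    """Return the configuration whose centroid is closest to the joint vector.

    Nearest-centroid selection is an assumption, not the patented method.
    The joint vector must have the same length as each centroid.
    """
    joint = list(audio_features) + list(visual_features)
    return min(database, key=lambda cfg: math.dist(joint, cfg.centroid))


if __name__ == "__main__":
    chosen = select_config(audio_features=[0.2, 0.1], visual_features=[0.8])
    print(chosen.name, chosen.settings)
```

An acoustic model instance would then be created from the selected configuration and applied to the audio data; that step is omitted here because the claims preview gives no detail about it.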
Speech recognition using non-acoustical features · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Validation; Performance evaluation; Active pattern learning techniques · CPC title