Vision-assisted speech processing

US11257493B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11257493-B2
Application number  US-201916509029-A
Country             US
Kind code           B2
Filing date         Jul 11, 2019
Priority date       Jul 11, 2019
Publication date    Feb 22, 2022
Grant date          Feb 22, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for processing speech are described. In certain examples, image data is used to generate visual feature tensors and audio data is used to generate audio feature tensors. The visual feature tensors and the audio feature tensors are used by a linguistic model to determine linguistic features that are usable to parse an utterance of a user. The generation of the feature tensors may be jointly configured with the linguistic model. Systems may be provided in a client-server architecture.
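The data flow in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the grid-averaging visual extractor, the windowed-energy audio extractor, and the JSON payload are hypothetical stand-ins, not the extractors or transport the patent describes. The sketch shows only that the client sends compact feature tensors rather than raw frames, and that the server combines them into one input for a linguistic model. It does not attempt the joint configuration (joint training) of the extractors with that model.

```python
import json
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def visual_features(frame: Frame, grid: int = 2) -> List[float]:
    """Hypothetical visual extractor: mean intensity of each grid cell.

    A 4x4 frame (16 values) becomes a 4-value tensor, i.e. a compressed
    representation of the frame.
    """
    height, width = len(frame), len(frame[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            cell = [frame[r][c] for r in rows for c in cols]
            features.append(sum(cell) / len(cell))
    return features


def audio_features(samples: List[float], window: int = 4) -> List[float]:
    """Hypothetical audio extractor: mean energy of each full window."""
    return [
        sum(s * s for s in samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


def client_payload(frame: Frame, samples: List[float]) -> str:
    """Client side: serialise both feature tensors for transmission."""
    return json.dumps(
        {"visual": visual_features(frame), "audio": audio_features(samples)}
    )


def server_joint_input(payload: str) -> List[float]:
    """Server side: rebuild the tensors and concatenate them into the
    input that a linguistic model (not implemented here) would consume."""
    data = json.loads(payload)
    return data["visual"] + data["audio"]


frame = [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
samples = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
joint = server_joint_input(client_payload(frame, samples))
```

In a real system the two extractors would typically be learned networks, and the concatenation step would be replaced by whatever input format the linguistic model was trained on.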

First claim

Opening claim text (preview).

What is claimed is:

1. A client device for processing speech comprising: an audio capture device to capture audio data associated with an utterance from a user; an image capture device to capture frames of image data, the image data featuring an environment of the user; a visual feature extractor to receive the frames of image data from the image capture device and to generate one or more visual feature tensors, the visual feature tensors providing a compressed representation of the frames of image data; an audio feature extractor to receive the audio data from the audio capture device and to generate one or more audio feature tensors; and a transmitter to transmit the visual feature tensors and the audio feature tensors to a server device, the server device being configured to supply at least the visual feature tensors and the audio feature tensors to a linguistic model, the linguistic model being configured to determine linguistic features that are usable to parse the utterance, wherein the visual feature extractor and the audio feature extractor are jointly configured with the linguistic model.

2. The client device of claim 1, wherein one or more of the visual feature extractor and the audio feature extractor comprise a neural network architecture.

3. The client device of claim 1, wherein the visual feature tensors comprise a numeric representation of a visual context for the environment, and wherein the transmitter is configured to transmit the audio data to the server device with the audio feature tensors, the linguistic model of the server device being configured, using the audio and visual feature tensors, to determine linguistic features based on the audio data.

4. The client device of claim 1, wherein the image data comprises video data, the audio data being temporally correlated with the video data, and wherein the visual feature extractor and the audio feature extractor are applied in parallel to the video data and the audio data.

5. The client device of claim 1, wherein the visual feature extractor comprises: a first convolutional neural network architecture comprising a plurality of layers including a first input layer to receive a frame of image data and a first output layer, wherein the convolutional neural network architecture is parameterized using a set of trained parameters for each of the plurality of layers, the set of trained parameters being derived from a training operation with one or more additional classification layers coupled to the first output layer; and a second neural network architecture comprising one or more layers including a second input layer and a second output layer, the second input layer being coupled to the first output layer of the convolutional neural network architecture, the second output layer having a dimensionality that is less than the dimensionality of the first output layer.

6. The client device of claim 5, wherein the second neural network architecture is jointly trained with the audio feature extractor and the linguistic model in a training operation, the set of trained parameters for the first convolutional neural network architecture being fixed during the training operation.

7. A server device for processing speech comprising: a receiver to receive one or more visual feature tensors and one or more audio feature tensors from a client device, the visual feature tensors being generated by a visual feature extractor of the client device based on frames of image data captured by the client device, the frames of image data featuring an environment of the client device and the visual feature tensors providing a compressed representation of the frames of image data, the audio feature tensors being generated by an audio feature extractor of the client device based on corresponding audio data captured by the client device in association with an utterance of a user; and a linguistic model to receive the visual feature tensors and the audio feature tensors and to determine linguistic features that are usable to parse the utterance, wherein the linguistic model is jointly configured with the visual feature extractor and the audio feature extractor of the client device.

8. The server device of claim 7, comprising: an attention pre-processor to apply a weighting to the audio and visual feature tensors prior to use by the linguistic model.

9. The server device of claim 7, wherein the linguistic model comprises a neural network architecture that receives the audio and visual feature tensors as an input and that outputs a text representation of the utterance.

10. The server device of claim 7, wherein: the audio feature tensors comprise a representation of an audio context for the environment and the visual feature tensors comprise a representation of a visual context for the environment, the receiver of the server device is configured to receive the audio data in addition to the audio feature tensors, and the linguistic model comprises an acoustic model to generate phoneme data for use in parsing the utterance from the audio data, the acoustic model being configured based on the audio and visual feature tensors.

11. The server device of claim 10, wherein the acoustic model comprises: a database of acoustic model configurations; an acoustic model selector to select an acoustic model configuration from the database based on a joint set of the audio and visual feature tensors; and an acoustic model instance to process the audio data, the acoustic model instance being instantiated based on the acoustic model configuration selected by the acoustic model selector, the acoustic model instance being configured to generate the phoneme data for use in parsing the utterance.

12. The server device of claim 10, wherein the linguistic model further comprises: a language model communicatively coupled to the acoustic model to receive the phoneme data and to generate text data representing the utterance, wherein the language model is configured to receive the audio feature tensors and the visual feature tensors as an input for use in generating the text data representing the utterance.

13. A method for processing speech at a client device, the method comprising: capturing, at the client device, audio data associated with an utterance from a user; capturing, at the client device, image data featuring an environment of the user; extracting, using a visual feature extractor at the client device, a set of visual feature tensors from one or more frames of the image data, the set of visual feature tensors providing a compressed representation of the frames of image data; extracting, using an audio feature extractor at the client device, a set of audio feature tensors from the audio data; and transmitting, at the client device, the set of audio and visual feature tensors to a server device, the server device being configured to supply at least the visual feature tensors and the audio feature tensors to a linguistic model, the linguistic model being configured to determine a set of linguistic features that are usable to parse the utterance, wherein the visual feature extractor and the audio feature extractor are jointly configured with the linguistic model.

14. The method of claim 13, comprising: receiving, at the client device, a response to the utterance from the server device; and providing, at the client device, a response to the user based on the response to the utterance received from the server device.

15. The method of claim 13, wherein extracting, using the visual feature extractor, comprises: providing data derived from the captured image data t
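Two of the server-side elements in the claims, the attention pre-processor (claim 8) and the acoustic model selector (claim 11), can be sketched in a few lines. This is an illustrative sketch under stated assumptions and not the claimed implementation: the softmax weighting, the nearest-reference selection rule, the configuration names, and the reference tensors are all hypothetical. The claims say only that a weighting is applied and that a configuration is selected from a database based on the joint set of audio and visual feature tensors.

```python
import math
from typing import Dict, List


def attention_weight(tensor: List[float], scores: List[float]) -> List[float]:
    """Hypothetical attention pre-processor: scale each tensor element by
    a softmax weight derived from its score. How scores are produced is
    not specified in the claims; here they are simply passed in."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x * e / total for x, e in zip(tensor, exps)]


# Hypothetical database of acoustic model configurations. Each name maps
# to a made-up reference joint tensor (two audio values, one visual value).
ACOUSTIC_CONFIGS: Dict[str, List[float]] = {
    "quiet_indoor": [0.1, 0.1, 0.9],
    "noisy_vehicle": [0.9, 0.8, 0.2],
}


def select_acoustic_config(audio: List[float], visual: List[float]) -> str:
    """Hypothetical selector: pick the configuration whose reference
    tensor is closest (squared Euclidean distance) to the joint tensor."""
    joint = audio + visual

    def distance(reference: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(joint, reference))

    return min(ACOUSTIC_CONFIGS, key=lambda name: distance(ACOUSTIC_CONFIGS[name]))


weighted = attention_weight([2.0, 2.0], [0.0, 0.0])
chosen = select_acoustic_config(audio=[0.8, 0.7], visual=[0.3])
```

With equal scores the softmax weights are uniform, so each element is simply halved. With the made-up reference tensors above, the loud audio context and dim visual context select the "noisy_vehicle" entry. A production selector would more plausibly be a trained classifier than a distance lookup.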

Assignees

  • Soundhound Inc

Inventors

Classifications

  • G10L15/24 (Primary)

    Speech recognition using non-acoustical features · CPC title

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • G10L15/22 (Primary)

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Validation; Performance evaluation; Active pattern learning techniques · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11257493B2 cover?
Systems and methods for processing speech are described. In certain examples, image data is used to generate visual feature tensors and audio data is used to generate audio feature tensors. The visual feature tensors and the audio feature tensors are used by a linguistic model to determine linguistic features that are usable to parse an utterance of a user. The generation of the feature tensors…
Who is the assignee on this patent?
Soundhound Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/24. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 22, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).