System and method for continuous multimodal speech and gesture interaction

US9710223B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9710223-B2
Application number  US-201514875105-A
Country             US
Kind code           B2
Filing date         Oct 5, 2015
Priority date       Dec 1, 2011
Publication date    Jul 18, 2017
Grant date          Jul 18, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing multimodal input. A system configured to practice the method continuously monitors an audio stream associated with a gesture input stream, and detects a speech event in the audio stream. Then the system identifies a temporal window associated with a time of the speech event, and analyzes data from the gesture input stream within the temporal window to identify a gesture event. The system processes the speech event and the gesture event to produce a multimodal command. The gesture in the gesture input stream can be directed to a display, but is remote from the display. The system can analyze the data from the gesture input stream by calculating an average of gesture coordinates within the temporal window.
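
The windowing and averaging step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `GestureSample` type, the `average_gesture_in_window` function, and the 0.5-second window extents are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GestureSample:
    """One timestamped point from a gesture input stream (hypothetical type)."""
    t: float  # time in seconds
    x: float  # horizontal coordinate
    y: float  # vertical coordinate


def average_gesture_in_window(
    samples: List[GestureSample],
    speech_time: float,
    before: float = 0.5,  # assumed window extent before the speech event
    after: float = 0.5,   # assumed window extent after the speech event
) -> Optional[Tuple[float, float]]:
    """Average the gesture coordinates that fall inside the temporal window.

    Returns None when no sample falls inside the window.
    """
    start, end = speech_time - before, speech_time + after
    in_window = [s for s in samples if start <= s.t <= end]
    if not in_window:
        return None
    n = len(in_window)
    return (sum(s.x for s in in_window) / n, sum(s.y for s in in_window) / n)


samples = [
    GestureSample(0.0, 10.0, 10.0),
    GestureSample(1.0, 100.0, 200.0),
    GestureSample(1.2, 110.0, 220.0),
    GestureSample(3.0, 500.0, 500.0),
]
point = average_gesture_in_window(samples, speech_time=1.1)
print(point)  # (105.0, 210.0): only the samples at t=1.0 and t=1.2 are in the window
```

Averaging is only one way to reduce the windowed samples to a single point; it is used here because the abstract names it explicitly.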

First claim

Opening claim text (preview).

We claim:

1. A method comprising: monitoring an audio stream associated with a non-tactile gesture input stream; identifying a speech event, in the audio stream, from a first user; determining a temporal window associated with a time of the speech event, wherein the temporal window extends forward and backward from the time of the speech event; analyzing, via a processor, data from the non-tactile gesture input stream within the temporal window to identify, based on the speech event, a non-tactile gesture event; identifying clarifying information, in the audio stream, about the speech event from a second user; applying the clarifying information to the speech event to yield a clarification; and processing, based on the clarification, the speech event and the non-tactile gesture event to produce a multimodal command.
2. The method of claim 1, wherein a non-tactile gesture in the non-tactile gesture input stream is directed to a display, but is remote from the display.
3. The method of claim 1, wherein analyzing of the data from the non-tactile gesture input stream further comprises calculating an average of non-tactile gesture coordinates within the temporal window.
4. The method of claim 1, wherein the speech event comprises a speech command and wherein processing of the speech event and the non-tactile gesture event further comprises: identifying parameters from the non-tactile gesture event; and applying the parameters and the clarifying information to the speech command.
5. The method of claim 4, wherein a gesture filtering module focuses the temporal window based on a timing of specific words in the speech event and the speech command.
6. The method of claim 1, further comprising executing the multimodal command.
7. The method of claim 1, wherein one of a length and a position of the temporal window is based on a type of the speech event.
8. The method of claim 1, wherein the speech event is detected in the audio stream without an explicit user activation via one of a button press and a touch gesture.
9. The method of claim 1, wherein the non-tactile gesture input stream comprises input to one of a motion detector, a motion capture system, a camera, and an infrared camera.
10. The method of claim 1, wherein the audio stream comprises input from one of a microphone and an array of microphones.
11. A system comprising: a processor; and a computer-readable storage medium having instructions stored which, when executed by the processor, cause the processor to perform operations comprising: monitoring an audio stream associated with a non-tactile gesture input stream; identifying a speech event, in the audio stream, from a first user; determining a temporal window associated with a time of the speech event, wherein the temporal window extends forward and backward from the time of the speech event; analyzing data from the non-tactile gesture input stream within the temporal window to identify, based on the speech event, a non-tactile gesture event; identifying clarifying information, in the audio stream, about the speech event from a second user; applying the clarifying information to the speech event to yield a clarification; and processing, based on the clarification, the speech event and the non-tactile gesture event to produce a multimodal command.
12. The system of claim 11, wherein a non-tactile gesture in the non-tactile gesture input stream is directed to a display, but is remote from the display.
13. The system of claim 11, wherein analyzing of the data from the non-tactile gesture input stream further comprises calculating an average of non-tactile gesture coordinates within the temporal window.
14. The system of claim 11, wherein the speech event comprises a speech command and wherein processing of the speech event and the non-tactile gesture event further comprises: identifying parameters from the non-tactile gesture event; and applying the parameters and the clarifying information to the speech command.
15. The system of claim 14, wherein a gesture filtering module focuses the temporal window based on a timing of specific words in the speech event and the speech command.
16. The system of claim 11, further comprising executing the multimodal command.
17. The system of claim 11, wherein one of a length and a position of the temporal window is based on a type of the speech event.
18. The system of claim 11, wherein the speech event is detected in the audio stream without an explicit user activation via one of a button press and a touch gesture.
19. The system of claim 11, wherein the non-tactile gesture input stream comprises input to one of a motion detector, a motion capture system, a camera, and an infrared camera.
20. A computer-readable storage device having instructions stored which, when executed by a computing device, cause the computing device to perform operations comprising: monitoring an audio stream associated with a non-tactile gesture input stream; identifying a speech event, in the audio stream, from a first user; determining a temporal window associated with a time of the speech event, wherein the temporal window extends forward and backward from the time of the speech event; analyzing data from the non-tactile gesture input stream within the temporal window to identify, based on the speech event, a non-tactile gesture event; identifying clarifying information, in the audio stream, about the speech event from a second user; applying the clarifying information to the speech event to yield a clarification; and processing, based on the clarification, the speech event and the non-tactile gesture event to produce a multimodal command.
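
As a rough illustration of how the claimed steps might fit together, the sketch below chooses a temporal window based on the type of speech event and folds a second user's clarifying speech into the resulting command. Everything named here is an assumption made for the example: the `SpeechEvent` type, the `WINDOW_BY_TYPE` table and its values, and the `temporal_window` and `build_multimodal_command` functions do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed per-type window extents (seconds before, seconds after the speech event).
# The values are illustrative only.
WINDOW_BY_TYPE: Dict[str, Tuple[float, float]] = {
    "deictic": (1.0, 0.5),
    "default": (0.5, 0.5),
}


@dataclass
class SpeechEvent:
    """A recognized utterance with its speaker and time (hypothetical type)."""
    user: str
    time: float
    text: str
    kind: str = "default"


def temporal_window(event: SpeechEvent) -> Tuple[float, float]:
    """Return a window extending backward and forward from the speech event."""
    before, after = WINDOW_BY_TYPE.get(event.kind, WINDOW_BY_TYPE["default"])
    return (event.time - before, event.time + after)


def build_multimodal_command(
    event: SpeechEvent,
    gesture_point: Tuple[float, float],
    clarification: Optional[SpeechEvent] = None,
) -> dict:
    """Combine a speech command, gesture parameters, and optional clarification."""
    command = {"action": event.text, "target": gesture_point, "issued_by": event.user}
    # Only speech from a different (second) user is treated as clarification here.
    if clarification is not None and clarification.user != event.user:
        command["clarification"] = clarification.text
        command["clarified_by"] = clarification.user
    return command


first = SpeechEvent("alice", 10.0, "zoom in here", "deictic")
second = SpeechEvent("bob", 11.0, "a little more")
window = temporal_window(first)
command = build_multimodal_command(first, (105.0, 210.0), second)
print(window)
print(command)
```

The lookup table is one simple way to make the window's length and position depend on the speech event type; the claims do not specify how that dependency is implemented.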

Assignees

Nuance Communications Inc

Inventors

Classifications

  • Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer · CPC title

  • Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title

  • G06F3/167 · Primary

    Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title

  • Execution procedure of a spoken command · CPC title

  • G10L15/22 · Primary

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9710223B2 cover?
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for processing multimodal input. A system configured to practice the method continuously monitors an audio stream associated with a gesture input stream, and detects a speech event in the audio stream. Then the system identifies a temporal window associated with a time of the speech event, and analyzes dat…
Who is the assignee on this patent?
Nuance Communications Inc
What technology area does this patent fall under?
Primary CPC classification G06F3/167. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 18, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).