Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G06F3/013. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 11, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Eye gaze for spoken language understanding in multi-modal conversational interactions

US10317992B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10317992-B2
Application number	US-201414496538-A
Country	US
Kind code	B2
Filing date	Sep 25, 2014
Priority date	Sep 25, 2014
Publication date	Jun 11, 2019
Grant date	Jun 11, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems. Leveraging gaze input and speech input improves spoken language understanding in conversational systems by improving the accuracy by which the system can resolve references—or interpret a user's intent—with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
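The abstract describes combining lexical features from speech with gaze features to decide which on-screen element an utterance refers to. The following is a minimal sketch of that idea, not the patented implementation. Everything concrete in it is an assumption made for illustration: word-overlap (Jaccard) similarity as the lexical feature, a Gaussian falloff around element centers as a stand-in for a gaze heat map, the `sigma` value, the product rule for combining the two probabilities, and all function and field names.

```python
import math


def normalize(scores):
    """Scale non-negative scores to sum to 1; fall back to uniform if all are 0."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def lexical_probs(utterance, labels):
    """Assumed lexical feature: Jaccard word overlap between utterance and label."""
    words = set(utterance.lower().split())
    sims = []
    for label in labels:
        label_words = set(label.lower().split())
        union = words | label_words
        sims.append(len(words & label_words) / len(union) if union else 0.0)
    return normalize(sims)


def gaze_probs(fixations, centers, sigma=50.0):
    """Assumed heat map: each fixation adds Gaussian-weighted mass to nearby elements."""
    mass = []
    for cx, cy in centers:
        m = sum(
            math.exp(-((fx - cx) ** 2 + (fy - cy) ** 2) / (2 * sigma ** 2))
            for fx, fy in fixations
        )
        mass.append(m)
    return normalize(mass)


def resolve(utterance, elements, fixations):
    """Return the most likely target element and the combined probabilities."""
    lex = lexical_probs(utterance, [e["text"] for e in elements])
    gaze = gaze_probs(fixations, [e["center"] for e in elements])
    # Assumed combination rule: product of the two probabilities, renormalized.
    combined = normalize([l * g for l, g in zip(lex, gaze)])
    best = max(range(len(elements)), key=lambda i: combined[i])
    return elements[best], combined
```

With two elements labeled "play trailer" and "play movie", the utterance "play that" matches both labels equally, so in this sketch the fixation points break the tie.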

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: identifying a plurality of visual elements available for user interaction in a visual context on a display; receiving speech input including one or more words spoken by a user; extracting lexical features from the speech input; computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity; receiving, from a tracking component, a gaze input; determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements; determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map; determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and causing an action associated with the particular visual element to be performed.
2. A computer-implemented method as claim 1 recites, wherein the visual context is a free-form web browser or an application interface.
3. A computer-implemented method as claim 1 recites, further comprising receiving head pose input associated with the particular visual element, wherein the head pose input serves as a proxy for the gaze input.
4. A computer-implemented method as recited in claim 1, wherein using the combination of the lexical probability and the heat map includes: determining an area around each visual element of the plurality of visual elements on the display, each area not intersecting other areas of the determined areas; and determining distances from each area to fixation points associated with the heat map.
5. A computer-implemented method as claim 4 recites, further comprising: filtering the individual visual elements based at least in part on the respective calculated probabilities; identifying one or more visual elements that have respective probabilities above a predetermined threshold; and identifying the particular visual element from the one or more visual elements.
6. A computer-implemented method as claim 1 recites, further comprising: identifying a plurality of fixation points associated with the gaze input; grouping a predetermined number of the plurality of fixation points together in a cluster; and identifying a centroid of the cluster as a specific fixation point for extracting gaze features from the gaze input, the gaze features useable to determine that the gaze input is associated with the particular visual element.
7. A computer-implemented method as claim 6 recites, further comprising: computing a start time and an end time of the speech input; and extracting the gaze features based at least in part on: distances between the specific fixation point and an area associated with individual visual elements of the plurality of visual elements; the start time of the speech input; and the end time of the speech input.
8. A computer-implemented method as claim 1 recites, wherein the action comprises one of a selection of the particular visual element or entry of information into the particular visual element.
9. A device comprising: one or more processors; computer-readable media encoded with instructions that, when executed by the one or more processors, configure the device to perform acts comprising: identifying a plurality of visual elements for receiving user interaction in a visual context on a display; determining a user utterance transcribed from speech input comprising one or more words spoken in a particular language, the user utterance comprising a command to perform an action; receiving, from an eye tracking component, gaze input; determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements; extracting lexical features based at least in part on the user utterance; computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity; extracting gaze features based at least in part on the heat map; and determining that the command to perform the action is directed to an intended visual element using a combination of a lexical probability of the lexical probabilities and the gaze features.
10. A device as recited in claim 9, wherein the acts further comprise determining a bounding box for individual visual elements of the plurality of visual elements, the bounding box comprising an area associated with the individual visual elements.
11. A device as recited in claim 10, wherein the extracting the gaze features comprises computing distances between bounding boxes for the individual visual elements and fixation points associated with the gaze input at predetermined times.
12. A device as recited in claim 9, wherein computing the lexical similarity includes computing a lexical similarity between the one or more words and text associated with individual visual elements of the plurality of visual elements.
13. A device as recited in claim 9, wherein the determining that the command to perform the action is directed the intended visual element comprises classifying the plurality of visual elements based at least in part on applying a binary classifier to at least one of the lexical features or the gaze features.
14. A system comprising: an eye tracking sensor; a display; computer-readable media; one or more processors; and modules stored on the computer-readable media and executable by the one or more processors, the modules comprising: a receiving module configured to receive: speech input comprising one or more words referring to a particular visual element of a plurality of visual elements presented on a user interface of the display; and gaze input from the tracking component, the gaze input directed to one or more of the plurality of visual elements presented on the user interface; an extraction module configured to: determine, from the gaze input, a heat map representing a probabilistic model of objects a user is looking at in a visual context on the display, the objects including the plurality of visual elements; extract lexical features from the speech input; compute, for each visual element of the plurality of visual elements, a lexical similarity between the extracted lexical features and the respective visual element of the plurality of visual elements; and an analysis module configured to compute a lexical probability for each lexical similarity and to identify the particular visual element using a combination of a lexical probability of the lexical probabilities and the heat map.
15. A system as claim 14 recites, wherein the extraction module is configured to determine, using the heat map, a gaze probability for each visual element to be a subject of gaze by the user, and the analysis module is configured to identify the particular visual element using a combination of the lexical probability and the gaze probability for each visual element.
16. A system as claim 14 recites, wherein the extraction module configured to compute, for each visu
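Claims 6, 7, 10 and 11 refer to grouping fixation points into a cluster, taking the cluster centroid as a single fixation point, and measuring distances from it to areas (bounding boxes) around visual elements, with reference to the start and end time of the speech input. The sketch below is one possible reading of those steps, offered only as an illustration. The choices that fixations are `(time, x, y)` tuples, that only fixations inside the speech window are used, that the cluster is the last `k` of them, and that boxes are `(left, top, right, bottom)` tuples are all assumptions, as are the function names.

```python
import math


def cluster_centroid(fixations, start, end, k=3):
    """Centroid of the last k fixations whose timestamp lies in [start, end].

    fixations: list of (time, x, y) tuples (assumed layout).
    k: the predetermined cluster size (assumed value).
    """
    in_window = [(x, y) for t, x, y in fixations if start <= t <= end]
    if not in_window:
        raise ValueError("no fixation points inside the speech window")
    cluster = in_window[-k:]
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return (cx, cy)


def distance_to_box(point, box):
    """Euclidean distance from a point to an axis-aligned box; 0 if inside.

    box: (left, top, right, bottom) in display coordinates (assumed layout).
    """
    x, y = point
    left, top, right, bottom = box
    dx = max(left - x, 0.0, x - right)
    dy = max(top - y, 0.0, y - bottom)
    return math.hypot(dx, dy)


def gaze_features(fixations, boxes, start, end, k=3):
    """Map each element name to the distance between its box and the centroid."""
    point = cluster_centroid(fixations, start, end, k)
    return {name: distance_to_box(point, box) for name, box in boxes.items()}
```

A smaller distance suggests the element is a more likely gaze target; how such distances are turned into probabilities or classifier inputs is not shown in this preview of the claims.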

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G02B27/0093
with means for monitoring data relating to the user, e.g. head-tracking, eye-tracking · CPC title
G06F3/167
Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title
G06F3/013 (Primary)
Eye tracking input arrangements (G06F3/015 takes precedence) · CPC title
G06F2203/0381
Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer · CPC title
G06F3/012 (Primary)
Head tracking input arrangements · CPC title

Patent family

Related publications grouped by family.

View patent family 54291650

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10317992B2 cover?: Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems. Leveraging gaze input and speech input improves spoken language …