Robot for preventing interruption while interacting with user
US-12169410-B2 · Dec 17, 2024 · US
US10317992B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10317992-B2 |
| Application number | US-201414496538-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 25, 2014 |
| Priority date | Sep 25, 2014 |
| Publication date | Jun 11, 2019 |
| Grant date | Jun 11, 2019 |
Official abstract text for this publication.
Improving accuracy in understanding and/or resolving references to visual elements in a visual context associated with a computerized conversational system is described. Techniques described herein leverage gaze input with gestures and/or speech input to improve spoken language understanding in computerized conversational systems. Leveraging gaze input and speech input improves spoken language understanding in conversational systems by improving the accuracy by which the system can resolve references—or interpret a user's intent—with respect to visual elements in a visual context. In at least one example, the techniques herein describe tracking gaze to generate gaze input, recognizing speech input, and extracting gaze features and lexical features from the user input. Based at least in part on the gaze features and lexical features, user utterances directed to visual elements in a visual context can be resolved.
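The fusion of speech and gaze evidence described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the `Element` class, the word-overlap (Jaccard) similarity, the Gaussian kernel used as a stand-in for the gaze heat map, the `sigma` value, and the multiplicative combination are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical visual element: a name, its visible text, and a screen box."""
    name: str
    text: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def lexical_similarity(utterance, text):
    """Jaccard overlap of lowercase word sets (an assumed similarity measure)."""
    a, b = set(utterance.lower().split()), set(text.lower().split())
    return len(a & b) / len(a | b) if a and b else 0.0


def normalize(scores):
    """Scale non-negative scores so they sum to 1; fall back to uniform if all are 0."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def gaze_weight(element, fixations, sigma=50.0):
    """Sum of Gaussian kernels centred on each fixation, evaluated at the box centre.

    This plays the role of reading a value off a gaze heat map (assumed form).
    """
    x0, y0, x1, y1 = element.box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return sum(
        math.exp(-((cx - fx) ** 2 + (cy - fy) ** 2) / (2 * sigma ** 2))
        for fx, fy in fixations
    )


def resolve(utterance, elements, fixations):
    """Return the most likely intended element and the combined probabilities."""
    lexical = normalize([lexical_similarity(utterance, e.text) for e in elements])
    gaze = normalize([gaze_weight(e, fixations) for e in elements])
    combined = normalize([l * g for l, g in zip(lexical, gaze)])
    best = max(range(len(elements)), key=combined.__getitem__)
    return elements[best], combined


# Two buttons with identical text: speech alone cannot tell them apart.
elements = [
    Element("movie_a", "play trailer", (0, 0, 100, 50)),
    Element("movie_b", "play trailer", (300, 0, 400, 50)),
]
fixations = [(340, 30), (355, 20)]  # the user is looking near movie_b
target, probabilities = resolve("play trailer", elements, fixations)
```

In this example the lexical scores are tied, so the gaze term alone decides which element is selected, which is the ambiguity case the abstract is concerned with.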
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: identifying a plurality of visual elements available for user interaction in a visual context on a display; receiving speech input including one or more words spoken by a user; extracting lexical features from the speech input; computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity; receiving, from a tracking component, a gaze input; determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements; determining that a particular visual element of the plurality of visual elements is an intended visual element of the speech input using a combination of a lexical probability of the lexical probabilities and the heat map; determining, by one or more processors, that the speech input comprises a command directed to the particular visual element; and causing an action associated with the particular visual element to be performed.
2. A computer-implemented method as claim 1 recites, wherein the visual context is a free-form web browser or an application interface.
3. A computer-implemented method as claim 1 recites, further comprising receiving head pose input associated with the particular visual element, wherein the head pose input serves as a proxy for the gaze input.
4. A computer-implemented method as recited in claim 1, wherein using the combination of the lexical probability and the heat map includes: determining an area around each visual element of the plurality of visual elements on the display, each area not intersecting other areas of the determined areas; and determining distances from each area to fixation points associated with the heat map.
5. A computer-implemented method as claim 4 recites, further comprising: filtering the individual visual elements based at least in part on the respective calculated probabilities; identifying one or more visual elements that have respective probabilities above a predetermined threshold; and identifying the particular visual element from the one or more visual elements.
6. A computer-implemented method as claim 1 recites, further comprising: identifying a plurality of fixation points associated with the gaze input; grouping a predetermined number of the plurality of fixation points together in a cluster; and identifying a centroid of the cluster as a specific fixation point for extracting gaze features from the gaze input, the gaze features useable to determine that the gaze input is associated with the particular visual element.
7. A computer-implemented method as claim 6 recites, further comprising: computing a start time and an end time of the speech input; and extracting the gaze features based at least in part on: distances between the specific fixation point and an area associated with individual visual elements of the plurality of visual elements; the start time of the speech input; and the end time of the speech input.
8. A computer-implemented method as claim 1 recites, wherein the action comprises one of a selection of the particular visual element or entry of information into the particular visual element.
9. A device comprising: one or more processors; computer-readable media encoded with instructions that, when executed by the one or more processors, configure the device to perform acts comprising: identifying a plurality of visual elements for receiving user interaction in a visual context on a display; determining a user utterance transcribed from speech input comprising one or more words spoken in a particular language, the user utterance comprising a command to perform an action; receiving, from an eye tracking component, gaze input; determining, from the gaze input, a heat map representing a probabilistic model of objects the user is looking at in the visual context on the display, the objects including the plurality of visual elements; extracting lexical features based at least in part on the user utterance; computing, for each visual element of the plurality of visual elements, a lexical similarity between the lexical features and the respective visual element of the plurality of visual elements and a lexical probability for each lexical similarity; extracting gaze features based at least in part on the heat map; and determining that the command to perform the action is directed to an intended visual element using a combination of a lexical probability of the lexical probabilities and the gaze features.
10. A device as recited in claim 9, wherein the acts further comprise determining a bounding box for individual visual elements of the plurality of visual elements, the bounding box comprising an area associated with the individual visual elements.
11. A device as recited in claim 10, wherein the extracting the gaze features comprises computing distances between bounding boxes for the individual visual elements and fixation points associated with the gaze input at predetermined times.
12. A device as recited in claim 9, wherein computing the lexical similarity includes computing a lexical similarity between the one or more words and text associated with individual visual elements of the plurality of visual elements.
13. A device as recited in claim 9, wherein the determining that the command to perform the action is directed the intended visual element comprises classifying the plurality of visual elements based at least in part on applying a binary classifier to at least one of the lexical features or the gaze features.
14. A system comprising: an eye tracking sensor; a display; computer-readable media; one or more processors; and modules stored on the computer-readable media and executable by the one or more processors, the modules comprising: a receiving module configured to receive: speech input comprising one or more words referring to a particular visual element of a plurality of visual elements presented on a user interface of the display; and gaze input from the tracking component, the gaze input directed to one or more of the plurality of visual elements presented on the user interface; an extraction module configured to: determine, from the gaze input, a heat map representing a probabilistic model of objects a user is looking at in a visual context on the display, the objects including the plurality of visual elements; extract lexical features from the speech input; compute, for each visual element of the plurality of visual elements, a lexical similarity between the extracted lexical features and the respective visual element of the plurality of visual elements; and an analysis module configured to compute a lexical probability for each lexical similarity and to identify the particular visual element using a combination of a lexical probability of the lexical probabilities and the heat map.
15. A system as claim 14 recites, wherein the extraction module is configured to determine, using the heat map, a gaze probability for each visual element to be a subject of gaze by the user, and the analysis module is configured to identify the particular visual element using a combination of the lexical probability and the gaze probability for each visual element.
16. A system as claim 14 recites, wherein the extraction module configured to compute, for each visu
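The gaze-feature steps recited in claims 6, 7, 10 and 11 (grouping fixation points into clusters, taking each cluster's centroid, and measuring distances to element bounding boxes within the speech time window) can be sketched as follows. This is an illustrative reading only: the function names, the grouping of consecutive fixations into fixed-size clusters, the `(t, x, y)` sample layout, and the Euclidean point-to-box distance are assumptions, not details taken from the patent.

```python
import math


def in_speech_window(samples, start, end):
    """Keep (x, y) of timestamped fixations (t, x, y) with start <= t <= end."""
    return [(x, y) for t, x, y in samples if start <= t <= end]


def centroid(points):
    """Arithmetic mean of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def cluster_centroids(fixations, size):
    """Group consecutive fixations into clusters of `size` and return each centroid.

    The last cluster may hold fewer than `size` points.
    """
    return [centroid(fixations[i:i + size]) for i in range(0, len(fixations), size)]


def distance_to_box(point, box):
    """Euclidean distance from a point to an axis-aligned box; 0 if the point is inside."""
    px, py = point
    x0, y0, x1, y1 = box
    dx = max(x0 - px, 0.0, px - x1)
    dy = max(y0 - py, 0.0, py - y1)
    return math.hypot(dx, dy)


# Hypothetical timestamped fixations (seconds, x, y); speech spans 1.0 s to 2.0 s.
samples = [(0.5, 900, 900), (1.0, 0, 0), (1.2, 2, 0), (1.6, 4, 4), (1.9, 6, 4)]
fixations = in_speech_window(samples, start=1.0, end=2.0)
centres = cluster_centroids(fixations, size=2)
box = (0, 0, 10, 10)  # bounding box of one visual element
distances = [distance_to_box(c, box) for c in centres]
```

The resulting distances, one per cluster centroid and element, are the kind of numeric gaze features that could then be passed to a classifier such as the binary classifier mentioned in claim 13.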
with means for monitoring data relating to the user, e.g. head-tracking, eye-tracking · CPC title
Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title
Eye tracking input arrangements (G06F3/015 takes precedence) · CPC title
Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer · CPC title
Head tracking input arrangements · CPC title