US12406316B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12406316-B2 |
| Application number | US-202318185258-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 16, 2023 |
| Priority date | Apr 20, 2018 |
| Publication date | Sep 2, 2025 |
| Grant date | Sep 2, 2025 |
Official abstract text for this publication.
In one embodiment, a method includes receiving at a head-mounted device a speech input from a user and a visual input captured by cameras of the head-mounted device, wherein the visual input comprises subjects and attributes associated with the subjects, and wherein the speech input comprises a co-reference to one or more of the subjects, resolving entities corresponding to the subjects associated with the co-reference based on the attributes and the co-reference, and presenting a communication content responsive to the speech input and the visual input at the head-mounted device, wherein the communication content comprises information associated with executing results of tasks corresponding to the resolved entities.
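The resolving step in the abstract can be illustrated with a small sketch. The abstract does not say how the co-reference is matched against the attributes, so the token-overlap scoring below is purely an assumed stand-in, and the names `VisualConcept` and `resolve_coreference` are hypothetical. It also assumes the speech input has already been transcribed to text and that a vision model has already produced lowercase labels and attributes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualConcept:
    """A subject detected in the camera view (hypothetical structure)."""
    entity_id: str
    label: str             # e.g. "car"; assumed lowercase
    attributes: frozenset  # e.g. frozenset({"red"}); assumed lowercase


def resolve_coreference(utterance, concepts):
    """Return the concepts that best match the words of the utterance.

    Assumption: a toy heuristic that counts how many utterance tokens
    equal a concept's label or one of its attributes. The patent text
    does not prescribe this scoring.
    """
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    scored = []
    for concept in concepts:
        vocabulary = {concept.label} | set(concept.attributes)
        scored.append((len(tokens & vocabulary), concept))
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []  # nothing in view matches the spoken reference
    # Ties are kept so the caller can ask the user to disambiguate.
    return [concept for score, concept in scored if score == best]


scene = [
    VisualConcept("e1", "car", frozenset({"red"})),
    VisualConcept("e2", "car", frozenset({"blue"})),
    VisualConcept("e3", "dog", frozenset({"brown"})),
]
resolved = resolve_coreference("How much is that red car?", scene)
print([c.entity_id for c in resolved])
```

In this toy scene the phrase "that red car" matches both the label and the attribute of `e1`, so only that entity is returned. A bare "that" with no attribute words matches nothing, which is where a real system would need gaze, pointing, or dialog context.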
Opening claim text (preview).
What is claimed is:
1. A method comprising, by a client system: receiving, at the client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolving, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and presenting, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
2. The method of claim 1, wherein the speech input comprises one or more audio clips, and wherein the visual input comprises one or more of an image or a video clip.
3. The method of claim 1, further comprising: determining, based on the visual input, the one or more visual concepts and the one or more attributes associated with the one or more visual concepts.
4. The method of claim 3, wherein determining the one or more visual concepts and the one or more attributes is based on one or more machine-learning models comprising one or more of: a support vector machine; a regression model; or a convolutional neural network.
5. The method of claim 1, wherein the one or more visual concepts comprise one or more of a person, a location, a business, or an object.
6. The method of claim 5, wherein the one or more visual concepts comprise one or more people, and wherein the method further comprises: determining the one or more people based on facial recognition.
7. The method of claim 5, wherein the one or more visual concepts comprise one or more objects, and wherein the method further comprises: determining the one or more objects based on object detection.
8. The method of claim 1, further comprising generating a feature representation for the visual input.
9. The method of claim 1, further comprising identifying one or more intents and one or more slots based on one or more of the speech input or the visual input.
10. The method of claim 9, further comprising executing the one or more tasks based on the identified intents and slots.
11. The method of claim 1, wherein the communication content comprises one or more of: a character string; an audio clip; an image; or a video clip.
12. The method of claim 1, further comprising determining one or more modalities for the communication content.
13. The method of claim 12, wherein determining the one or more modalities for the communication content comprises: identifying contextual information associated with a user of the client system; identifying contextual information associated with the client system; and determining the one or more modalities based on the contextual information associated with the user and the contextual information associated with the client system.
14. The method of claim 1, further comprising: generating a plurality of tasks based on the visual input; and receiving, at the client system, a user selection of the one or more tasks from the plurality of tasks by a user of the client system.
15. The method of claim 1, further comprising storing the one or more visual concepts in a dialog state.
16. The method of claim 1, further comprising: generating, based on the visual input, a media content object portraying the one or more visual concepts; receiving, at the client system, a user interaction with the media content object; and executing one or more additional tasks responsive to the user interaction with the media content object.
17. One or more computer-readable non-transitory non-volatile storage media embodying software that is operable when executed by a client system to: receive, at a client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
18. A client system comprising: one or more processors; and a non-transitory non-volatile memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, at a client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
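Claims 12 and 13 describe choosing output modalities from two kinds of context: one about the user and one about the client system. The claims do not list which context signals are used or how they are combined, so the following is only a minimal sketch under assumed signals. The classes `UserContext` and `DeviceContext`, their fields, and the rules in `choose_modalities` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    """Assumed user-side signals; not taken from the claims."""
    is_driving: bool = False
    in_quiet_place: bool = False


@dataclass
class DeviceContext:
    """Assumed device-side capabilities; not taken from the claims."""
    has_display: bool = True
    has_speaker: bool = True


def choose_modalities(user, device):
    """Pick output modalities from user context and device context.

    Assumed rules: show text only if there is a display and the user
    is not driving; play audio only if there is a speaker and the user
    is not somewhere quiet.
    """
    modalities = []
    if device.has_display and not user.is_driving:
        modalities.append("text")
    if device.has_speaker and not user.in_quiet_place:
        modalities.append("audio")
    return modalities


print(choose_modalities(UserContext(is_driving=True), DeviceContext()))
print(choose_modalities(UserContext(in_quiet_place=True), DeviceContext()))
```

The function returns an empty list when every rule excludes every modality. A real system would need a fallback policy for that case, which the claims leave open.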
Business processes related to social networking or social networking services · CPC title
Commands or executable codes · CPC title
Channels characterised by the type of signal · CPC title
Protecting personal data, e.g. for financial or medical purposes · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title