System and method for controlling multidirectional operation of an elevator
US-2024425322-A1 · Dec 26, 2024 · US
US12236192B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12236192-B2 |
| Application number | US-202117339759-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 4, 2021 |
| Priority date | Jan 8, 2021 |
| Publication date | Feb 25, 2025 |
| Grant date | Feb 25, 2025 |
Official abstract text for this publication.
A system and method for generating task-specific text by processing multimodal inputs using machine-learning models is provided. The method may include accessing first sets of tokens associated with a desired task and one or more modalities associated with a context of the desired task. The method may further include determining a second set of tokens for each of the one or more modalities using a classifier network associated with the modality. The method may further include generating a number of embedding vectors by mapping the first sets of tokens and the second set of tokens associated with each of the one or more modalities to an embedding space. The method may further include producing a sequence of words addressing the desired task by processing the number of embedding vectors with an encoder-decoder network.
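The data flow in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the vocabulary sizes, the embedding dimension `D`, the logit values, the helper names, and the use of a simple top-k pick in place of the sampling step that the claims describe. The encoder-decoder network itself is not modeled; the sketch stops at the sequence of embedding vectors it would consume.

```python
import math
import random

D = 4  # embedding dimension (assumed for illustration)

def softmax(logits):
    """Turn classifier logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_tokens(logits, k):
    """Hypothetical tokenizer: keep the k most probable category indices."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def make_table(vocab_size, dim, rng):
    """Random embedding table; a trained system would learn these rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(tokens, table):
    """Map each token to its row in the modality's embedding table."""
    return [table[t] for t in tokens]

rng = random.Random(0)
text_table = make_table(10, D, rng)   # table for the task tokens
video_table = make_table(6, D, rng)   # one separate table per modality
audio_table = make_table(5, D, rng)

# First set of tokens: the tokenized task description (toy ids).
task_tokens = [1, 7, 3]

# Stand-ins for the outputs of the per-modality classifier networks.
video_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0]
audio_logits = [1.2, -0.3, 0.4, 2.2, 0.0]

# Second set of tokens: one set per modality.
video_tokens = top_k_tokens(video_logits, 2)
audio_tokens = top_k_tokens(audio_logits, 2)

# All tokens land in one shared D-dimensional space; this sequence is what
# an encoder-decoder network would process to produce the output words.
sequence = (
    embed(task_tokens, text_table)
    + embed(video_tokens, video_table)
    + embed(audio_tokens, audio_table)
)
print(video_tokens, audio_tokens, len(sequence))
```

Keeping one table per modality mirrors the statement that each modality has its own classifier network, while the shared dimension `D` is what lets the vectors be concatenated into a single input sequence.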
Opening claim text (preview).
What is claimed is:
1. A method comprising, by a computing device: accessing at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determining, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generating a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and producing a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
2. The method of claim 1, wherein the desired task comprises a caption describing an event, an answer to a given question related to the event, a question relative to the event, or a context-aware dialog.
3. The method of claim 1, wherein the one or more modalities comprise video sensor data, audio sensor data, Inertial Measurement Unit (IMU) sensor data, or light detection and ranging (lidar) sensor data.
4. The method of claim 1, wherein the plurality of embedding vectors comprise a pre-determined relative position among the plurality of embedding vectors.
5. The method of claim 1, wherein an encoder of the encoder-decoder network generates a latent representation by processing the plurality of embedding vectors.
6. The method of claim 5, wherein a decoder of the encoder-decoder network produces a word at a time by processing the latent representation, wherein the produced word is selected from a word dictionary based on a probability associated with one or more words in the word dictionary.
7. The method of claim 6, wherein the decoder takes a k-th produced word as input for producing a (k+1)-st word in the sequence of words.
8. The method of claim 1, wherein determining a set of tokens for one or more modalities comprises sampling one or more categories corresponding to a modality among a plurality of categories based on a probability distribution associated with the plurality of categories generated by processing the at least one modality with the classifier network associated with the at least one modality.
9. The method of claim 8, wherein the sampling comprises a categorical reparameterization with Gumbel-Softmax or a differentiable approximation of tokenization.
10. The method of claim 8, wherein sampling one or more categories corresponding to the modality among the plurality of categories based on a probability distribution for the plurality of categories is performed by a differentiable tokenization unit.
11. The method of claim 10, wherein mapping a set of tokens belonging to a modality to one or more embedding vectors in a d-dimensional embedding space comprises looking up an embedding table corresponding to the at least one modality.
12. The method of claim 11, wherein a first embedding table corresponding to a first modality is different from a second embedding table corresponding to a second modality.
13. The method of claim 11, wherein, at a beginning of a training procedure, the classifier network associated with embedding tables, of the at least one modality, corresponding to the at least one first set of tokens and the second set of tokens, and the encoder-decoder network are initialized with pre-trained models.
14. The method of claim 13, wherein, during the training procedure, the classifier network, the embedding tables, and the encoder-decoder network are updated through backward propagations.
15. The method of claim 14, wherein, during the backward propagations, a gradient for the plurality of categories for the one or more modalities is estimated using a Straight-Through Estimator.
16. The method of claim 14, wherein, a loss is calculated based on a comparison between a ground-truth sequence of words addressing the desired task with a sequence of words generated by the decoder of the encoder-decoder network.
17. The method of claim 16, wherein a partial loss for a k-th word is calculated based on a comparison between a k-th word in the ground-truth sequence of words and a k-th generated word in an instance in which a sub-sequence of words from a first word to a (k−1)-st word in the ground truth sequence of words is provided to the decoder as input.
18. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determine, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generate a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and produce a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
19. The media of claim 18, wherein the desired task comprises a caption describing an event, an answer to a given question related to the event, a question relative to the event, or a context-aware dialog.
20. A system comprising: one or more processors; and a non-transitory memory coupled to the one or more processors comprising instructions executable by the one or more processors, the one or more processors operable when executing the instructions to: access at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determine, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generate a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and produce a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
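Two of the training-related ideas named in the claims, Gumbel-Softmax sampling of categories and a per-word loss computed with ground-truth words fed to the decoder, can be sketched in a few lines. This is an illustrative sketch only, not the patented implementation: the function names, the temperature `tau`, and the toy probabilities are assumptions, and the code is plain Python with no automatic differentiation, so the Straight-Through Estimator is described in a comment rather than executed.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gumbel_softmax_sample(probs, tau, rng):
    """Draw one relaxed categorical sample from `probs`.

    Returns (hard, soft): `soft` is the Gumbel-Softmax relaxation and
    `hard` is the one-hot vector at its argmax. With an autograd
    framework, a straight-through estimator would use `hard` in the
    forward pass and the gradient of `soft` in the backward pass,
    commonly written as hard + soft - stop_gradient(soft).
    """
    noisy = []
    for p in probs:
        # Clamp to avoid log(0) at the boundaries.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))          # Gumbel(0, 1) noise
        noisy.append((math.log(max(p, 1e-12)) + g) / tau)
    soft = softmax(noisy)
    k = max(range(len(soft)), key=lambda i: soft[i])
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft

def teacher_forced_loss(step_probs, target_ids):
    """Sum of per-word negative log-likelihoods.

    step_probs[k] is assumed to be the decoder's distribution over the
    word dictionary at step k, computed with ground-truth words
    0..k-1 supplied as input. Each term is the partial loss for word k.
    """
    return sum(-math.log(step_probs[k][w]) for k, w in enumerate(target_ids))

rng = random.Random(0)
hard, soft = gumbel_softmax_sample([0.7, 0.2, 0.1], tau=0.5, rng=rng)

# Toy two-step example over a two-word dictionary.
loss = teacher_forced_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(hard, [round(s, 3) for s in soft], round(loss, 4))
```

Lower values of `tau` push `soft` closer to a one-hot vector, at the cost of higher-variance gradients; higher values give smoother but less discrete samples.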
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Semantic analysis · CPC title