Systems and methods for vision-language distribution alignment
US-2023162490-A1 · May 25, 2023 · US
US12299953B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12299953-B2 |
| Application number | US-202218716409-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 28, 2022 |
| Priority date | Apr 19, 2022 |
| Publication date | May 13, 2025 |
| Grant date | May 13, 2025 |
Official abstract text for this publication.
The present application relates to the technical field of artificial intelligence, and discloses a visual positioning method and apparatus, a device, and a medium. The method includes: performing feature splicing on an image encoding feature and a text encoding feature; performing feature fusion on spliced encoding features to obtain a first fused encoding feature; performing noise correction on the first fused encoding feature and the text encoding feature on the basis of a preset cross-attention mechanism to obtain a corrected fused feature and a corrected text encoding feature, and performing feature fusion on the spliced encoding feature and the corrected text encoding feature to obtain a second fused encoding feature; and correcting a preset frame feature using a target encoding feature on the basis of the corrected fused feature and the second fused encoding feature to predict a regional position coordinate of a target visual object.
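
The first two steps of the abstract, feature splicing followed by fusion, can be illustrated with a small NumPy sketch. It assumes that splicing means concatenating image and text token features along the sequence axis and that fusion is a single-head self-attention pass. The token counts, feature width, random weights, and function names are hypothetical and are not taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over one token sequence.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8                                    # hypothetical feature width
image_feat = rng.normal(size=(16, d))    # 16 hypothetical image tokens
text_feat = rng.normal(size=(5, d))      # 5 hypothetical text tokens

# "Feature splicing", assumed here to be concatenation along the token axis.
spliced = np.concatenate([image_feat, text_feat], axis=0)

# "Feature fusion", assumed here to be one self-attention pass in which
# every image token can attend to every text token and vice versa.
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
first_fused = self_attention(spliced, w_q, w_k, w_v)

print(spliced.shape, first_fused.shape)
```

Because the image and text tokens share one sequence, a single attention pass mixes information across both modalities, which is one plausible reading of "fusion" in the abstract.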
Opening claim text (preview).
What is claimed is:

1. A visual positioning method, comprising: encoding a target image and a target text, and performing feature splicing on an image encoding feature and a text encoding feature obtained after the encoding to obtain spliced encoding features; performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature; performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit to obtain a corrected fused feature and a corrected text encoding feature, the preset error correction unit being a unit constructed on the basis of the preset self-attention mechanism and a preset cross-attention mechanism; inputting the spliced encoding feature and the corrected text encoding feature into the preset image-text feature matching unit to obtain a second fused encoding feature; and correcting a preset frame feature through a preset target frame correction unit using a target encoding feature determined on the basis of the corrected fused feature and the second fused encoding feature, and predicting a regional position coordinate of a target visual object on the target image using a corrected frame feature, wherein the preset target frame correction unit is a unit constructed on the basis of the preset self-attention mechanism and the preset cross-attention mechanism.

2. The visual positioning method according to claim 1, wherein before the performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism, the method further comprises: constructing an image-text feature matching sub-unit using a self-attention operation unit, a layer normalization unit, a feature deletion unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; and concatenating a first preset number of the image-text feature matching sub-units successively to construct and obtain the preset image-text feature matching unit; wherein the performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature comprises: taking the first image-text feature matching sub-unit in the preset image-text feature matching unit as a current image-text feature matching sub-unit, and taking the spliced encoding feature as a feature to be processed; inputting the feature to be processed into the current image-text feature matching sub-unit; performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature matching sub-unit to obtain a corresponding current operation processing result; and obtaining the first fused encoding feature according to the current operation processing result.

3. The visual positioning method according to claim 2, wherein the obtaining the first fused encoding feature according to the current operation processing result comprises: determining whether the current image-text feature matching sub-unit is the last one; in response to the current image-text feature matching sub-unit not being the last one, updating the current image-text feature matching sub-unit to a next image-text feature matching sub-unit, updating the feature to be processed to the current operation processing result, and returning to perform the inputting the feature to be processed into the current image-text feature matching sub-unit; and in response to the current image-text feature matching sub-unit being the last one, taking the current operation processing result as the first fused encoding feature.

4. The visual positioning method according to claim 2, wherein the performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature matching sub-unit to obtain a corresponding current operation processing result comprises: performing the self-attention operation on the feature to be processed using the self-attention operation unit in the current image-text feature matching sub-unit to obtain a first operation feature; performing layer normalization on the first operation feature using the layer normalization unit in the current image-text feature matching sub-unit to obtain a second operation feature; performing the feature deletion operation on the second operation feature using the feature deletion unit in the current image-text feature matching sub-unit according to a preset proportion to obtain a third operation feature; and performing the feature addition operation on the third operation feature and the feature to be processed using the feature addition unit in the current image-text feature matching sub-unit to obtain the operation processing result in the current image-text feature matching sub-unit.

5. The visual positioning method according to claim 1, wherein before the performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit, the method further comprises: constructing a first error correction sub-unit using a self-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; constructing a second error correction sub-unit using a cross-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset cross-attention mechanism; and concatenating the first error correction sub-unit and a second preset number of the second error correction sub-units successively to construct and obtain the preset error correction unit.

6. The visual positioning method according to claim 5, wherein the performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit to obtain a corrected fused feature and a corrected text encoding feature comprises: inputting the first fused encoding feature and the text encoding feature into the first error correction sub-unit in the preset error correction unit, so as to perform a self-attention operation, a feature deletion operation, a layer normalization operation, and a feature addition operation on both of the first fused encoding feature and the text encoding feature to obtain first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; taking the first second error correction sub-unit in the preset error correction unit as a current second error correction sub-unit, and taking the first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature as current features to be processed; inputting the feature to be processed into the current second error correction sub-unit; performing a cross-attention operation, the feature deletion operation, the layer normalization operation, and the feature addition operation successively on the feature to be processed using the current second error correction sub-unit to obtain current second operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; and obtaining the corrected tex
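
The sub-unit structure recited in claims 2 to 5 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patented implementation: "feature deletion" is assumed to be dropout-style zeroing of a preset proportion of entries, attention is single-head with random weights, and the cross-attention direction (fused feature as query, text feature as key and value) is a guess, since the claim preview is truncated. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(query_src, key_value_src, w_q, w_k, w_v):
    # Scaled dot-product attention; it is self-attention when both sources match.
    q, k, v = query_src @ w_q, key_value_src @ w_k, key_value_src @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feature_deletion(x, proportion, rng):
    # Assumption: dropout-style masking that zeroes roughly `proportion` of entries.
    keep = rng.random(x.shape) >= proportion
    return x * keep

def matching_sub_unit(x, weights, proportion, rng):
    # Order from claim 4: self-attention, layer norm, deletion, addition.
    first = attention(x, x, *weights)
    second = layer_norm(first)
    third = feature_deletion(second, proportion, rng)
    return third + x  # "feature addition" with the feature to be processed

def matching_unit(x, all_weights, proportion, rng):
    # Loop from claim 3: each sub-unit's result feeds the next one.
    for weights in all_weights:
        x = matching_sub_unit(x, weights, proportion, rng)
    return x

def cross_correction_sub_unit(x, context, weights, proportion, rng):
    # Order from claim 5: cross-attention, deletion, layer norm, addition.
    first = attention(x, context, *weights)
    second = feature_deletion(first, proportion, rng)
    third = layer_norm(second)
    return third + x

rng = np.random.default_rng(0)
d = 8  # hypothetical feature width

def make_weights():
    return tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))

spliced = rng.normal(size=(21, d))   # stand-in for spliced encoding features
text_feat = rng.normal(size=(5, d))  # stand-in for the text encoding feature

first_fused = matching_unit(spliced, [make_weights() for _ in range(3)], 0.1, rng)
corrected_fused = cross_correction_sub_unit(first_fused, text_feat, make_weights(), 0.1, rng)
print(first_fused.shape, corrected_fused.shape)
```

The two sub-unit types differ only in the attention source and in whether deletion comes before or after layer normalization. In both, the final addition keeps the input feature available to later stages.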
using neural networks · CPC title
of extracted features · CPC title
Data preparation, e.g. statistical preprocessing of image or video features · CPC title
Training; Learning · CPC title