Image object recognition method, apparatus, and computer device
US-2020334504-A1 · Oct 22, 2020 · US
US2024428555A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2024428555-A1 |
| Application number | US-202218716409-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 28, 2022 |
| Priority date | Apr 19, 2022 |
| Publication date | Dec 26, 2024 |
| Grant date | — |
Official abstract text for this publication.
The present application relates to the technical field of artificial intelligence, and discloses a visual positioning method and apparatus, a device, and a medium. The method includes: performing feature splicing on an image encoding feature and a text encoding feature; performing feature fusion on spliced encoding features to obtain a first fused encoding feature; performing noise correction on the first fused encoding feature and the text encoding feature on the basis of a preset cross-attention mechanism to obtain a corrected fused feature and a corrected text encoding feature, and performing feature fusion on the spliced encoding feature and the corrected text encoding feature to obtain a second fused encoding feature; and correcting a preset frame feature using a target encoding feature on the basis of the corrected fused feature and the second fused encoding feature to predict a regional position coordinate of a target visual object.
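The first two steps of the abstract (feature splicing, then fusion of the spliced features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: splicing is assumed to mean concatenation along the token axis, the attention is single-head with identity query/key/value projections, and the toy feature values and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def splice(image_feats, text_feats):
    """Feature splicing, assumed here to be concatenation along the token axis."""
    return image_feats + text_feats


def self_attention(tokens):
    """Scaled dot-product self-attention (assumption: no learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the spliced sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, so image and text tokens mix.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# Hypothetical toy features: 3 image tokens and 2 text tokens, dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[0.5, 0.5], [1.0, -1.0]]

spliced = splice(image_feats, text_feats)
first_fused = self_attention(spliced)
```

Because every output token is a weighted average over both image and text tokens, each position in `first_fused` carries information from both modalities, which is the point of fusing after splicing.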
Opening claim text (preview).
1. A visual positioning method, comprising: encoding a target image and a target text, and performing feature splicing on an image encoding feature and a text encoding feature obtained after the encoding to obtain spliced encoding features; performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature; performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit to obtain a corrected fused feature and a corrected text encoding feature, the preset noise correction unit being a unit constructed on the basis of the preset self-attention mechanism and a preset cross-attention mechanism; inputting the spliced encoding feature and the corrected text encoding feature into the preset image-text feature fusion unit to obtain a second fused encoding feature; and correcting a preset frame feature through a preset target frame correction unit using a target encoding feature determined on the basis of the corrected fused feature and the second fused encoding feature, and predicting a regional position coordinate of a target visual object on the target image using a corrected frame feature, wherein the preset target frame correction unit is a unit constructed on the basis of the preset self-attention mechanism and the preset cross-attention mechanism.

2. The visual positioning method according to claim 1, wherein before the performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism, the method further comprises: constructing an image-text feature fusion sub-unit using a self-attention operation unit, a layer normalization unit, a feature deletion unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; and concatenating a first preset number of the image-text feature fusion sub-units successively to construct and obtain the preset image-text feature fusion unit; wherein the performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature comprises: taking the first image-text feature fusion sub-unit in the preset image-text feature fusion unit as a current image-text feature fusion sub-unit, and taking the spliced encoding feature as a feature to be processed; inputting the feature to be processed into the current image-text feature fusion sub-unit; performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature fusion sub-unit to obtain a corresponding current operation processing result; and obtaining the first fused encoding feature according to the current operation processing result.

3. The visual positioning method according to claim 2, wherein the obtaining the first fused encoding feature according to the current operation processing result comprises: determining whether the current image-text feature fusion sub-unit is the last one; in response to the current image-text feature fusion sub-unit not being the last one, updating the current image-text feature fusion sub-unit to a next image-text feature fusion sub-unit, updating the feature to be processed to the current operation processing result, and returning to perform the inputting the feature to be processed into the current image-text feature fusion sub-unit; and in response to the current image-text feature fusion sub-unit being the last one, taking the current operation processing result as the first fused encoding feature.

4. The visual positioning method according to claim 2, wherein the performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature fusion sub-unit to obtain a corresponding current operation processing result comprises: performing the self-attention operation on the feature to be processed using the self-attention operation unit in the current image-text feature fusion sub-unit to obtain a first operation feature; performing layer normalization on the first operation feature using the layer normalization unit in the current image-text feature fusion sub-unit to obtain a second operation feature; performing the feature deletion operation on the second operation feature using the feature deletion unit in the current image-text feature fusion sub-unit according to a preset proportion to obtain a third operation feature; and performing the feature addition operation on the third operation feature and the feature to be processed using the feature addition unit in the current image-text feature fusion sub-unit to obtain the operation processing result in the current image-text feature fusion sub-unit.

5. The visual positioning method according to claim 1, wherein before the performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit, the method further comprises: constructing a first noise correction sub-unit using a self-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; constructing a second noise correction sub-unit using a cross-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset cross-attention mechanism; and concatenating the first noise correction sub-unit and a second preset number of the second noise correction sub-units successively to construct and obtain the preset noise correction unit.

6. The visual positioning method according to claim 5, wherein the performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit to obtain a corrected fused feature and a corrected text encoding feature comprises: inputting the first fused encoding feature and the text encoding feature into the first noise correction sub-unit in the preset noise correction unit, so as to perform a self-attention operation, a feature deletion operation, a layer normalization operation, and a feature addition operation on both of the first fused encoding feature and the text encoding feature to obtain first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; taking the first second noise correction sub-unit in the preset noise correction unit as a current second noise correction sub-unit, and taking the first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature as current features to be processed; inputting the feature to be processed into the current second noise correction sub-unit; performing a cross-attention operation, the feature deletion operation, the layer normalization operation, and the feature addition operation successively on the feature to be processed using the current second noise correction sub-unit to obtain current second operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; and obtaining the corrected text encoding feature according to the current second operation processing
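The sub-unit structure recited in claims 2 to 6 can be illustrated with a short sketch. It rests on several assumptions that the claim text does not confirm: the "feature deletion operation" at a preset proportion is read as dropout (with survivors rescaled), the "feature addition operation" as a residual sum, attention is single-head without learned projections, and in the cross-attention sub-unit the feature being corrected acts as the query while the other feature supplies keys and values. All names and values are hypothetical. Note that the two sub-unit types apply their operations in different orders, as the claims state.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context):
    """Scaled dot-product attention; self-attention when queries is context."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def layer_norm(tokens, eps=1e-5):
    """Normalize each token to zero mean and unit variance (no learned scale)."""
    out = []
    for row in tokens:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def feature_deletion(tokens, proportion, rng):
    """Assumed to be dropout: zero each value with probability `proportion`."""
    if not 0.0 <= proportion < 1.0:
        raise ValueError("proportion must be in [0, 1)")
    scale = 1.0 / (1.0 - proportion)
    return [[0.0 if rng.random() < proportion else v * scale for v in row]
            for row in tokens]


def feature_addition(a, b):
    """Assumed to be a residual connection: element-wise sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fusion_sub_unit(x, proportion, rng):
    """Claim 4 order: self-attention, layer norm, deletion, addition."""
    h = attention(x, x)
    h = layer_norm(h)
    h = feature_deletion(h, proportion, rng)
    return feature_addition(h, x)


def fusion_unit(x, num_sub_units, proportion, rng):
    """Claim 3 loop: each sub-unit's result is the next sub-unit's input."""
    for _ in range(num_sub_units):
        x = fusion_sub_unit(x, proportion, rng)
    return x


def cross_attention_sub_unit(x, context, proportion, rng):
    """Claim 5 order: cross-attention, deletion, layer norm, addition."""
    h = attention(x, context)
    h = feature_deletion(h, proportion, rng)
    h = layer_norm(h)
    return feature_addition(h, x)


rng = random.Random(0)
spliced = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0], [2.0, 0.0, 1.0]]  # toy tokens
text = [[0.5, 0.5, 0.0], [1.0, 0.0, -1.0]]                      # toy text tokens
first_fused = fusion_unit(spliced, num_sub_units=2, proportion=0.1, rng=rng)
corrected_text = cross_attention_sub_unit(text, first_fused, proportion=0.1, rng=rng)
```

The residual sum keeps each sub-unit's output the same shape as its input, which is what lets a preset number of sub-units be chained one after another.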
of extracted features · CPC title
Data preparation, e.g. statistical preprocessing of image or video features · CPC title
using neural networks · CPC title
of extracted features · CPC title
Training; Learning · CPC title