Who is the assignee on this patent?

Inspur Suzhou Intelligent Technology Co Ltd, Suzhou Metabrain Intelligent Technology Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06V10/72. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 26, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Visual positioning method and apparatus, device, and medium

US2024428555A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2024428555-A1
Application number	US-202218716409-A
Country	US
Kind code	A1
Filing date	Sep 28, 2022
Priority date	Apr 19, 2022
Publication date	Dec 26, 2024
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present application relates to the technical field of artificial intelligence, and discloses a visual positioning method and apparatus, a device, and a medium. The method includes: performing feature splicing on an image encoding feature and a text encoding feature; performing feature fusion on spliced encoding features to obtain a first fused encoding feature; performing noise correction on the first fused encoding feature and the text encoding feature on the basis of a preset cross-attention mechanism to obtain a corrected fused feature and a corrected text encoding feature, and performing feature fusion on the spliced encoding feature and the corrected text encoding feature to obtain a second fused encoding feature; and correcting a preset frame feature using a target encoding feature on the basis of the corrected fused feature and the second fused encoding feature to predict a regional position coordinate of a target visual object.
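The data flow in the abstract can be sketched as follows. This is a minimal illustration, not the applicant's implementation: every function and parameter name here is hypothetical, the units are passed in as plain callables, splicing is assumed to be concatenation along the token axis, and the way the corrected text feature enters the second fusion pass (appended to the spliced feature) is an assumption, since the page does not say how the two inputs are combined.

```python
from typing import Callable, List, Tuple

Feature = List[List[float]]  # one vector per image patch, text token, or box query


def visual_positioning(
    image_feat: Feature,
    text_feat: Feature,
    box_feat: Feature,
    fuse: Callable[[Feature], Feature],
    correct_noise: Callable[[Feature, Feature], Tuple[Feature, Feature]],
    select_target: Callable[[Feature, Feature], Feature],
    correct_box: Callable[[Feature, Feature], Feature],
    predict_coords: Callable[[Feature], List[float]],
) -> List[float]:
    # Step 1: feature splicing (assumed: concatenation along the token axis).
    spliced = image_feat + text_feat
    # Step 2: first fusion pass over the spliced features.
    first_fused = fuse(spliced)
    # Step 3: noise correction returns a corrected fused feature
    # and a corrected text feature.
    corrected_fused, corrected_text = correct_noise(first_fused, text_feat)
    # Step 4: second fusion pass (assumed: corrected text appended to spliced).
    second_fused = fuse(spliced + corrected_text)
    # Step 5: derive a target feature, correct the preset box (frame) feature
    # with it, and predict the region coordinates.
    target = select_target(corrected_fused, second_fused)
    corrected_box = correct_box(box_feat, target)
    return predict_coords(corrected_box)


# Usage with trivial stand-ins, only to show the order of the calls.
coords = visual_positioning(
    image_feat=[[0.1, 0.2], [0.3, 0.4]],
    text_feat=[[0.5, 0.6]],
    box_feat=[[0.0, 0.0, 1.0, 1.0]],
    fuse=lambda feature: feature,
    correct_noise=lambda fused, text: (fused, text),
    select_target=lambda corrected, second: corrected,
    correct_box=lambda box, target: box,
    predict_coords=lambda box: box[0],
)
```

Passing the units in as callables keeps the sketch neutral about their internals, which the abstract leaves to the claims and the full description.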

First claim

Opening claim text (preview).

1. A visual positioning method, comprising: encoding a target image and a target text, and performing feature splicing on an image encoding feature and a text encoding feature obtained after the encoding to obtain spliced encoding features; performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature; performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit to obtain a corrected fused feature and a corrected text encoding feature, the preset noise correction unit being a unit constructed on the basis of the preset self-attention mechanism and a preset cross-attention mechanism; inputting the spliced encoding feature and the corrected text encoding feature into the preset image-text feature fusion unit to obtain a second fused encoding feature; and correcting a preset frame feature through a preset target frame correction unit using a target encoding feature determined on the basis of the corrected fused feature and the second fused encoding feature, and predicting a regional position coordinate of a target visual object on the target image using a corrected frame feature, wherein the preset target frame correction unit is a unit constructed on the basis of the preset self-attention mechanism and the preset cross-attention mechanism.

2. The visual positioning method according to claim 1, wherein before the performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism, the method further comprises: constructing an image-text feature fusion sub-unit using a self-attention operation unit, a layer normalization unit, a feature deletion unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; and concatenating a first preset number of the image-text feature fusion sub-units successively to construct and obtain the preset image-text feature fusion unit; wherein the performing image-text feature fusion on the spliced encoding features using a preset image-text feature fusion unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature comprises: taking the first image-text feature fusion sub-unit in the preset image-text feature fusion unit as a current image-text feature fusion sub-unit, and taking the spliced encoding feature as a feature to be processed; inputting the feature to be processed into the current image-text feature fusion sub-unit; performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature fusion sub-unit to obtain a corresponding current operation processing result; and obtaining the first fused encoding feature according to the current operation processing result.

3. The visual positioning method according to claim 2, wherein the obtaining the first fused encoding feature according to the current operation processing result comprises: determining whether the current image-text feature fusion sub-unit is the last one; in response to the current image-text feature fusion sub-unit not being the last one, updating the current image-text feature fusion sub-unit to a next image-text feature fusion sub-unit, updating the feature to be processed to the current operation processing result, and returning to perform the inputting the feature to be processed into the current image-text feature fusion sub-unit; and in response to the current image-text feature fusion sub-unit being the last one, taking the current operation processing result as the first fused encoding feature.

4. The visual positioning method according to claim 2, wherein the performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature fusion sub-unit to obtain a corresponding current operation processing result comprises: performing the self-attention operation on the feature to be processed using the self-attention operation unit in the current image-text feature fusion sub-unit to obtain a first operation feature; performing layer normalization on the first operation feature using the layer normalization unit in the current image-text feature fusion sub-unit to obtain a second operation feature; performing the feature deletion operation on the second operation feature using the feature deletion unit in the current image-text feature fusion sub-unit according to a preset proportion to obtain a third operation feature; and performing the feature addition operation on the third operation feature and the feature to be processed using the feature addition unit in the current image-text feature fusion sub-unit to obtain the operation processing result in the current image-text feature fusion sub-unit.

5. The visual positioning method according to claim 1, wherein before the performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit, the method further comprises: constructing a first noise correction sub-unit using a self-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; constructing a second noise correction sub-unit using a cross-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset cross-attention mechanism; and concatenating the first noise correction sub-unit and a second preset number of the second noise correction sub-units successively to construct and obtain the preset noise correction unit.

6. The visual positioning method according to claim 5, wherein the performing image-text noise correction on the first fused encoding feature and the text encoding feature using a preset noise correction unit to obtain a corrected fused feature and a corrected text encoding feature comprises: inputting the first fused encoding feature and the text encoding feature into the first noise correction sub-unit in the preset noise correction unit, so as to perform a self-attention operation, a feature deletion operation, a layer normalization operation, and a feature addition operation on both of the first fused encoding feature and the text encoding feature to obtain first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; taking the first second noise correction sub-unit in the preset noise correction unit as a current second noise correction sub-unit, and taking the first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature as current features to be processed; inputting the feature to be processed into the current second noise correction sub-unit; performing a cross-attention operation, the feature deletion operation, the layer normalization operation, and the feature addition operation successively on the feature to be processed using the current second noise correction sub-unit to obtain current second operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; and obtaining the corrected text encoding feature according to the current second operation processing
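The fusion sub-unit recited in claims 2 to 4 (self-attention, then layer normalization, then feature deletion at a preset proportion, then addition with the sub-unit's input) and its stacking into a fusion unit can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the claimed implementation: attention here is single-head with no learned projections, layer normalization has no learned scale or bias, and "feature deletion" is read as dropout-style random zeroing with rescaling of the surviving values. All function names are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # one row per token of the spliced feature


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with identity projections (assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each row to zero mean and (near) unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def feature_deletion(x: Matrix, proportion: float, rng: random.Random) -> Matrix:
    """Zero each value with probability `proportion` (dropout-style; assumption)."""
    if proportion <= 0.0:
        return [row[:] for row in x]
    keep = 1.0 - proportion
    return [[v / keep if rng.random() < keep else 0.0 for v in row] for row in x]


def fusion_sub_unit(x: Matrix, proportion: float, rng: random.Random) -> Matrix:
    first = self_attention(x)                        # first operation feature
    second = layer_norm(first)                       # second operation feature
    third = feature_deletion(second, proportion, rng)  # third operation feature
    # Feature addition: add the third operation feature to the sub-unit's input.
    return [[a + b for a, b in zip(r3, rx)] for r3, rx in zip(third, x)]


def fusion_unit(spliced: Matrix, num_sub_units: int,
                proportion: float = 0.1, seed: int = 0) -> Matrix:
    """Chain sub-units: each one's result becomes the next one's input (claim 3)."""
    rng = random.Random(seed)
    feature = spliced
    for _ in range(num_sub_units):
        feature = fusion_sub_unit(feature, proportion, rng)
    return feature


spliced = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 1.0, 1.0]]
fused = fusion_unit(spliced, num_sub_units=2, proportion=0.0)
```

The output keeps the input's shape, which is what allows a preset number of sub-units to be chained. The noise correction sub-units of claim 5 reuse the same building blocks in a different order (deletion before normalization), with a cross-attention operation in place of self-attention in the second sub-unit.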

Classifications

Code	CPC title
G06F18/253	of extracted features
G06V10/72 (Primary)	Data preparation, e.g. statistical preprocessing of image or video features
G06V10/82	using neural networks
G06V10/806 (Primary)	of extracted features
G06T2207/20081	Training; Learning

Patent family

Related publications grouped by family.

View patent family 81554650
