Who is the assignee on this patent?

Suzhou Metabrain Intelligent Technology Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06V10/806. Mapped technology areas include Physics.

When was this patent published?

Publication date May 13, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Visual positioning method and apparatus, device, and medium

US12299953B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12299953-B2
Application number	US-202218716409-A
Country	US
Kind code	B2
Filing date	Sep 28, 2022
Priority date	Apr 19, 2022
Publication date	May 13, 2025
Grant date	May 13, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present application relates to the technical field of artificial intelligence, and discloses a visual positioning method and apparatus, a device, and a medium. The method includes: performing feature splicing on an image encoding feature and a text encoding feature; performing feature fusion on spliced encoding features to obtain a first fused encoding feature; performing noise correction on the first fused encoding feature and the text encoding feature on the basis of a preset cross-attention mechanism to obtain a corrected fused feature and a corrected text encoding feature, and performing feature fusion on the spliced encoding feature and the corrected text encoding feature to obtain a second fused encoding feature; and correcting a preset frame feature using a target encoding feature on the basis of the corrected fused feature and the second fused encoding feature to predict a regional position coordinate of a target visual object.
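The data flow in the abstract can be sketched in a few lines of plain Python. This is only an illustrative reading, not the patented implementation: it assumes "feature splicing" means concatenating image and text tokens along the token axis, it uses scaled dot-product attention without learned projections, and it assumes the text tokens act as queries in the cross-attention step. The function names (`splice`, `attention`) and the toy token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention with no learned weights (illustrative only)."""
    width = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(width) for k in keys]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def splice(image_tokens, text_tokens):
    """Assumption: splicing is concatenation along the token axis."""
    return image_tokens + text_tokens


# Toy encodings: 3 image tokens and 2 text tokens, each of width 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

spliced = splice(image_tokens, text_tokens)

# Self-attention over the spliced tokens stands in for the first fusion step.
first_fused = attention(spliced, spliced, spliced)

# Cross-attention: text tokens query the fused tokens (direction is an assumption).
corrected_text = attention(text_tokens, first_fused, first_fused)
```

The sketch stops at the corrected text feature; the second fusion pass and the frame-correction step are not modeled here.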

First claim

Opening claim text (preview).

What is claimed is:

1. A visual positioning method, comprising: encoding a target image and a target text, and performing feature splicing on an image encoding feature and a text encoding feature obtained after the encoding to obtain spliced encoding features; performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature; performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit to obtain a corrected fused feature and a corrected text encoding feature, the preset error correction unit being a unit constructed on the basis of the preset self-attention mechanism and a preset cross-attention mechanism; inputting the spliced encoding feature and the corrected text encoding feature into the preset image-text feature matching unit to obtain a second fused encoding feature; and correcting a preset frame feature through a preset target frame correction unit using a target encoding feature determined on the basis of the corrected fused feature and the second fused encoding feature, and predicting a regional position coordinate of a target visual object on the target image using a corrected frame feature, wherein the preset target frame correction unit is a unit constructed on the basis of the preset self-attention mechanism and the preset cross-attention mechanism.

2. The visual positioning method according to claim 1, wherein before the performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism, the method further comprises: constructing an image-text feature matching sub-unit using a self-attention operation unit, a layer normalization unit, a feature deletion unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; and concatenating a first preset number of the image-text feature matching sub-units successively to construct and obtain the preset image-text feature matching unit; wherein the performing image-text feature matching on the spliced encoding features using a preset image-text feature matching unit constructed on the basis of a preset self-attention mechanism to obtain a first fused encoding feature comprises: taking the first image-text feature matching sub-unit in the preset image-text feature matching unit as a current image-text feature matching-sub-unit, and taking the spliced encoding feature as a feature to be processed; inputting the feature to be processed into the current image-text feature matching sub-unit; performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature matching sub-unit to obtain a corresponding current operation processing result; and obtaining the first fused encoding feature according to the current operation processing result.

3. The visual positioning method according to claim 2, wherein the obtaining the first fused encoding feature according to the current operation processing result comprises: determining whether the current image-text feature matching sub-unit is the last one; in response to the current image-text feature matching sub-unit not being the last one, updating the current image-text feature matching sub-unit to a next image-text feature matching sub-unit, updating the feature to be processed to the current operation processing result, and returning to perform the inputting the feature to be processed into the current image-text feature matching sub-unit; and in response to the current image-text feature matching sub-unit being the last one, taking the current operation processing result as the first fused encoding feature.

4. The visual positioning method according to claim 2, wherein the performing a self-attention operation, a layer normalization operation, a feature deletion operation, and a feature addition operation successively on the feature to be processed using the current image-text feature matching sub-unit to obtain a corresponding current operation processing result comprises: performing the self-attention operation on the feature to be processed using the self-attention operation unit in the current image-text feature matching sub-unit to obtain a first operation feature; performing layer normalization on the first operation feature using the layer normalization unit in the current image-text feature matching sub-unit to obtain a second operation feature; performing the feature deletion operation on the second operation feature using the feature deletion unit in the current image-text feature matching sub-unit according to a preset proportion to obtain a third operation feature; and performing the feature addition operation on the third operation feature and the feature to be processed using the feature addition unit in the current image-text feature matching sub-unit to obtain the operation processing result in the current image-text feature matching sub-unit.

5. The visual positioning method according to claim 1, wherein before the performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit, the method further comprises: constructing a first error correction sub-unit using a self-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset self-attention mechanism; constructing a second error correction sub-unit using a cross-attention operation unit, a feature deletion unit, a layer normalization unit, and a feature addition unit which are constructed on the basis of the preset cross-attention mechanism; and concatenating the first error correction sub-unit and a second preset number of the second error correction sub-units successively to construct and obtain the preset error correction unit.

6. The visual positioning method according to claim 5, wherein the performing image-text error correction on the first fused encoding feature and the text encoding feature using a preset error correction unit to obtain a corrected fused feature and a corrected text encoding feature comprises: inputting the first fused encoding feature and the text encoding feature into the first error correction sub-unit in the preset error correction unit, so as to perform a self-attention operation, a feature deletion operation, a layer normalization operation, and a feature addition operation on both of the first fused encoding feature and the text encoding feature to obtain first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; taking the first second error correction sub-unit in the preset error correction unit as a current second error correction sub-unit, and taking the first operation processing results corresponding to each of the first fused encoding feature and the text encoding feature as current features to be processed; inputting the feature to be processed into the current second error correction sub-unit; performing a cross-attention operation, the feature deletion operation, the layer normalization operation, and the feature addition operation successively on the feature to be processed using the current second error correction sub-unit to obtain current second operation processing results corresponding to each of the first fused encoding feature and the text encoding feature; and obtaining the corrected tex
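The sub-unit described in claims 2 to 4 (self-attention, then layer normalization, then feature deletion, then feature addition, repeated over a preset number of sub-units) can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: "feature deletion" is read as dropout-style zeroing of elements at the preset proportion (without rescaling), "feature addition" is read as a residual add with the sub-unit's input, and the attention has no learned weights. All function names and default values are hypothetical.

```python
import math
import random


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    width = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(width) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(width)])
    return out


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def feature_deletion(x, proportion, rng):
    """Assumption: zero each element with probability `proportion` (no rescaling)."""
    return [[v if rng.random() >= proportion else 0.0 for v in row] for row in x]


def sub_unit(x, proportion, rng):
    first = self_attention(x)                         # first operation feature
    second = layer_norm(first)                        # second operation feature
    third = feature_deletion(second, proportion, rng) # third operation feature
    # Feature addition: add the third operation feature to the sub-unit input.
    return [[a + b for a, b in zip(r3, r0)] for r3, r0 in zip(third, x)]


def matching_unit(spliced, num_sub_units, proportion=0.1, seed=0):
    """Feed each sub-unit's result into the next; the last result is the output."""
    rng = random.Random(seed)
    feature = spliced
    for _ in range(num_sub_units):
        feature = sub_unit(feature, proportion, rng)
    return feature


# Toy spliced feature: 2 tokens of width 3.
first_fused = matching_unit([[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]], num_sub_units=3)
```

The `for` loop plays the role of the "is this the last sub-unit" check in claim 3: each pass updates the feature to be processed, and the result of the final pass is taken as the first fused encoding feature.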

Assignees

Suzhou Metabrain Intelligent Technology Co Ltd

Inventors

Classifications

G06V10/82
using neural networks · CPC title
G06V10/806 · Primary
of extracted features · CPC title
G06F18/253
of extracted features · CPC title
G06V10/72 · Primary
Data preparation, e.g. statistical preprocessing of image or video features · CPC title
G06T2207/20081
Training; Learning · CPC title

Patent family

Related publications grouped by family.

View patent family 81554650

External sources

Related patents

Systems and methods for vision-language distribution alignment

Method and apparatus for real-world cross-modal retrieval problems

Align-to-ground, weakly supervised phrase grounding guided by image-caption alignment

Stacked cross-modal matching
