Self-supervised image depth estimation method based on channel self-attention mechanism

US12482122B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12482122-B2
Application number  US-202318397990-A
Country             US
Kind code           B2
Filing date         Dec 27, 2023
Priority date       Aug 30, 2023
Publication date    Nov 25, 2025
Grant date          Nov 25, 2025

Abstract

Official abstract text for this publication.

Disclosed is a method for constructing a bionic binocular active visual simultaneous localization and mapping (SLAM) system. The method includes: S1, constructing a binocular active visual system, peripheral vision is simulated using a panoramic camera, a panoramic value map is constructed based on the panoramic camera, and a binocular bionic eye gaze control strategy based on the panoramic value map is proposed; S2, designing a SLAM algorithm framework which is suitable for motion of the two bionic eyes, where relative pose optimization of left and right cameras based on a Perspective-n-Point algorithm and depth calculation for binocular matching points based on triangulation are adopted in a SLAM tracking thread; and S3, with the assistance of camera poses estimated by the SLAM system and sparse landmark point coordinates, training a self-supervised depth estimation network, to provide dense depth estimation for a mapping module of the SLAM system, achieving dense mapping.
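The abstract mentions depth calculation for binocular matching points by triangulation but does not say which variant is used. As a minimal sketch, assuming the midpoint method (a stand-in chosen here, not the patent's stated algorithm), the depth of one matched point can be recovered from the two viewing rays once the relative pose of the left and right cameras is known. The function name `triangulate_midpoint` and its parameters are hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Return the 3D point closest to the rays o1 + s*d1 and o2 + t*d2.

    o1, o2: camera centres; d1, d2: viewing-ray directions of the matched
    pixel, all expressed in one common frame (assumed: the primary camera).
    """
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    w = [x - y for x, y in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # zero when the two rays are parallel
    if denom <= eps * a * c:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    # Least-squares ray parameters minimising |o1 + s*d1 - o2 - t*d2|^2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    # Midpoint of the shortest segment joining the two rays.
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Usage: right camera offset 1 unit along x from the left camera.
point = triangulate_midpoint([0.0, 0.0, 0.0], [0.5, 0.2, 4.0],
                             [1.0, 0.0, 0.0], [-0.5, 0.2, 4.0])
depth = point[2]  # z-coordinate in the primary camera frame
```

With noisy matches the two rays no longer intersect, which is why the sketch returns the midpoint of the shortest connecting segment rather than an exact intersection.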

First claim

Opening claim text (preview).

What is claimed is:

1. A self-supervised image depth estimation method based on a channel self-attention mechanism, comprising:
    capturing images using binocular bionic eye cameras, wherein the binocular bionic eye cameras comprise a left bionic eye camera and a right bionic eye camera;
    defining one of the binocular bionic eye cameras as a primary camera, defining a primary camera image as a target image, and defining two frames of primary camera images before and after the target image and an image captured by an other one of the binocular bionic eye cameras as source images;
    inputting the images captured by the binocular bionic eye cameras into a simultaneous localization and mapping (SLAM) system, and predicting camera poses and a sparse depth map of the target image by using the SLAM system;
    inputting the target image and the sparse depth map that is obtained by a SLAM thread into a depth estimation network based on a channel self-attention mechanism architecture to obtain a scene depth map of the target image, wherein the depth estimation network based on the channel self-attention mechanism architecture comprises an encoder, a structure-aware module, and a decoder, and parameters of the depth estimation network based on the channel self-attention mechanism architecture are updated in the following manner:
        inputting the target image into the encoder, which uses a ResNet-18 network as a backbone to extract semantic features and then inputs the semantic into the structure-aware module to generate new features;
        inputting a new feature map generated by the structure-aware module into the decoder, wherein the decoder first performs a 3×3 convolution and upsampling on the new feature map generated by the structure-aware module, and then inputs the feature map into a detail-aware module, and the feature map obtained by the detail-aware module is further subject to two 1×1 convolutions and a sigmoid function; and obtaining a final dense depth map at original resolution after decoding;
        inputting the sparse depth map obtained by the SLAM system into a final layer structure of the decoder for pre-training the depth estimation network; and
        based on a relative pose between a target image frame and a neighboring frame or the other one of the binocular bionic eye cameras, projecting and reconstructing the target map by using the dense depth map of the target image predicted by the depth estimation network based on the channel self-attention mechanism and the relative pose between frames, then constructing a re-projection error between the target image and the reconstructed target image, and minimizing the re-projection error during training.

2. The image depth estimation method according to claim 1, wherein an operation process of the structure-aware module comprises:
    given a feature map $F \in \mathbb{R}^{C \times H \times W}$ generated by the ResNet-18 encoder, first reshaping $F$ into $\mathbb{R}^{C \times N}$, wherein $N = H \times W$ is the number of pixels, and then multiplying $F$ with a transposed matrix of $F$ to calculate a feature similarity $S \in \mathbb{R}^{C \times C}$:

    $$S_{ij} = F_i \cdot F_j^{T},$$

    wherein $i$ and $j$ represent any two channels, and $S_{ij}$ represents a feature similarity between the two channels;
    transforming the similarity matrix $S$ into a distinctiveness matrix $D \in \mathbb{R}^{C \times C}$ through element-wise subtraction:

    $$D_{ij} = \max_i(S) - S_{ij},$$

    wherein $D_{ij}$ represents an influence of channel $j$ on channel $i$;
    applying a softmax layer to obtain an attention map $A \in \mathbb{R}^{C \times C}$:

    $$A_{ij} = \frac{\exp(D_{ij})}{\sum_{j=1}^{C} \exp(D_{ij})},$$

    wherein $A_{ij}$ represents concentrating attention on specific parts of the two channels, and extracting key information from the channels while ignoring irrelevant information;
    multiplying the attention map $A$ and the transposed matrix of $F$, reshaping a result into $\mathbb{R}^{C \times H \times W}$, and then performing element-wise summation between $F$ and $\mathbb{R}^{C \times H \times W}$ to obtain a final output $E \in \mathbb{R}^{C \times H \times W}$:

    $$E_i = \sum_{j=1}^{C} \left(A_{ij} F_j\right) + F_i.$$

3. The image depth estimation method according to claim 1, wherein the detail-aware module restores the original resolution by fusing high-level features $H$ and low-level features $L$ from skip connections, and a specific operation process comprises:
    first concatenating the low-level features $L$ and the high-level features $H$, and then applying a convolutional layer followed by batch normalization to obtain $U$ to balance feature scales:

    $$U = \sigma\left(\mathrm{BN}\left(W_1 \otimes f(L, H)\right)\right),$$

    wherein $f(\cdot)$ represents concatenation, $\otimes$ represents a 3×3 or 1×1 convolution, BN represents batch normalization, and ReLU is used as the activation function $\sigma(\cdot)$;
    compressing $U$ into a vector by global average pooling to obtain global context, and using two 1×1 convolutional layers and a sigmoid function to calculate a weight vector $V \in \mathbb{R}^{1 \times 1 \times C}$ so as to recalibrate channel features and measure importance of the channel features: V = δ(W_2 ⊗ σ(W_3 ⊗ 1/(H×W) …
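The channel self-attention steps recited in claim 2 (similarity, distinctiveness, softmax, residual sum) can be sketched in plain Python for a tiny feature map. This is an illustrative reading of the claim text, not the patentee's implementation: the function names `channel_attention` and `structure_aware` are hypothetical, and the claim's `max_i(S)` is assumed here to mean the maximum over row `i` of `S`.

```python
import math


def channel_attention(flat):
    """flat: C x N feature matrix (list of lists). Returns the C x C map A."""
    C = len(flat)
    # S_ij = F_i . F_j^T : similarity between channels i and j.
    S = [[sum(x * y for x, y in zip(flat[i], flat[j])) for j in range(C)]
         for i in range(C)]
    A = []
    for row in S:
        row_max = max(row)              # assumption: max over row i of S
        D = [row_max - s for s in row]  # distinctiveness D_ij
        shift = max(D)                  # stability shift; softmax is unchanged
        exps = [math.exp(d - shift) for d in D]
        total = sum(exps)
        A.append([x / total for x in exps])  # softmax over j
    return A


def structure_aware(F):
    """F: C x H x W nested lists. Returns E with the same shape."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # Reshape C x H x W into C x N with N = H * W.
    flat = [[v for r in ch for v in r] for ch in F]
    A = channel_attention(flat)
    # E_i = sum_j A_ij * F_j + F_i (attention output plus residual).
    out = [[sum(A[i][j] * flat[j][n] for j in range(C)) + flat[i][n]
            for n in range(H * W)] for i in range(C)]
    # Reshape back to C x H x W.
    return [[out[i][r * W:(r + 1) * W] for r in range(H)] for i in range(C)]


# Usage: two channels, each a 1 x 2 feature map.
E = structure_aware([[[1.0, 0.0]], [[0.0, 1.0]]])
```

Because `D` grows as similarity falls, the softmax gives more weight to channels that are least similar to channel `i`, which is the opposite of ordinary dot-product attention.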

Assignees

Univ Jining, Univ Shanghai

Classifications

  • Depth or disparity estimation from stereoscopic image signals

  • Camera pose

  • Artificial neural networks [ANN]

  • Training; Learning

  • Scaling of whole images or parts thereof, e.g. expanding or contracting

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Univ Jining, Univ Shanghai
What technology area does this patent fall under?
Primary CPC classification H04N23/698. Mapped technology areas include Electricity.
When was this patent published?
Publication date Nov 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.