Efficient two-stream network system and method for isolated sign language recognition using accumulative video motion

US12469333B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12469333-B2
Application number  US-202318460897-A
Country             US
Kind code           B2
Filing date         Sep 5, 2023
Priority date       Sep 5, 2023
Publication date    Nov 11, 2025
Grant date          Nov 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A sign language recognition system is described. The system includes a motion sensor, a processing circuitry and a display device. The motion sensor captures and records a dynamic sign language gesture as a sign video stream. The processing circuitry is configured with a key postures extractor, an accumulative video motion (AVM), and a sign recognition network (SRN). The key postures extractor captures main postures of the dynamic sign language gesture by extracting key frames in the sign video stream. The AVM captures motion of the sign video stream frames and transforms the motion in an AVM frame into a single AVM image. The SRN is configured as a convolutional network. The main postures and AVM image are fed into a two-stream network. The features from the two stream network are concatenated and fed into the SRN for learning fused features and performing classification of the sign language gesture.
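
The AVM step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this example assumes that "motion" means the absolute difference between consecutive frames and that the accumulated motion is rescaled to the 0-255 range. The function name `accumulative_motion_image` and the tiny grayscale frames are hypothetical, and plain Python lists stand in for the array library a real implementation would likely use.

```python
def accumulative_motion_image(frames):
    """Collapse a sequence of grayscale frames into a single motion image.

    Assumption (not from the patent text): motion is the absolute
    difference between consecutive frames, summed over the sequence.
    Each frame is a 2D list of numbers with identical dimensions.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure motion")
    height, width = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * width for _ in range(height)]
    # Accumulate per-pixel motion over every consecutive pair of frames.
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                acc[y][x] += abs(curr[y][x] - prev[y][x])
    # Rescale so the pixel with the most motion maps to 255.
    peak = max(max(row) for row in acc)
    if peak == 0:
        return [[0] * width for _ in range(height)]
    return [[round(255 * value / peak) for value in row] for row in acc]


# Three hypothetical 2x2 frames: one pixel changes, then another.
key_frames = [
    [[0, 0], [0, 0]],
    [[4, 0], [0, 0]],
    [[4, 20], [0, 0]],
]
avm = accumulative_motion_image(key_frames)
print(avm)
```

Pixels that never change stay at zero, so the resulting image highlights where the hands moved across the whole sign. The sketch produces a single-channel image, not the RGB image the claims mention.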

First claim

Opening claim text (preview).

The invention claimed is:

1. A sign language recognition system, comprising: a motion sensor for capturing and recording a dynamic sign language gesture as a sign video stream; processing circuitry configured with a key postures extractor that captures main postures of the dynamic sign language gesture by extracting key frames in the sign video stream; an accumulative video motion (AVM) module that captures motion of the key frames and transforms the motion in an AVM frame into a single AVM image while preserving spatiotemporal information of the sign language gesture; and a sign recognition network (SRN) configured as a convolutional network, wherein the main postures and AVM image are fed into a two-stream network, and wherein features from the two-stream network are concatenated and fed into the SRN for learning fused features and performing classification of the sign language gesture; and a display device that outputs the classification as a natural language word.

2. The system of claim 1, wherein the key postures extractor includes extracting the key frames by employing hand trajectories captured by tracking hand joint points; preprocessing the joint points by smoothing hand locations using a median filter to remove outlier joint points; extracting the key frames by connecting the hand locations during signing to form a polygon, wherein sharp changes in hand locations are represented as vertices of the polygon; and iteratively repeating a reduction algorithm to recompute importance of remaining vertices until N vertices remain in the polygon to obtain a reduced trajectory.

3. The system of claim 1, wherein a stream of the two-stream network is a dynamic motion network (DMN) that uses the main postures to learn the preserved spatiotemporal information of the sign language gesture.

4. The system of claim 3, wherein, in the DMN, the extracted key frames are fed into a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network to learn and extract the preserved spatiotemporal information from the key frame of the sign language gesture.

5. The system of claim 1, wherein a stream of the two-stream network is an accumulative motion network (AMN) that learns the motion in the AVM image.

6. The system of claim 5, wherein the AVM image is fed into the AMN that uses a convolutional neural network (CNN) network fine-tuned on a pre-trained network.

7. The system of claim 5, wherein the AMN utilizes an accumulated summation between the key frames and produces an RGB image representing a whole sign.

8. The system of claim 1, wherein the processing circuitry is further configured to preserve the spatiotemporal information of the dynamic sign language gesture by fusing the sign's main postures in forward and backward directions to generate the AVM image.

9. A method of recognizing sign language, comprising: capturing and recording, via a motion sensor, a dynamic sign language gesture as a sign video stream; capturing, via a key postures extractor, main postures of the dynamic sign language gesture by extracting key frames in the sign video stream; capturing, via an accumulative video motion (AVM) module, motion of the key frames and transforming the motion in an AVM frame into a single AVM image while preserving spatiotemporal information of the sign language gesture; feeding the main postures and AVM image into a two-stream network; concatenating features from the two stream network; feeding the concatenated features into a SRN for learning fused features; performing classification of the sign language gesture; and outputting, via a display device, the classification as a natural language word.

10. The method of claim 9, further comprising, via the key postures extractor, extracting the key frames by employing hand trajectories captured by tracking hand joint points; preprocessing the joint points by smoothing hand locations using a median filter to remove outlier joint points; extracting the key frames by connecting the hand locations during signing to form a polygon, wherein sharp changes in hand locations are represented as vertices of the polygon; and iteratively repeating a reduction algorithm to recompute importance of remaining vertices until N vertices remain in the polygon to obtain a reduced trajectory.

11. The method of claim 9, further comprising: learning, via a dynamic motion network (DMN), the preserved spatiotemporal information of the sign language gesture.

12. The method of claim 11, further comprising: in the DMN, feeding the extracted key frames into a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network to learn and extract the preserved spatiotemporal information from the key frame of the sign language gesture.

13. The method of claim 9, further comprising learning, via an accumulative motion network (AMN), the motion in the AVM image.

14. The method of claim 13, further comprising: feeding the AVM image into the AMN that uses a convolutional neural network (CNN) network fine-tuned on a pre-trained network.

15. The method of claim 13, further comprising: computing an accumulated summation between the key frames; the AMN utilizing the accumulated summation between the key frames and producing an RGB image representing a whole sign.

16. The method of claim 9, further comprising: preserving the spatiotemporal information of the dynamic sign language gesture by fusing the sign's main postures in forward and backward directions to generate the AVM image.

17. A non-transitory computer readable storage medium storing program instructions, which when executed by computing circuitry, perform a method of recognizing sign language, comprising: capturing and recording, via a motion sensor, a dynamic sign language gesture as a sign video stream; capturing, via a key postures extractor, main postures of the dynamic sign language gesture by extracting key frames in the sign video stream; capturing, via an accumulative video motion (AVM) module, motion of the key frames and transforming the motion in an AVM frame into a single AVM image while preserving spatiotemporal information of the sign language gesture; feeding the main postures and AVM image into a two-stream network; concatenating features from the two stream network; feeding the concatenated features into a SRN for learning fused features; performing classification of the sign language gesture; and outputting, via a display device, the classification as a natural language word.

18. The storage medium of claim 17, further comprising, via the key postures extractor, extracting the key frames by employing hand trajectories captured by tracking hand joint points; preprocessing the joint points by smoothing hand locations using a median filter to remove outlier joint points; extracting the key frames by connecting the hand locations during signing to form a polygon, wherein sharp changes in hand locations are represented as vertices of the polygon; and iteratively repeating a reduction algorithm to recompute importance of remaining vertices until N vertices remain in the polygon to obtain a reduced trajectory.

19. The storage medium of claim 17, further comprising: in a dynamic motion network (DMN), feeding the extracted key frames into a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network to learn and extract the preserved spatiotemporal in …
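
The key-frame extraction recited in claims 2, 10, and 18 (median-filter smoothing, then iterative vertex reduction until N vertices remain) can be sketched as follows. The claims do not say how vertex "importance" is measured, so this example assumes it is the area of the triangle a vertex forms with its two neighbours, a common choice in line-simplification algorithms. The names `median_smooth` and `reduce_trajectory`, the window size of 3, and the sample trajectory are all hypothetical.

```python
from statistics import median


def median_smooth(points, window=3):
    """Smooth (x, y) hand locations with a sliding median to suppress outliers."""
    half = window // 2
    smoothed = []
    for i in range(len(points)):
        # The window is truncated at the ends of the trajectory.
        lo, hi = max(0, i - half), min(len(points), i + half + 1)
        xs = [p[0] for p in points[lo:hi]]
        ys = [p[1] for p in points[lo:hi]]
        smoothed.append((median(xs), median(ys)))
    return smoothed


def triangle_area(a, b, c):
    """Area of triangle abc; assumed here as the importance of vertex b."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2


def reduce_trajectory(points, n_keep):
    """Return the indices of the n_keep most important trajectory points.

    The least important interior vertex is dropped on each pass, and
    importance is recomputed for the remaining vertices every time.
    The first and last points are always kept.
    """
    if n_keep < 2:
        raise ValueError("n_keep must be at least 2")
    kept = list(range(len(points)))
    while len(kept) > n_keep:
        weakest = min(
            range(1, len(kept) - 1),
            key=lambda k: triangle_area(
                points[kept[k - 1]], points[kept[k]], points[kept[k + 1]]
            ),
        )
        del kept[weakest]
    return kept


# Hypothetical L-shaped hand trajectory with a sharp turn at index 2.
trajectory = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
key_frame_indices = reduce_trajectory(median_smooth(trajectory), 3)
print(key_frame_indices)
```

The returned indices identify which video frames to keep as key postures. Points on straight segments have zero triangle area and are removed first, so the sharp turn survives the reduction.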

Assignees

  • King Fahd University of Petroleum and Minerals

  • Saudi Data and Artificial Intelligence Authority (SDAIA)

Inventors

Classifications

  • using neural networks · CPC title

  • G06V20/40 · Primary

    in video content (extracting overlay text G06V20/62; video retrieval G06F16/70; processing of video elementary streams in video servers H04N21/234; processing of video elementary streams in video clients H04N21/44) · CPC title

  • Smoothing or thinning of the pattern; Morphological operations; Skeletonisation · CPC title

  • G06V40/28 · Primary

    Recognition of hand or arm movements, e.g. recognition of deaf sign language (static hand signs G06V40/113) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12469333B2 cover?
A sign language recognition system is described. The system includes a motion sensor, a processing circuitry and a display device. The motion sensor captures and records a dynamic sign language gesture as a sign video stream. The processing circuitry is configured with a key postures extractor, an accumulative video motion (AVM), and a sign recognition network (SRN). The key postures extractor …
Who is the assignee on this patent?
King Fahd University of Petroleum and Minerals; Saudi Data and Artificial Intelligence Authority (SDAIA)
What technology area does this patent fall under?
Primary CPC classification G06V20/40. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).