Streamlining an automatic visual inspection process
US-12141959-B2 · Nov 12, 2024 · US
US2022019847A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022019847-A1 |
| Application number | US-202117237978-A |
| Country | US |
| Kind code | A1 |
| Filing date | Apr 22, 2021 |
| Priority date | Jul 20, 2020 |
| Publication date | Jan 20, 2022 |
| Grant date | — |
Official abstract text for this publication.
An active interaction method, an electronic device and a readable storage medium, relating to the field of deep learning and image processing technologies, are disclosed. According to an embodiment, the active interaction method includes: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.
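The fuse, aggregate, and decide steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `fuse` stands in for the neural network that the claims use for fusion and simply appends a normalized frame index; the element-wise max in `aggregate_by_frame` and the linear score with a threshold in `should_interact` are placeholder choices. All function names, weights, and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def fuse(first_vec: Vector, frame_id: int, num_frames: int) -> Vector:
    """Placeholder fusion: append a normalized frame index to the features.

    Assumption: the source fuses with a neural network; this is only a stand-in.
    """
    return first_vec + [frame_id / max(num_frames - 1, 1)]


def aggregate_by_frame(targets: List[Tuple[int, Vector]]) -> Dict[int, Vector]:
    """Group second feature vectors by frame id, then pool each group.

    Assumption: element-wise max is used as the pooling operation.
    """
    groups: Dict[int, List[Vector]] = defaultdict(list)
    for frame_id, vec in targets:
        groups[frame_id].append(vec)
    return {fid: [max(col) for col in zip(*vecs)] for fid, vecs in groups.items()}


def should_interact(third_vec: Vector, weights: Vector, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: linear score compared against a threshold."""
    score = sum(w * x for w, x in zip(weights, third_vec))
    return score > threshold


# Hypothetical detections: (frame id, first feature vector of one visual target).
detections = [(0, [0.1, 0.9]), (0, [0.4, 0.2]), (1, [0.7, 0.3])]
num_frames = 2

second = [(fid, fuse(vec, fid, num_frames)) for fid, vec in detections]
third = aggregate_by_frame(second)  # one third feature vector per frame

preset_frame = num_frames - 1  # assumption: the preset frame is the latest one
decision = should_interact(third[preset_frame], weights=[1.0, 0.0, 0.0])
```

Grouping by frame identifier is what lets per-target vectors from many frames be processed together and still collapse into exactly one vector per frame.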
Opening claim text (preview).
What is claimed is:
1. An active interaction method, comprising: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs, to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.
2. The method according to claim 1, wherein extracting the visual target from each image frame of the video comprises: extracting a specific target from each image frame of the video as the visual target.
3. The method according to claim 1, wherein extracting the visual target from each image frame of the video, and generating the first feature vector of each visual target comprises: annotating the visual target according to a feature map of each image frame; extracting a feature-map sub-region corresponding to the visual target from the feature map, and converting the feature-map sub-region into a sub-feature-map with a preset size; and performing a global average pooling operation on each sub-feature-map, so as to obtain the first feature vector corresponding to each visual target.
4. The method according to claim 3, further comprising: after the first feature vector of each visual target is obtained, for each visual target, determining its coordinates at the upper left corner and its coordinates at the lower right corner in the image frame in a two-dimensional coordinate system with a center of the image frame as an origin; selecting a plurality of points from a range limited by the coordinates corresponding to each visual target in the image frame, so as to establish positional representation of each visual target in a two-dimensional plane; and flattening the established positional representation into a position feature vector with a preset number of dimensions, and then splicing the position feature vector with the first feature vector of each visual target.
5. The method according to claim 1, wherein for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs to generate a second feature vector of each visual target comprises: for each image frame in the video, inputting the first feature vector of the visual target and the identification information of the image frame to which the visual target belongs into a pre-created neural network model; and taking an output result of the neural network model as the second feature vector of the visual target, wherein the neural network model comprises a plurality of decoder blocks, and each decoder block comprises a self-attention layer and a feed-forward layer.
6. The method according to claim 1, wherein initiating active interaction comprises: acquiring feature vectors corresponding respectively to a plurality of multi-modal interaction modes; determining, from the plurality of multi-modal interaction modes, a multi-modal interaction mode to be employed when the active interaction is initiated, according to the third feature vector of the preset image frame and the feature vectors corresponding respectively to the plurality of multi-modal interaction modes; and performing the active interaction using the multi-modal interaction mode determined.
7. The method according to claim 6, wherein acquiring the feature vectors corresponding respectively to the plurality of multi-modal interaction modes comprises: acquiring a semantic vector for characterizing each interaction statement by a pre-trained language model; acquiring one-hot codes for characterizing each interaction expression and each interaction action respectively; constructing different multi-modal interaction modes using different interaction languages, interaction expressions and interaction actions; and splicing the semantic vector and the one-hot codes corresponding to each multi-modal interaction mode, and inputting the splicing result into a fully-connected network, and taking an output result as the feature vector corresponding to each multi-modal interaction mode.
8. The method according to claim 6, wherein the determining, from the plurality of multi-modal interaction modes, the multi-modal interaction mode employed when the active interaction is initiated according to the third feature vector of the preset image frame and the feature vectors corresponding respectively to the plurality of multi-modal interaction modes comprises: multiplying the third feature vector of the preset image frame respectively by the feature vectors corresponding respectively to the plurality of multi-modal interaction modes, and inputting the multiplication results into a pre-trained second discriminant model; and determining the multi-modal interaction mode employed when the active interaction is initiated according to an output result of the second discriminant model.
9. An electronic device, comprising: at least one processor; a memory connected with the at least one processor communicatively; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to carry out an active interaction method, which comprises: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs, to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.
10. The electronic device according to claim 9, wherein extracting the visual target from each image frame of the video comprises: extracting a specific target from each image frame of the video as the visual target.
11. The electronic device according to claim 9, wherein extracting the visual target from each image frame of the video, and generating the first feature vector of each visual target comprises: annotating the visual target according to a feature map of each image frame; extracting a feature-map sub-region corresponding to the visual target from the feature map, and converting the feature-map sub-region into a sub-feature-map with a preset size; and performing a global average pooling operation on each sub-feature-map, so as to obtain the first feature vector corresponding to each visual target.
12. The electronic device according to claim 11, wherein the method further comprises: after the first feature vector of each visual target is obtained, for each visual target, determining its coordinates at the upper left corner and its coordinates at the lower right corner in the image frame in a two-dimensional coordinate system with a center of the image frame as an origin; selecting a plurality of points from a
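The feature extraction recited in claims 3 and 11 (crop a feature-map sub-region, convert it to a preset size, then apply global average pooling) can be sketched as below. This is an illustration under assumptions, not the method as implemented by the applicant: the feature map is a plain nested list of channels, the box is given in feature-map cells and is assumed non-empty, and nearest-neighbour sampling is swapped in for whatever resizing operator the applicant actually uses. All names and sample values are hypothetical.

```python
from typing import List, Tuple

Channel = List[List[float]]   # one H x W channel
FeatureMap = List[Channel]    # C channels
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive, in cells


def crop(fmap: FeatureMap, box: Box) -> FeatureMap:
    """Extract the sub-region of every channel covered by the box."""
    x0, y0, x1, y1 = box
    return [[row[x0:x1] for row in channel[y0:y1]] for channel in fmap]


def resize_nearest(region: FeatureMap, size: Tuple[int, int]) -> FeatureMap:
    """Convert each channel to a preset (height, width) by nearest-neighbour sampling.

    Assumption: a simple stand-in for the resizing step named in the claims.
    """
    out_h, out_w = size
    resized = []
    for channel in region:
        in_h, in_w = len(channel), len(channel[0])
        resized.append([
            [channel[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)
        ])
    return resized


def global_average_pool(region: FeatureMap) -> List[float]:
    """Average every channel down to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in region
    ]


def first_feature_vector(fmap: FeatureMap, box: Box,
                         size: Tuple[int, int] = (2, 2)) -> List[float]:
    """Crop, resize to the preset size, then pool: one value per channel."""
    return global_average_pool(resize_nearest(crop(fmap, box), size))


# Hypothetical 2-channel, 4 x 4 feature map.
fmap = [
    [[1.0, 1.0, 2.0, 2.0],
     [1.0, 1.0, 2.0, 2.0],
     [3.0, 3.0, 4.0, 4.0],
     [3.0, 3.0, 4.0, 4.0]],
    [[0.5] * 4 for _ in range(4)],
]

whole_frame = first_feature_vector(fmap, (0, 0, 4, 4))
bottom_right = first_feature_vector(fmap, (2, 2, 4, 4))
```

Because the pooling step averages over the whole resized region, the resulting vector has one entry per channel regardless of how large the original box was, which is what lets targets of different sizes share a fixed-length representation.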
of extracted features · CPC title
based on the proximity to a decision surface, e.g. support vector machines · CPC title
Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor · CPC title
based on discrimination criteria, e.g. discriminant analysis · CPC title
Contour-based spatial representations, e.g. vector-coding · CPC title