Active interaction method, electronic device and readable storage medium

US11734392B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11734392-B2
Application number  US-202117237978-A
Country             US
Kind code           B2
Filing date         Apr 22, 2021
Priority date       Jul 20, 2020
Publication date    Aug 22, 2023
Grant date          Aug 22, 2023

Abstract

Official abstract text for this publication.

An active interaction method, an electronic device and a readable storage medium, relating to the field of deep learning and image processing technologies, are disclosed. According to an embodiment, the active interaction method includes: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.
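The last three steps of the abstract (fusing each target vector with its frame identifier, aggregating vectors that share an identifier, and deciding on a preset frame) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the one-hot concatenation used for fusion, the max pooling used for aggregation, the logistic scorer, and all function names are hypothetical stand-ins chosen to keep the example dependency-free.

```python
import numpy as np

def fuse_with_frame_id(first_vecs, frame_ids, num_frames):
    """Fuse each target's first feature vector with its frame identifier.

    Assumption: fusion is a plain concatenation with a one-hot frame id.
    The claims describe a neural network for this step instead.
    """
    one_hot = np.eye(num_frames)[frame_ids]            # (N, num_frames)
    return np.concatenate([first_vecs, one_hot], axis=1)

def aggregate_by_frame(second_vecs, frame_ids, num_frames):
    """Pool all second feature vectors sharing a frame id into one vector."""
    third = np.zeros((num_frames, second_vecs.shape[1]))
    for f in range(num_frames):
        mask = frame_ids == f
        if mask.any():
            # Assumption: element-wise max pooling over the frame's targets.
            third[f] = second_vecs[mask].max(axis=0)
    return third

def should_interact(third_vec, weights, bias, threshold=0.5):
    """Hypothetical discriminant: logistic score on one frame's vector."""
    score = 1.0 / (1.0 + np.exp(-(third_vec @ weights + bias)))
    return bool(score > threshold)

# Toy input: 6 visual targets spread over 3 frames, 4-dim first vectors.
rng = np.random.default_rng(0)
frame_ids = np.array([0, 0, 1, 2, 2, 2])
first_vecs = rng.normal(size=(6, 4))

second_vecs = fuse_with_frame_id(first_vecs, frame_ids, num_frames=3)
third_vecs = aggregate_by_frame(second_vecs, frame_ids, num_frames=3)

# Assumption: the "preset image frame" is the most recent frame.
weights = rng.normal(size=third_vecs.shape[1])
decision = should_interact(third_vecs[-1], weights, bias=0.0)
```

Grouping by frame identifier is what lets a variable number of targets per frame collapse into a fixed-size per-frame vector, which a downstream decision model can then consume.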

First claim

Opening claim text (preview).

What is claimed is:

1. An active interaction method, comprising: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs, to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction between an intelligent device and its surroundings in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.

2. The method according to claim 1, wherein extracting the visual target from each image frame of the video comprises: extracting a specific target from each image frame of the video as the visual target.

3. The method according to claim 1, wherein extracting the visual target from each image frame of the video, and generating the first feature vector of each visual target comprises: annotating the visual target according to a feature map of each image frame; extracting a feature-map sub-region corresponding to the visual target from the feature map, and converting the feature-map sub-region into a sub-feature-map with a preset size; and performing a global average pooling operation on each sub-feature-map, so as to obtain the first feature vector corresponding to each visual target.

4. The method according to claim 3, further comprising: after the first feature vector of each visual target is obtained, for each visual target, determining its coordinates at the upper left corner and its coordinates at the lower right corner in the image frame in a two-dimensional coordinate system with a center of the image frame as an origin; selecting a plurality of points from a range limited by the coordinates corresponding to each visual target in the image frame, so as to establish positional representation of each visual target in a two-dimensional plane; and flattening the established positional representation into a position feature vector with a preset number of dimensions, and then splicing the position feature vector with the first feature vector of each visual target.

5. The method according to claim 1, wherein for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs to generate a second feature vector of each visual target comprises: for each image frame in the video, inputting the first feature vector of the visual target and the identification information of the image frame to which the visual target belongs into a pre-created neural network model; and taking an output result of the neural network model as the second feature vector of the visual target, wherein the neural network model comprises a plurality of decoder blocks, and each decoder block comprises a self-attention layer and a feed-forward layer.

6. The method according to claim 1, wherein initiating active interaction comprises: acquiring feature vectors corresponding respectively to a plurality of multi-modal interaction modes; determining, from the plurality of multi-modal interaction modes, a multi-modal interaction mode to be employed when the active interaction is initiated, according to the third feature vector of the preset image frame and the feature vectors corresponding respectively to the plurality of multi-modal interaction modes; and performing the active interaction using the multi-modal interaction mode determined.

7. The method according to claim 6, wherein acquiring the feature vectors corresponding respectively to the plurality of multi-modal interaction modes comprises: acquiring a semantic vector for characterizing each interaction statement by a pre-trained language model; acquiring one-hot codes for characterizing each interaction expression and each interaction action respectively; constructing different multi-modal interaction modes using different interaction languages, interaction expressions and interaction actions; and splicing the semantic vector and the one-hot codes corresponding to each multi-modal interaction mode, and inputting the splicing result into a fully-connected network, and taking an output result as the feature vector corresponding to each multi-modal interaction mode.

8. The method according to claim 6, wherein the determining, from the plurality of multi-modal interaction modes, the multi-modal interaction mode employed when the active interaction is initiated according to the third feature vector of the preset image frame and the feature vectors corresponding respectively to the plurality of multi-modal interaction modes comprises: multiplying the third feature vector of the preset image frame respectively by the feature vectors corresponding respectively to the plurality of multi-modal interaction modes, and inputting the multiplication results into a pre-trained second discriminant model; and determining the multi-modal interaction mode employed when the active interaction is initiated according to an output result of the second discriminant model.

9. An electronic device, comprising: at least one processor; a memory connected with the at least one processor communicatively; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to carry out an active interaction method, which comprises: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual target; for each image frame of the video, fusing the first feature vector of each visual target and identification information of the image frame to which the visual target belongs, to generate a second feature vector of each visual target; aggregating the second feature vectors with the same identification information respectively to generate a third feature vector corresponding to each image frame; and initiating active interaction between an intelligent device and its surroundings in response to determining that the active interaction is to be performed according to the third feature vector of a preset image frame.

10. The electronic device according to claim 9, wherein extracting the visual target from each image frame of the video comprises: extracting a specific target from each image frame of the video as the visual target.

11. The electronic device according to claim 9, wherein extracting the visual target from each image frame of the video, and generating the first feature vector of each visual target comprises: annotating the visual target according to a feature map of each image frame; extracting a feature-map sub-region corresponding to the visual target from the feature map, and converting the feature-map sub-region into a sub-feature-map with a preset size; and performing a global average pooling operation on each sub-feature-map, so as to obtain the first feature vector corresponding to each visual target.

12. The electronic device according to claim 11, wherein the method further comprises: after the first feature vector of each visual target is obtained, for each visual target, determining its coordinates at the upper left corner and its coordinates at the lower right corner in the image frame in a two-dimensional coordinate system with a center of the i
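The feature extraction recited in claims 3 and 4 (crop a sub-region of the feature map, resize it to a preset size, apply global average pooling, then splice on a position feature) could look roughly like the NumPy sketch below. Several details are assumptions, not taken from the claims: nearest-neighbour resampling stands in for whatever region-resizing operator is actually used, coordinates are normalised to [-1, 1], the y axis keeps the image's downward direction, and the function names, the 7x7 preset size and the 5x5 sampling grid are hypothetical.

```python
import numpy as np

def first_feature_vector(feature_map, box, preset_size=(7, 7)):
    """feature_map: (C, H, W); box: (x1, y1, x2, y2) in feature-map cells,
    with exclusive x2/y2. Returns a (C,) vector."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]              # feature-map sub-region
    # Assumption: nearest-neighbour resampling to the preset size.
    rows = np.linspace(0, region.shape[1] - 1, preset_size[0]).round().astype(int)
    cols = np.linspace(0, region.shape[2] - 1, preset_size[1]).round().astype(int)
    sub_map = region[:, rows][:, :, cols]              # (C, ph, pw)
    return sub_map.mean(axis=(1, 2))                   # global average pooling

def position_feature(box_px, image_size, grid=(5, 5)):
    """box_px: (x1, y1, x2, y2) in pixels, top-left origin.
    image_size: (width, height). Returns a flat vector of 2 * grid points."""
    w, h = image_size
    x1, y1, x2, y2 = box_px
    # Move the origin to the image centre; normalisation is an assumption.
    cx1, cx2 = (x1 - w / 2) / (w / 2), (x2 - w / 2) / (w / 2)
    cy1, cy2 = (y1 - h / 2) / (h / 2), (y2 - h / 2) / (h / 2)
    # Select a grid of points inside the box, corners included.
    xs = np.linspace(cx1, cx2, grid[0])
    ys = np.linspace(cy1, cy2, grid[1])
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx, gy], axis=-1).reshape(-1)     # flatten to 1-D

# Toy usage: an 8-channel feature map for a 640x480 frame.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 30, 40))
vec = first_feature_vector(fmap, box=(5, 4, 20, 18))
pos = position_feature((80, 64, 320, 288), image_size=(640, 480))
spliced = np.concatenate([vec, pos])                   # 8 + 50 dimensions
```

Resizing every sub-region to the same preset size before pooling means targets of different on-screen sizes yield vectors of identical length, and the spliced position feature preserves where in the frame each target sits.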

Assignees

  • Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • Energy efficient computing, e.g. low power processors, power management or thermal management · CPC title

  • G06F18/40 · Primary

    Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor · CPC title

  • based on discrimination criteria, e.g. discriminant analysis · CPC title

  • of extracted features · CPC title

  • Semantic analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11734392B2 cover?
An active interaction method, an electronic device and a readable storage medium, relating to the field of deep learning and image processing technologies, are disclosed. According to an embodiment, the active interaction method includes: acquiring a video shot in real time; extracting a visual target from each image frame of the video, and generating a first feature vector of each visual targe…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F18/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 22, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).