Video frame action detection using gated history

US11895343B2 · US · B2

Patent metadata
Publication number: US-11895343-B2
Application number: US-202217852310-A
Country: US
Kind code: B2
Filing date: Jun 28, 2022
Priority date: Jun 3, 2022
Publication date: Feb 6, 2024
Grant date: Feb 6, 2024

Abstract

Official abstract text for this publication.

Example solutions for video frame action detection use a gated history and include: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weighting the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generating an action prediction for the current video frame.
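The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes each frame has already been reduced to a small feature vector, it uses scaled dot-product similarity with the current frame as a stand-in for however the attention weights are really computed, and the names `split_stream`, `attention_weights`, `predict_action`, and `class_prototypes` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_stream(frames, num_present):
    """Group frames into (historical, present); the last present frame is the current frame."""
    if not 0 < num_present < len(frames):
        raise ValueError("need at least one present and one historical frame")
    return frames[:-num_present], frames[-num_present:]


def attention_weights(history, current):
    """Assumed scoring: scaled dot-product similarity between each historical frame and the current frame."""
    scale = math.sqrt(len(current))
    return softmax([dot(h, current) / scale for h in history])


def predict_action(frames, num_present, class_prototypes):
    """Weight the history, fuse it with the present frames, and pick the best-scoring class."""
    history, present = split_stream(frames, num_present)
    current = present[-1]
    weights = attention_weights(history, current)
    # Weight each historical frame by how informative it is for the current frame.
    weighted = [[w * x for x in h] for h, w in zip(history, weights)]
    context = [sum(col) for col in zip(*weighted)]
    present_summary = [sum(col) / len(present) for col in zip(*present)]
    fused = context + present_summary
    # Hypothetical linear classifier: one prototype vector per class.
    scores = {label: dot(fused, proto) for label, proto in class_prototypes.items()}
    return max(scores, key=scores.get), weights


frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
prototypes = {"no_action": [1.0, 0.0, 1.0, 0.0], "wave": [0.0, 1.0, 0.0, 1.0]}
label, weights = predict_action(frames, num_present=2, class_prototypes=prototypes)
```

In this toy input, the second historical frame resembles the current frame, so it receives the larger attention weight and dominates the history context that is fused with the present frames.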

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: receive a video stream comprising a plurality of video frames; group the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determine a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weight the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generate an action prediction for the current video frame.

2. The system of claim 1, wherein the instructions are further operative to: based on at least the action prediction for the current video frame, generate an annotation for the current video frame; and display the current video frame subject to the annotation for the current video frame.

3. The system of claim 1, wherein determining the set of attention weights comprises: determining, for each video frame of the set of historical video frames, a position-guided gating score.

4. The system of claim 1, wherein the plurality of video frames comprises a set of history frames and, for each individual history frame in the set of history frames, a set of subsequently-observed video frames, wherein the set of subsequently-observed video frames is more recent than the individual history frame, and wherein the instructions are further operative to: based on at least the set of history frames and their sets of subsequently-observed video frames, extract features from the set of historical video frames; and encode the extracted features.

5. The system of claim 4, wherein extracting features does not use optical flow.

6. The system of claim 1, wherein the instructions are further operative to: perform background suppression, wherein the action prediction comprises a confidence and wherein performing the background suppression comprises: modifying the confidence, including by weighting low confidence video frames more heavily, with separate emphasis on action and background classes, for a classifier that generates the action prediction.

7. The system of claim 1, wherein the action prediction comprises a no action prediction or an action class prediction selected from a plurality of action classes.

8. A computerized method comprising: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weighting the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generating an action prediction for the current video frame.

9. The method of claim 8, further comprising: based on at least the action prediction for the current video frame, generating an annotation for the current video frame; and displaying the current video frame subject to the annotation for the current video frame.

10. The method of claim 8, wherein determining the set of attention weights comprises: determining, for each video frame of the set of historical video frames, a position-guided gating score.

11. The method of claim 8, wherein the plurality of video frames comprises a set of history frames and, for each history frame in the set of history frames, a set of subsequently-observed video frames, wherein the set of subsequently-observed video frames is more recent than the history frame, and wherein the method further comprises: based on at least the set of history frames and their sets of subsequently-observed video frames, extracting features from the set of historical video frames; and encoding the extracted features.

12. The method of claim 11, wherein extracting features does not use optical flow.

13. The method of claim 8, further comprising: performing background suppression, wherein the action prediction comprises a confidence and wherein performing the background suppression comprises: weighting low confidence video frames more heavily, with separate emphasis on action and background classes, for a classifier that generates the action prediction.

14. The method of claim 8, wherein the action prediction comprises a no action prediction or an action class prediction selected from a plurality of action classes.

15. One or more computer storage devices having computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weighting the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generating an action prediction for the current video frame.

16. The one or more computer storage devices of claim 15, wherein the operations further comprise: based on at least the action prediction for the current video frame, generating an annotation for the current video frame; and displaying the current video frame subject to the annotation for the current video frame.

17. The one or more computer storage devices of claim 15, wherein determining the set of attention weights comprises: determining, for each video frame of the set of historical video frames, a position-guided gating score.

18. The one or more computer storage devices of claim 15, wherein the plurality of video frames comprises a set of history frames and, for each history frame in the set of history frames, a set of subsequently-observed video frames, wherein the set of subsequently-observed video frames is more recent than the history frame, and wherein the operations further comprise: based on at least the set of history frames and their sets of subsequently-observed video frames, extracting features from the set of historical video frames; and encoding the extracted features.

19. The one or more computer storage devices of claim 15, wherein the operations further comprise: performing background suppression, wherein the action prediction comprises a confidence and wherein performing the background suppression comprises: weighting low confidence video frames more heavily, with separate emphasis on action and background classes, for a classifier that generates the action prediction.

20. The one or more computer storage devices of cla
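The background suppression recited in claims 6, 13, and 19 weights low-confidence frames more heavily, with separate emphasis for action and background classes. One way to picture that idea is a focal-style weighting of the per-frame log loss. The sketch below is an assumption for illustration only: the page does not give the patent's formula, and the function names and the two exponents `gamma_action` and `gamma_background` are hypothetical.

```python
import math


def suppression_weight(confidence, is_background,
                       gamma_action=2.0, gamma_background=1.0):
    """Focal-style weight: the lower the confidence, the larger the weight.

    Separate exponents give action and background frames different emphasis.
    """
    gamma = gamma_background if is_background else gamma_action
    return (1.0 - confidence) ** gamma


def background_suppressed_loss(confidences, background_flags,
                               gamma_action=2.0, gamma_background=1.0):
    """Mean weighted negative log-likelihood over a batch of frames.

    Each confidence is the classifier's probability for the frame's true class.
    """
    total = 0.0
    for p, is_bg in zip(confidences, background_flags):
        p = min(max(p, 1e-12), 1.0)  # clamp to avoid log(0)
        weight = suppression_weight(p, is_bg, gamma_action, gamma_background)
        total += weight * -math.log(p)
    return total / len(confidences)


# A confident action frame contributes far less than an uncertain one.
easy = background_suppressed_loss([0.9], [False])
hard = background_suppressed_loss([0.3], [False])
```

With these assumed exponents, an action frame at confidence 0.5 gets weight 0.25 while a background frame at the same confidence gets 0.5, which is one way to express the "separate emphasis" between the two kinds of class.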

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • Surveillance or monitoring of activities, e.g. for recognising suspicious objects (recognising microscopic objects G06V20/69)

  • Recognition of whole body movements, e.g. for sport training

  • using neural networks

  • of news video content

  • G06V20/46 (Primary): Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11895343B2 cover?
Example solutions for video frame action detection use a gated history and include: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video…

Who is the assignee on this patent?
Microsoft Technology Licensing, LLC

What technology area does this patent fall under?
The primary CPC classification is G06V20/46. Mapped technology areas include Physics.

When was this patent published?
The publication date is Feb 6, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).