Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics

US10679063B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10679063-B2
Application number  US-201514846318-A
Country             US
Kind code           B2
Filing date         Sep 4, 2015
Priority date       Apr 23, 2012
Publication date    Jun 9, 2020
Grant date          Jun 9, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computing system for recognizing salient events depicted in a video utilizes learning algorithms to detect audio and visual features of the video. The computing system identifies one or more salient events depicted in the video based on the audio and visual features.
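
The flow the abstract describes (detect audio and visual features, then identify salient events from them) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Feature` class, the `EVENT_DEFINITIONS` table, the feature labels, and the 0.5 threshold are all hypothetical, and the learned detectors are replaced by precomputed scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One detector output (hypothetical structure, not from the patent)."""
    modality: str  # "audio" or "visual"
    name: str      # label assumed to come from a learned detector
    score: float   # detector confidence in [0, 1]


# Hypothetical event definitions: each event lists the audio and visual
# labels that must all be present for the event to be recognized.
EVENT_DEFINITIONS = {
    "birthday_party": {"audio": {"singing"}, "visual": {"cake", "candles"}},
    "goal_scored": {"audio": {"crowd_cheer"}, "visual": {"ball_in_net"}},
}


def recognize_events(features, threshold=0.5):
    """Return the events whose required audio and visual labels were detected."""
    detected = {"audio": set(), "visual": set()}
    for feature in features:
        # Keep only detections at or above the assumed confidence threshold.
        if feature.score >= threshold:
            detected[feature.modality].add(feature.name)
    return sorted(
        event
        for event, required in EVENT_DEFINITIONS.items()
        if required["audio"] <= detected["audio"]
        and required["visual"] <= detected["visual"]
    )
```

Requiring evidence from both modalities is the point of the sketch: a cake seen without singing heard, or a cheer heard without a ball seen in the net, does not trigger an event here.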

First claim

Opening claim text (preview).

The invention claimed is:

1. A computing system for understanding content of a video, the computing system configured to: algorithmically extract one or more visual features, one or more audio features and one or more textual features from the video using a plurality of detection algorithms; determine an audio concept evidenced by the one or more extracted audio features; determine a visual concept evidenced by the one or more extracted visual and audio features; access a knowledge base that defines events and maintains a mapping of relationships between different combinations of visual features, audio features, textual features, audio concepts and visual concepts with different ones of the events, wherein the mapping is based on semantic descriptions of the features and concepts defined by a plurality of models trained using one or more machine learning techniques; with the knowledge base, determine a semantic relationship between at least the audio concept and the visual concept as defined by the mapping; and recognize a salient event depicted in at least a portion of the video based at least partly on the semantic relationship between the audio concept and the visual concept and an event definition in the knowledge base.

2. The computing system of claim 1, configured to algorithmically detect at least one prosodic feature of an audio segment of the video and determine the audio concept based on the at least one prosodic feature.

3. The computing system of claim 1, configured to recognize the salient event based on a prosodic feature.

4. The computing system of claim 1, configured to segment the video into a plurality of temporal segments, and determine the audio concept based on audio features extracted from at least one of the segments.

5. The computing system of claim 1, configured to incorporate the portion of the video depicting the salient event into a video clip.

6. The computing system of claim 5, configured to communicate the video clip to a computing device over a network.

7. The computing system of claim 5, configured to one or more of: (i) interactively edit the video clip and (ii) automatically edit the video clip.

8. The computing system of claim 1, configured to generate a natural language description of the salient event.

9. The computing system of claim 1, configured to determine the audio concept using one or more audio concept detectors trained by a machine-learning technique or to determine the visual concept using one or more visual concept detectors trained by a machine-learning technique or recognize the salient event using one or more detectors trained by a machine-learning technique.

10. A method for understanding content of a video, the method comprising, by a computing system comprising one or more computing devices: algorithmically extracting one or more visual features and one or more audio features from a video; determining an audio concept evidenced by the one or more extracted audio features; determining a visual concept evidenced by the one or more extracted visual features; accessing a knowledge base that defines events and maintains a mapping of relationships between different combinations of visual features, audio features, textual features, audio concepts, and visual concepts with different ones of the events, wherein the mapping is based on semantic descriptions of the features and concepts defined by a plurality of models trained using one or more machine learning techniques; with the knowledge base, determining a semantic relationship between at least the audio concept and the visual concept as defined by the mapping; recognizing a plurality of salient events depicted in at least a portion of the video based at last partly on the semantic relationship between the audio concept and the visual concept and an event definition in the knowledge base; and arranging the salient events according to a saliency criterion.

11. The method of claim 10, comprising detecting at least one prosodic feature of an audio segment of the video and determine the audio concept based on the at least one prosodic feature.

12. The method of claim 10, comprising recognizing the salient event based on a prosodic feature.

13. The method of claim 10, comprising segmenting the video into a plurality of temporal segments, and determining the audio concept based on audio features extracted from at least one of the segments.

14. The method of claim 10, comprising incorporating the portion of the video depicting the salient event into a video clip.

15. The method of claim 14, configured to communicate the video clip to another computing device over a network.

16. The method of claim 14, comprising one or more of: (i) interactively editing the video clip and (ii) automatically editing the video clip.

17. The method of claim 10, comprising generating a natural language description of the salient event.

18. One or more non-transitory machine accessible storage media comprising instructions executable by at least one processor to: algorithmically extract one or more visual features and one or more audio features from a video; determine an audio concept evidenced by the one or more extracted audio features; determine a visual concept evidenced by the one or more extracted visual features; access a knowledge base that defines events and maintains a mapping of relationships between different combinations of visual features, audio features, audio concepts, and visual concepts with different ones of the events, wherein the mapping is based on semantic descriptions of the features and concepts defined by a plurality of models trained using one or more machine learning techniques; with the knowledge base, determine a semantic relationship between at least the audio concept and the visual concept as defined by the mapping; recognize a salient event depicted in at least a portion of the video based at least partly on the semantic relationship between the audio concept and the visual concept and an event definition in the knowledge base; and generate a visual presentation of the salient event according to a pre-defined presentation template.

19. The one or more machine accessible storage media of claim 18, comprising instructions to detect at least one prosodic feature of an audio segment of the video and recognize the salient event based on the at least one prosodic feature.

20. The method of claim 18, comprising incorporating the portion of the video depicting the salient event into a video clip.

21. The method of claim 20, configured to communicate the video clip to a computing device over a network.
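
Three steps recited in the claims lend themselves to a short sketch: temporal segmentation (claims 4 and 13), the knowledge-base lookup that relates an audio concept to a visual concept, and the arrangement of recognized events by a saliency criterion (claim 10). The code below is illustrative only. The `Segment` class, the `KNOWLEDGE_BASE` dictionary, the concept names, the fixed-length windows, and the use of confidence as the saliency criterion are all assumptions; the claims do not specify any of them, and the claimed mapping is built from trained models rather than a hand-written table.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A temporal segment with one audio and one visual concept (hypothetical)."""
    start: float          # seconds
    end: float            # seconds
    audio_concept: str
    visual_concept: str
    confidence: float     # assumed combined detector confidence in [0, 1]


# Hypothetical stand-in for the knowledge base: a mapping from
# (audio concept, visual concept) pairs to event names.
KNOWLEDGE_BASE = {
    ("cheering", "ball_in_net"): "goal",
    ("singing", "cake_with_candles"): "birthday_song",
}


def segment_bounds(duration, window):
    """Split [0, duration) into fixed-length windows; the last may be shorter."""
    if window <= 0:
        raise ValueError("window must be positive")
    bounds = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + window, duration)))
        start += window
    return bounds


def recognize_and_rank(segments):
    """Look up each segment's concept pair and order the matches by saliency."""
    events = []
    for seg in segments:
        event = KNOWLEDGE_BASE.get((seg.audio_concept, seg.visual_concept))
        if event is not None:
            events.append((event, seg.start, seg.end, seg.confidence))
    # Assumed saliency criterion: higher confidence first.
    return sorted(events, key=lambda e: e[3], reverse=True)
```

Segments whose concept pair has no entry in the table are simply dropped, so only pairs with a defined relationship produce events. Swapping the sort key is enough to try a different saliency criterion, such as segment duration.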

Assignees

Inventors

Classifications

  • for generating different versions · CPC title

  • Querying · CPC title

  • G06F16/78 · Primary

    Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • involving operations for analysing video streams, e.g. detecting features or characteristics (television picture signal circuitry for scene change detection H04N5/147; filtering for image enhancement G06T5/00; methods or arrangements for recognising scenes G06V20/00; arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title

  • Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10679063B2 cover?
A computing system for recognizing salient events depicted in a video utilizes learning algorithms to detect audio and visual features of the video. The computing system identifies one or more salient events depicted in the video based on the audio and visual features.
Who is the assignee on this patent?
Stanford Res Inst Int
What technology area does this patent fall under?
Primary CPC classification G06F16/78. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 9, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).