US2020293783A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2020293783-A1 |
| Application number | US-201916352605-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 13, 2019 |
| Priority date | Mar 13, 2019 |
| Publication date | Sep 17, 2020 |
| Grant date | — |
Official abstract text for this publication.
Implementations described herein relate to methods, devices, and computer-readable media to perform gating for video analysis. In some implementations, a computer-implemented method includes obtaining a video comprising a plurality of frames and corresponding audio. The method further includes performing sampling to select a subset of the plurality of frames based on a target frame rate and extracting a respective audio spectrogram for each frame in the subset of the plurality of frames. The method further includes reducing resolution of the subset of the plurality of frames. The method further includes applying a machine-learning based gating model to the subset of the plurality of frames and corresponding audio spectrograms and obtaining, as output of the gating model, an indication of whether to analyze the video to add one or more video annotations.
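The preprocessing steps named in the abstract (sampling at a target frame rate, then reducing resolution) can be illustrated with a minimal sketch. The function names `sample_frame_indices` and `downscale`, the evenly spaced sampling strategy, and the block-averaging downscale are all assumptions made for illustration; the abstract does not say how sampling or resolution reduction is performed. Spectrogram extraction is omitted.

```python
def sample_frame_indices(num_frames, video_fps, target_fps):
    """Pick evenly spaced frame indices so the subset approximates target_fps.

    Hypothetical helper: even spacing is an assumption, not the patent's method.
    """
    if not 0 < target_fps <= video_fps:
        raise ValueError("target_fps must be positive and at most video_fps")
    indices = []
    i = 0
    while True:
        # Index of the i-th sampled frame in the original frame sequence.
        idx = int(i * video_fps // target_fps)
        if idx >= num_frames:
            break
        indices.append(idx)
        i += 1
    return indices


def downscale(frame, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks.

    `frame` is a list of rows of grayscale values. Rows and columns that do
    not fill a complete block are dropped. Block averaging is an assumption.
    """
    height = len(frame) - len(frame) % factor
    width = len(frame[0]) - len(frame[0]) % factor
    reduced = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [frame[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        reduced.append(row)
    return reduced


if __name__ == "__main__":
    # One second of 30 fps video sampled at 5 fps keeps every sixth frame.
    print(sample_frame_indices(30, 30, 5))
    print(downscale([[0, 2, 4, 6], [2, 4, 6, 8],
                     [10, 10, 20, 20], [10, 10, 20, 20]], 2))
```

Sampling before downscaling, as in this sketch, means only the retained frames are resized, which keeps the cost of the gating step low relative to full analysis.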
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: obtaining a video comprising a plurality of frames and corresponding audio; performing sampling to select a subset of the plurality of frames based on a target frame rate that is less than or equal to a frame rate of the video; extracting a respective audio spectrogram for each frame in the subset of the plurality of frames; reducing resolution of the subset of the plurality of frames; after reducing the resolution, applying a machine-learning based gating model to the subset of the plurality of frames and corresponding audio spectrograms; and obtaining, as output of the gating model, an indication of whether to analyze the video to add one or more video annotations.

2. The computer-implemented method of claim 1, further comprising, prior to applying the gating model, dividing the video into a plurality of segments, each segment including multiple frames, and wherein applying the gating model is performed iteratively over the plurality of segments in sequence, wherein the indication is generated at each iteration.

3. The computer-implemented method of claim 2, wherein each segment of the plurality of segments overlaps with another segment of the plurality of segments.

4. The computer-implemented method of claim 2, wherein if the indication at a particular iteration is that the video is to be analyzed, application of the gating model is terminated such that one or more of the plurality of segments are excluded.

5. The computer-implemented method of claim 1, wherein the gating model is trained to determine whether a particular feature is present in input videos provided to the gating model.

6. The computer-implemented method of claim 5, wherein the particular feature includes at least one of a human face, a type of object, a type of movement, or a type of audio.
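The segment-wise application of the gating model with early termination, as recited in the dependent claims on segments, can be sketched as follows. The names `split_into_segments` and `gate_video`, the `stride` parameter, the score threshold of 0.5, and the toy model are hypothetical; the claims say only that segments may overlap and that gating stops once an indication to analyze is produced.

```python
def split_into_segments(frames, segment_len, stride):
    """Divide frames into segments; a stride below segment_len gives overlap.

    Hypothetical helper. Trailing frames that do not fill a whole segment
    are dropped, which is a simplifying assumption.
    """
    if not frames:
        return []
    last_start = max(len(frames) - segment_len, 0)
    return [frames[s:s + segment_len] for s in range(0, last_start + 1, stride)]


def gate_video(segments, gating_model, threshold=0.5):
    """Apply the gating model to segments in order and stop at the first hit.

    Returns (analyze, segments_evaluated). `gating_model` is any callable
    that maps a segment to a score; the threshold is an assumption.
    """
    for count, segment in enumerate(segments, start=1):
        if gating_model(segment) >= threshold:
            # Early termination: remaining segments are never evaluated.
            return True, count
    return False, len(segments)


if __name__ == "__main__":
    frames = list(range(10))  # stand-in for ten sampled frames
    segments = split_into_segments(frames, segment_len=4, stride=2)

    def toy_model(segment):
        # Toy stand-in for a trained model: fires when frame 5 is present.
        return 1.0 if 5 in segment else 0.0

    print(gate_video(segments, toy_model))
```

In this toy run the second of four segments triggers the indication, so the last two segments are skipped, which is the saving that early termination is meant to provide.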
7. The computer-implemented method of claim 1, wherein applying the gating model comprises: applying a first model that determines a likelihood that a particular feature is present; and applying a second model that receives as input the likelihood that the particular feature is present and generates the indication of whether to analyze the video.

8. The computer-implemented method of claim 7, wherein the first model includes: a first convolutional neural network that includes a plurality of layers, trained to analyze video; a second convolutional neural network that includes a plurality of layers, trained to analyze audio; and a fusion network that includes a plurality of layers, that receives output of the first convolutional neural network and the second convolutional neural network as inputs, and provides the likelihood that the particular feature is present to the second model.

9. The computer-implemented method of claim 7, wherein the second model is implemented using one or more of heuristics, a recurrent neural network, or a Markov chain analysis technique.

10. The computer-implemented method of claim 7, further comprising providing an additional input to the second model, wherein the additional input includes one or more of: identification of a portion of a particular frame of the subset of the plurality of frames in which the particular feature is detected to be present, a duration of time in which the particular feature appears in the subset of the plurality of frames, or heuristics regarding early termination, and wherein the second model utilizes the additional input to generate the indication.

11. The computer-implemented method of claim 1, further comprising, when the indication is to analyze the video, programmatically analyzing the video to add the one or more video annotations, wherein the video annotations comprise one or more labels that are indicative of presence in the video of one or more of a face, a particular type of object, a particular type of movement, or a particular type of audio.

12. A computing device comprising: a processor; and a memory, with instructions stored thereon that, when executed by the processor cause the processor to perform operations comprising: obtaining a video comprising a plurality of frames and corresponding audio; performing sampling to select a subset of the plurality of frames based on a target frame rate that is less than or equal to a frame rate of the video; extracting a respective audio spectrogram for each frame in the subset of the plurality of frames; reducing resolution of the subset of the plurality of frames; after reducing the resolution, applying a machine-learning based gating model to the subset of the plurality of frames and corresponding audio spectrograms; and obtaining, as output of the gating model, an indication of whether to analyze the video to add one or more video annotations.

13. The computing device of claim 12, wherein the memory has further instructions stored thereon that, when executed by the processor cause the processor to perform further operations comprising, prior to applying the gating model, dividing the video into a plurality of segments, each segment including multiple frames, and wherein applying the gating model is performed iteratively over the plurality of segments in sequence, wherein the indication is generated at each iteration.

14. A computer-implemented method to train a machine-learning based gating model to generate an indication of whether to analyze a video to add annotations corresponding to a particular feature, wherein the machine-learning based gating model comprises: a first model that comprises a first convolutional neural network that generates a likelihood that the particular feature is present in a video based on video frames of the video; and a second model that receives as input the likelihood that the particular feature is present in the video and generates the indication, the method comprising: obtaining a training set comprising: a plurality of training videos, wherein each training video comprises a plurality of frames, and wherein each training video is a low-resolution, sampled version of a corresponding high-resolution video; and a plurality of training labels, each training label indicative of presence of the particular feature in the high-resolution videos corresponding to the one or more of the plurality of training videos; training the gating model, wherein the training includes, for each training video in the training set, generating, by application of the first model to the training video, a likelihood that the particular feature is present in the training video; generating based on the likelihood that the particular feature is present in the training video, by application of the second model, the indication of whether to analyze the training video to add annotations corresponding to a particular feature; generating feedback data based on the training labels associated with the corresponding high-resolution video and the indication; and providing the feedback data as a training input to the first model and to the second model.

15. The computer-implemented method of claim 14, wherein the particular feature includes at least one of a human face, a type of movement, or a type of object.

16. The computer-implemented method of claim 14, wherein the plurality of training videos in the training set include at least one video in which the particular feature is present and at least one video in which the particular feature is absent, and wherein training the gating model comprises one or more of automatically adjusting a weight of one or more nodes of the first convolutiona
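The two-stage arrangement in the claims, where a first model produces a likelihood and a second model turns it into the analyze or skip indication, can be illustrated with a heuristic second model. This is a sketch under stated assumptions: the function name `second_model`, the 0.6 threshold, and the rule requiring consecutive high likelihoods are invented for illustration, and the first model is represented only by a list of precomputed likelihoods rather than by convolutional networks.

```python
def second_model(likelihoods, threshold=0.6, min_consecutive=2):
    """Heuristic second stage: decide whether the video should be analyzed.

    `likelihoods` holds one first-model output per segment, in order.
    The rule (a run of min_consecutive values at or above threshold) is a
    hypothetical heuristic, loosely standing in for a duration-based input.
    """
    run = 0
    for likelihood in likelihoods:
        # Extend the run on a confident segment, otherwise reset it.
        run = run + 1 if likelihood >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Isolated spikes do not trigger analysis under this heuristic.
    print(second_model([0.1, 0.7, 0.2, 0.8]))
    # Two confident segments in a row do.
    print(second_model([0.1, 0.7, 0.9, 0.2]))
```

Requiring a sustained run rather than a single high score is one way a heuristic could suppress spurious detections; a recurrent network or Markov chain analysis, which the claims also name, would replace this rule with a learned or probabilistic one.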
Detecting features for summarising video content · CPC title
Human faces, e.g. facial parts, sketches or expressions · CPC title
involving operations for analysing video streams, e.g. detecting features or characteristics (television picture signal circuitry for scene change detection H04N5/147; filtering for image enhancement G06T5/00; methods or arrangements for recognising scenes G06V20/00; arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title
the classifiers operating on different input data, e.g. multi-modal recognition · CPC title
in video content (extracting overlay text G06V20/62; video retrieval G06F16/70; processing of video elementary streams in video servers H04N21/234; processing of video elementary streams in video clients H04N21/44) · CPC title