Systems and methods for performing multi-modal video datastream segmentation

US9253511B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9253511-B2
Application number  US-201414325202-A
Country             US
Kind code           B2
Filing date         Jul 7, 2014
Priority date       Apr 14, 2014
Publication date    Feb 2, 2016
Grant date          Feb 2, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are described that can provide users with personalized video content feeds. In several embodiments, a multi-modal segmentation process is utilized that relies upon cues derived from video, audio and/or text data present in a video data stream. In a number of embodiments, video streams from a variety of sources are segmented. Links are identified between video segments and between video segments and online articles containing additional information relevant to the video segments. The additional information obtained by linking a video segment to an additional source of data can be utilized in the generation of personalized playlists. In the context of news programming, the dynamic mixing and aggregation of news videos from multiple sources can greatly enrich the news watching experience. In several embodiments, processes for linking video segments to additional sources of data can be implemented as part of a video search engine service.
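
As a rough illustration of cues drawn from the three modalities, the sketch below uses cue types that the claims name: dark frames (low mean and low standard deviation of pixel intensity), speech pauses longer than a threshold, and ">>>" markers or transition phrases in closed captions. It then merges them into one time-ordered stream. The `Cue` record, all function names, the threshold values, and the two transition phrases are hypothetical choices made for this example; the patent text shown here does not specify them. The classifier that turns the fused stream into segment boundaries is not shown.

```python
import re
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Cue:
    time: float    # seconds from the start of the stream
    modality: str  # "visual", "audio", or "text"
    kind: str      # e.g. "dark_frame", "pause", "caption_marker"


def is_dark_frame(pixels, mean_max=20.0, std_max=5.0):
    """One color channel is 'dark' if both mean and spread are low.
    The threshold values are assumptions, not taken from the patent."""
    return mean(pixels) < mean_max and pstdev(pixels) < std_max


def pause_cues(speech_spans, min_pause=1.5):
    """Emit a cue at the start of each gap between consecutive
    (start, end) speech spans that is longer than min_pause seconds."""
    cues = []
    for (_, prev_end), (next_start, _) in zip(speech_spans, speech_spans[1:]):
        if next_start - prev_end > min_pause:
            cues.append(Cue(prev_end, "audio", "pause"))
    return cues


# Hypothetical transition phrases, for illustration only.
TRANSITION = re.compile(r"\b(coming up|in other news)\b", re.IGNORECASE)


def caption_cues(captions):
    """Emit cues for '>>>' markers and transition phrases in
    (time, text) caption lines."""
    cues = []
    for time, text in captions:
        if ">>>" in text:
            cues.append(Cue(time, "text", "caption_marker"))
        if TRANSITION.search(text):
            cues.append(Cue(time, "text", "transition_phrase"))
    return cues


def fuse(*cue_lists):
    """Merge per-modality cue lists into one stream ordered by time."""
    return sorted((c for cues in cue_lists for c in cues), key=lambda c: c.time)


# Toy inputs: two frames (time, pixel intensities), three speech spans,
# and two caption lines.
frames = [(0.0, [120, 130, 110, 125]), (12.0, [3, 4, 2, 3])]
visual = [Cue(t, "visual", "dark_frame") for t, px in frames if is_dark_frame(px)]
audio = pause_cues([(0.0, 5.0), (5.4, 11.5), (13.6, 20.0)])
text = caption_cues([(1.0, "GOOD EVENING."), (13.8, ">>> IN OTHER NEWS, ...")])
stream = fuse(visual, audio, text)

for cue in stream:
    print(f"{cue.time:6.1f}s  {cue.modality:<6}  {cue.kind}")
```

In this toy input the pause, the dark frame, and the caption cues all fall within about two seconds of each other, which is the kind of agreement across modalities that a downstream classifier could use as evidence of a segment boundary.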

First claim

Opening claim text (preview).

What is claimed is:

1. A multi-modal video data stream segmentation system, comprising:
    at least one processor; and
    memory containing a video segmentation application;
    wherein the video segmentation application configures at least one processor to perform a multi-modal segmentation of a video data stream including a sequence of frames of video, at least one audio track time synchronized with the sequence of frames of video, and closed caption textual data by:
    identifying visual segmentation cues within the sequence of frames of video;
    identifying audio segmentation cues within the at least one time synchronized audio track;
    performing automatic speech recognition on an audio track from the at least one audio track to generate audio track textual data that is time synchronized to the sequence of frames of video;
    identifying textual segmentation cues identified from the closed caption textual data;
    matching at least a portion of the closed caption textual data with the audio track textual data and time synchronizing the closed caption textual data to the sequence of frames of video data based upon the time synchronization of the matching audio track textual data;
    fuse the visual segmentation cues, the audio segmentation cues, and the textual segmentation cues to form a stream of segmentation cues time synchronized with the sequence of frames of video; and
    identify segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of segmentation cues.

2. The multi-modal video data stream segmentation system of claim 1, where at least one of the visual segmentation cues is from the group consisting of anchor frames, logo frames, and dark frames.

3. The multi-modal video data stream segmentation system of claim 1, wherein the visual segmentation cues include anchor frames.

4. The multi-modal video data stream segmentation system of claim 3, wherein the video segmentation application configures the at least one processor to detect anchor frames by: detecting frames in the sequence of frames of video containing a face using a face detector; determining color histograms for the detected faces; clustering the color histograms; and identifying anchor frames as frames that contain a face having a color histogram from within a dominant cluster of color histograms.

5. The multi-modal video data stream segmentation system of claim 1, wherein the visual segmentation cues include logo frames.

6. The multi-modal video data stream segmentation system of claim 5, wherein the video segmentation application configures the at least one processor to detect that a given frame from the sequence of frames of video is a logo frame by performing feature matching between a set of logo images and the given frame.

7. The multi-modal video data stream segmentation system of claim 1, wherein the video segmentation application configures the at least one processor to detect that a series of frames from the sequence of frames of video is a logo animation by performing feature matching between each of a series of logo animation frames and the corresponding frame in the series of frames.

8. The multi-modal video data stream segmentation system of claim 1, wherein the visual segmentation cues include dark frames.

9. The multi-modal video data stream segmentation system of claim 5, wherein the video segmentation application configures the at least one processor to detect that a given frame from the sequence of frames of video is a dark frame by detecting that the mean pixel intensity in at least one color channel of the frame is below a first threshold and the standard deviation of the pixel intensity in the at least one color channel is below a second threshold.

10. The multi-modal video data stream segmentation system of claim 1, wherein the audio segmentation cues include pauses in speech having a duration exceeding a threshold.

11. The multi-modal video data stream segmentation system of claim 1, wherein the textual segmentation cues include “>>>” markers within the closed caption textual data.

12. The multi-modal video data stream segmentation system of claim 1, wherein the textual segmentation cues include the presence of a predetermined transition phrase within the closed caption textual data.

13. The multi-modal video data stream segmentation system of claim 1, wherein: a plurality of the segmentation cues in the stream of segmentation cues include confidence scores; and the video segmentation application configures the at least one processor to identify segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of time stamped segmentation cues and the confidence scores.

14. The multi-modal video data stream segmentation system of claim 1, wherein the at least one classifier is selected from the group consisting of a support vector machine, a neural-network classifier, and a decision tree classifier.

15. A method of segmenting a video data stream including a sequence of frames of video, at least one audio track time synchronized with the sequence of frames of video, and closed caption textual data, the method comprising:
    identifying visual segmentation cues within the sequence of frames of video using a video data stream segmentation system;
    identifying audio segmentation cues within the at least one audio track using the video data stream segmentation system;
    performing automatic speech recognition on an audio track from the at least one audio track to generate audio track textual data that is time synchronized to the sequence of frames of video using the video data stream segmentation system;
    identifying textual segmentation cues within the closed caption textual data using the video data stream segmentation system;
    matching at least a portion of the closed caption textual data with the audio track textual data and time synchronizing the closed caption textual data to the sequence of frames of video data based upon the time synchronization of the matching audio track textual data using the video data stream segmentation system;
    fusing the visual segmentation cues, the audio segmentation cues, and the textual segmentation cues to form a stream of segmentation cues that is time synchronized with the sequence of frames of video using the video data stream segmentation system; and
    identifying segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of segmentation cues using the video data stream segmentation system.

16. The method of claim 15, where at least one of the visual segmentation cues is from the group consisting of anchor frames, logo frames, and dark frames.

17. The method of claim 15, where the audio segmentation cues include pauses in speech having a duration exceeding a threshold.

18. The method of claim 15, wherein at least one of the textual segmentation cues within the closed caption textual data is from the group consisting of “>>>” markers and a predetermined transition phrase.

19. The method of claim 15, wherein: a plurality of the segmentation cues in the stream of segmentation cues include confidence scores; and identifying segmentation boundaries between frames of video within the sequence of frames of video using at least one classifier based upon the stream of segmentation cues using the video data stream segmentation system comprises identifying segmentation boundaries
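
Claims 1 and 15 recite matching closed caption text against speech-recognition output and using the matched words to time synchronize the captions. One way this could look is sketched below, under the assumption that both texts are available as word lists and that the speech recognizer supplies a timestamp per word. The function name `align_captions`, the input shapes, and the use of Python's `difflib.SequenceMatcher` as the matching step are choices made for this example; the claims do not state which matching algorithm is used.

```python
from difflib import SequenceMatcher


def align_captions(caption_words, asr_words):
    """Give each caption word the timestamp of its matching ASR word.

    caption_words: list of str (closed caption text, no reliable timing)
    asr_words:     list of (word, time_in_seconds) from speech recognition
    Returns a list of (caption_word, time or None). Words with no match
    keep None, since only a portion of the caption text needs to match.
    """
    cap_tokens = [w.lower() for w in caption_words]
    asr_tokens = [w.lower() for w, _ in asr_words]
    times = [None] * len(caption_words)

    # Find runs of identical words shared by the two sequences.
    matcher = SequenceMatcher(a=cap_tokens, b=asr_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = asr_words[block.b + k][1]
    return list(zip(caption_words, times))


# Toy input: the recognizer misheard "voted" as "floated".
captions = ["In", "other", "news", "the", "senate", "voted", "today"]
asr = [("in", 13.8), ("other", 14.0), ("news", 14.3), ("the", 14.9),
       ("senate", 15.1), ("floated", 15.6), ("today", 16.0)]

aligned = align_captions(captions, asr)
for word, time in aligned:
    print(word, time)
```

In this example every caption word except "voted" inherits a timestamp from the recognizer output. The misrecognized word is left untimed rather than guessed, so a caption cue at that position would have to take its time from a neighboring matched word.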

Classifications

  • Physics · mapped topic

  • involving operations for analysing video streams, e.g. detecting features or characteristics (television picture signal circuitry for scene change detection H04N5/147; filtering for image enhancement G06T5/00; methods or arrangements for recognising scenes G06V20/00; arrangements characterised by components specially adapted for monitoring, identification or recognition of video in broadcast systems H04H60/59) · CPC title

  • Secondary servers, e.g. proxy server, cable television Head-end {(provisioning of proxy services in data packet switching networks H04L67/56)} · CPC title

  • Creating a channel for a dedicated end-user group, e.g. insertion of targeted commercials based on end-user profiles {(information retrieval from the Internet by querying with filtering and personalisation G06F16/9535; arrangements for replacing or switching information during the broadcast H04H20/10; push services over packet-switching network H04L12/1859; adaptation of message content in packet-switching networks H04L51/063)} · CPC title

  • involving additional data, e.g. news, sports, stocks, weather forecasts · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9253511B2 cover?
Systems and methods are described that can provide users with personalized video content feeds. In several embodiments, a multi-modal segmentation process is utilized that relies upon cues derived from video, audio and/or text data present in a video data stream. In a number of embodiments, video streams from a variety of sources are segmented. Links are identified between video segments and be…
Who is the assignee on this patent?
Chen David Mo, Chen Huizhong, Daneshi Maryam, and 6 more
What technology area does this patent fall under?
Primary CPC classification H04N21/23418. Mapped technology areas include Electricity.
When was this patent published?
Publication date Feb 2, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).