Video summarization using audio and visual cues

US10134440B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10134440-B2
Application number: US-201113099391-A
Country: US
Kind code: B2
Filing date: May 3, 2011
Priority date: May 3, 2011
Publication date: Nov 20, 2018
Grant date: Nov 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for producing an audio-visual slideshow for a video sequence having an audio soundtrack and a corresponding video track including a time sequence of image frames, comprising: segmenting the audio soundtrack into a plurality of audio segments; subdividing the audio segments into a sequence of audio frames; determining a corresponding audio classification for each audio frame; automatically selecting a subset of the audio segments responsive to the audio classification for the corresponding audio frames; for each of the selected audio segments automatically analyzing the corresponding image frames to select one or more key image frames; merging the selected audio segments to form an audio summary; forming an audio-visual slideshow by combining the selected key frames with the audio summary, wherein the selected key frames are displayed synchronously with their corresponding audio segment; and storing the audio-visual slideshow in a processor-accessible storage memory.
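The audio side of the abstract (framing, per-frame classification, segment selection, merging) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the millisecond time units, the precomputed `labels` list standing in for a real audio classifier, and the rule "keep a segment if it contains a frame with a wanted label" are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start_ms, end_ms), hypothetical representation


def split_into_frames(duration_ms: int, frame_ms: int) -> List[Span]:
    """Divide the soundtrack into fixed-length frames; the last may be shorter."""
    return [(s, min(s + frame_ms, duration_ms))
            for s in range(0, duration_ms, frame_ms)]


def select_segments(segments: List[Span], frames: List[Span],
                    labels: List[str], wanted: Set[str]) -> List[Span]:
    """Keep each segment that fully contains a frame with a wanted label.

    `labels` stands in for the output of an audio classifier (assumed here).
    """
    selected = []
    for seg_start, seg_end in segments:
        for (f_start, f_end), label in zip(frames, labels):
            if label in wanted and seg_start <= f_start and f_end <= seg_end:
                selected.append((seg_start, seg_end))
                break
    return selected


def merge_segments(selected: List[Span]) -> List[Tuple[int, int, Span]]:
    """Concatenate the selected segments into one audio summary timeline.

    Returns (summary_start_ms, summary_end_ms, source_segment) per segment.
    """
    timeline, cursor = [], 0
    for start, end in selected:
        length = end - start
        timeline.append((cursor, cursor + length, (start, end)))
        cursor += length
    return timeline


frames = split_into_frames(5000, 1000)
labels = ["music", "music", "speech", "noise", "speech"]
segments = [(0, 2000), (2000, 3000), (3000, 5000)]
selected = select_segments(segments, frames, labels, {"speech"})
summary = merge_segments(selected)
```

In this toy input the first segment contains only music frames and is dropped, so the summary is built from the remaining two segments, and each entry keeps a pointer back to its source segment so key frames can later be shown in sync with it.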

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for producing an audio-visual slideshow from a video, comprising: receiving a video sequence, the video sequence comprising image frames and a corresponding audio soundtrack; dividing the audio soundtrack into audio frames, wherein the audio frames are divided based on a predefined time interval; extracting an audio feature vector from each of the audio frames; applying an audio classification model to the audio feature vectors, wherein the audio classification model determines a corresponding audio classification for each of the audio frames; using a clustering algorithm to form audio frame clusters, wherein the audio frame clusters comprise audio frames having a same corresponding audio classification; selecting an audio frame from each of the audio frame clusters; segmenting the audio soundtrack into audio segments using a change detection operation; selecting the audio segments that contain the selected audio frames; identifying a subset of the selected audio segments, wherein the subset of the selected audio segments includes selected audio frames from a diverse set of audio frame clusters; determining which of the image frames correspond to the selected subset of audio segments; selecting key image frames from the image frames corresponding to the selected subset of audio segments, wherein the selected number of key image frames is less than the total number of image frames that correspond to the selected subset of audio segments; merging the selected subset of audio segments to form an audio summary; combining the selected key image frames with the audio summary; and displaying the selected key image frames synchronously with their corresponding audio segments.

2. The method of claim 1, wherein the audio classification for each audio frame is determined using one or more audio classification models trained using a ground-truth data set.

3. The method of claim 2, wherein the one or more audio classification models comprises a support vector machine (SVM) model.

4. The method of claim 2, wherein a set of audio classification models are used to determine classification scores for each of a predetermined subset of the diverse set of audio frame clusters.

5. The method of claim 1, wherein the clustering algorithm comprises a K-means algorithm.

6. The method of claim 1, wherein identifying the subset of the selected audio segments includes: for each audio frame cluster, selecting an audio frame corresponding to each relevant audio classification; and selecting the audio segments that include the selected audio frames.

7. The method of claim 1, wherein the change detection operation to identify identifies appropriate audio segment boundaries corresponding to substantial changes in audio characteristics.

8. The method of claim 7, wherein applying the change detection operation comprises applying a Bayesian information criterion.

9. The method of claim 1, further comprising expanding the selected audio segments by appending to the selected audio segments one or more other audio segments having similar audio characteristics to the selected audio segments.

10. The method of claim 1, wherein selecting the key image frames comprises: identifying an image frame subset corresponding to a particular audio segment; determining one or more visual quality scores for each of the image frames in the image frame subset; and selecting one or more key image frames from the image frame subset responsive to the one or more visual quality scores.

11. The method of claim 10, wherein the image frame subset includes a sampling of the image frames corresponding to the particular audio segment.

12. The method of claim 10, wherein the one or more visual quality scores include a facial quality score and an overall image quality score.

13. The method of claim 12, wherein a determination of the facial quality score for a particular image frame comprises: analyzing the particular image frame using a face detection process to detect the presence of any faces; determining visual feature vectors for the detected presence of faces; and determining the facial quality score responsive to the visual feature vectors.

14. The method of claim 10, wherein the key image frames are selected according to a visual diversity criterion.

15. The method of claim 10, wherein selecting the one or more key image frames from the image frame subset responsive to the one or more visual quality scores comprises: identifying a set of candidate key image frames having the highest visual quality scores; determining a visual feature vector for each of the candidate key image frames; computing visual distance values between the candidate key image values responsive to the visual feature values; and selecting a subset of the candidate key image frames to be the key image frames responsive to the visual distance values.

16. The method of claim 15, wherein the selected number of key image frames are selected such that each of the selected number of key image frames are separated by a visual distance value that exceeds a predefined threshold visual distance value.

17. The method of claim 1, wherein the selected number of key image frames are sorted into chronological order.

18. The method of claim 1, wherein combining the selected key image frames with the audio summary forms an audio-visual slideshow and the audio-visual slideshow is stored in a video file using a video file format adapted to be played using a standard video player.

19. The method of claim 1, wherein each of the selected number of key image frames is displayed for a time interval, and wherein the time interval is determined by dividing the length of the selected audio segment by the selected number of key image frames for the respective selected audio segment.
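The key-frame steps in claims 15 to 17 and the display timing in claim 19 can be illustrated with a short sketch. It is only one possible reading: the greedy selection order, the Euclidean distance over hypothetical feature vectors, and the names `select_key_frames`, `display_interval`, `num_candidates`, and `min_distance` are assumptions for the example and are not taken from the patent.

```python
import math
from typing import List, Sequence


def select_key_frames(scores: Sequence[float],
                      features: Sequence[Sequence[float]],
                      num_candidates: int,
                      min_distance: float) -> List[int]:
    """Pick visually diverse, high-quality frames; returns frame indices.

    1. Take the `num_candidates` frames with the highest quality scores.
    2. Greedily keep a candidate only if its feature vector is farther than
       `min_distance` (Euclidean, an assumed metric) from every kept frame.
    3. Return the kept indices in chronological (index) order.
    """
    candidates = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)[:num_candidates]
    kept: List[int] = []
    for i in candidates:
        if all(math.dist(features[i], features[j]) > min_distance for j in kept):
            kept.append(i)
    return sorted(kept)


def display_interval(segment_length_s: float, num_key_frames: int) -> float:
    """Per-frame display time: segment length divided by key-frame count."""
    if num_key_frames <= 0:
        raise ValueError("at least one key frame is required")
    return segment_length_s / num_key_frames


# Toy data: frames 0 and 1 are near-duplicates, frame 3 has low quality.
scores = [0.9, 0.8, 0.95, 0.2]
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 9.0)]
key_frames = select_key_frames(scores, features, num_candidates=3, min_distance=1.0)
seconds_each = display_interval(6.0, len(key_frames))
```

With this toy data, frame 3 never becomes a candidate because of its low score, and frame 1 is rejected as too close to frame 0, so two frames remain and each is shown for half of the 6-second segment.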

Assignees

Inventors

Classifications

  • G11B27/034 (Primary)

    on discs (G11B27/036, G11B27/038 take precedence) · CPC title

  • by using information not detectable on the record carrier · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10134440B2 cover?
A method for producing an audio-visual slideshow for a video sequence having an audio soundtrack and a corresponding video track including a time sequence of image frames, comprising: segmenting the audio soundtrack into a plurality of audio segments; subdividing the audio segments into a sequence of audio frames; determining a corresponding audio classification for each audio frame; automatica…
Who is the assignee on this patent?
Jiang Wei, Loui Alexander C, Cotton Courtenay, and 1 more
What technology area does this patent fall under?
Primary CPC classification G11B27/034. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 20, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).