Frame Level And Video Level Text Detection In Video

US2020349381A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2020349381-A1
Application numberUS-201916399823-A
CountryUS
Kind codeA1
Filing dateApr 30, 2019
Priority dateApr 30, 2019
Publication dateNov 5, 2020
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In some embodiments, a method detects a first set of frames in a video that include lines of text, the detecting performed at a frame level on each individual frame. A first representation is generated from the first set of frames and a second representation is generated from the first set of frames. The method filters the first representation based on a number of lines of text within a space in the space dimension to select a second set of frames and filters the second representation based on a number of frames within time intervals in the time dimension to select a third set of frames. Frames in both the second set of frames and the third set of frames are analyzed to determine whether the lines of text in both the second set of frames and the third set of frames are burned-in subtitles.

First claim

Opening claim text (preview).

What is claimed is: 1 . A method comprising: detecting, by a computing device, a first set of frames in a video that include lines of text, the detecting performed at a frame level on each individual frame; generating, by the computing device, a first representation from at least a first portion of the first set of frames and a second representation from at least a second portion of the first set of frames, the first representation generated based on a space dimension and the second representation generated based on a time dimension; filtering, by the computing device, the first representation based on a number of lines of text within a space in the space dimension to select a second set of frames; filtering, by the computing device, the second representation based on a number of frames within one or more time intervals in a plurality of time intervals in the time dimension to select a third set of frames; and analyzing, by the computing device, frames in both the second set of frames and the third set of frames to determine whether the lines of text in both the second set of frames and the third set of frames are burned-in subtitles. 2 . The method of claim 1 , wherein detecting the first set of frames comprises analyzing each frame for lines of text without using information from other frames. 3 . The method of claim 1 , wherein generating the first representation and the second representation comprises analyzing multiple frames from the first set of frames. 4 . The method of claim 1 , wherein detecting the first set of frames comprises: predicting whether lines of characters are present in a frame; predicting character positions from the lines of text; and connecting the character positions together to form lines of text. 5 . The method of claim 1 , wherein filtering the first representation based on the number of lines of text within the space in the space dimension comprises: summarizing the number of lines of text found in different spaces in the at least the first portion of the first set of frames; and selecting the second set of frames based on the number of lines of text found in the space. 6 . The method of claim 1 , wherein filtering the first representation based on the number of lines of text within the space in the space dimension comprises: generating a structure that includes a value for a number of text lines found in different spaces in the at least the first portion of the first set of frames; and selecting the second set of frames based on values for the number of text lines found in the space. 7 . The method of claim 1 , wherein filtering the second representation based on the number of frames within the one or more time intervals in the plurality of time intervals in the time dimension comprises summarizing a number of lines of text in the at least the second portion of the first set of frames that are found in different time intervals; and selecting the third set of frames based on the number of lines of text in the one or more time intervals. 8 . The method of claim 1 , wherein filtering the second representation based on the number of frames within the time interval in the plurality of time intervals in the time dimension comprises generating a structure that includes a value for a number of lines of text in the at least the first portion of the first set of frames that are found in different time intervals in the plurality of time intervals; and selecting the third set of frames based on the number of lines of text in the one or more time intervals. 9 . The method of claim 1 , wherein filtering the second representation based on the number of frames within the one or more time intervals in the plurality of time intervals in the time dimension comprises summarizing a number of frames in the at least the first portion of the first set of frames that are found in different time intervals in the plurality of time intervals; and selecting the third set of frames based on the number of frames in the one or more time intervals. 10 . The method of claim 1 , wherein analyzing the frames in both the second set of frames and the third set of frames to determine whether the lines of text in both the second set of frames and the third set of frames are burned-in subtitles comprises: comparing a number of the frames that are included in both the second set of frames and the third set of frames to a threshold; determining that the video includes burned-in subtitles when the number of frames is above the threshold; and determining that the video does not include burned-in subtitles when the number of frames is above the threshold. 11 . The method of claim 1 , further comprising: when frames that are included in both the second set of frames and the third set of frames are determined to include burned-in subtitles, analyzing the burned-in subtitles to determine a language type of the burned-in subtitles. 12 . The method of claim 11 , wherein analyzing the burned-in subtitles comprises: analyzing the burned-in subtitles for a character type; and when the character type is associated with a language in a plurality of languages, outputting the language associated with the character type. 13 . The method of claim 12 , wherein analyzing the burned-in subtitles comprises: when the character type is associated with multiple languages, analyzing words of the burned-in subtitles to select a language within the multiple languages. 14 . The method of claim 1 , wherein generating the first representation from at least the first portion of the first set of frames and the second representation from at least the second portion of the first set of frames comprises: generating the first representation from the first set of frames; and generating the second representation from the second set of frames. 15 . The method of claim 1 , wherein generating the first representation from at least the first portion of the first set of frames and the second representation from at least the second portion of the first set of frames comprises: generating the second representation from the first set of frames; and generating the first representation from the second set of frames. 16 . A non-transitory computer-readable storage medium containing instructions, that when executed, control a computer system to be operable for: detecting a first set of frames in a video that include lines of text, the detecting performed at a frame level on each individual frame; generating a first representation from at least a first portion of the first set of frames and a second representation from at least a second portion of the first set of frames, the first representation generated based on a space dimension and the second representation generated based on a time dimension; filtering the first representation based on a number of lines of text within a space in the space dimension to select a second set of frames; filtering the second representation based on a number of frames within one or more time intervals in a plurality of time intervals in the time dimension to select a third set of frames; and analyzing frames in both the second set of frames and the third set of frames to determine whether the lines of text in both the second set of frames and the third set of frames are burned-in subtitles. 17 . The non-transitory computer-readable storage medium of claim 16 , wherein: detecting the first set of frames comprises analyzing each frame for lines of text without using information from other frames, and generating the firs

Assignees

Inventors

Classifications

  • of Kanji, Hiragana or Katakana characters · CPC title

  • Overlay text, e.g. embedded captions in a TV programme · CPC title

  • Classification techniques · CPC title

  • G06V10/82Primary

    using neural networks · CPC title

  • using recognition of characters or words · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2020349381A1 cover?
In some embodiments, a method detects a first set of frames in a video that include lines of text, the detecting performed at a frame level on each individual frame. A first representation is generated from the first set of frames and a second representation is generated from the first set of frames. The method filters the first representation based on a number of lines of text within a space i…
Who is the assignee on this patent?
Hulu Llc
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Nov 05 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).