What technology area does this patent fall under?

Primary CPC classification: G06V20/49 (segmenting video sequences). Mapped technology area: Physics.

When was this patent published?

Publication date: Feb 3, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Unsupervised cue point discovery for episodic content

US12541977B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12541977-B2
Application number	US-202318205802-A
Country	US
Kind code	B2
Filing date	Jun 5, 2023
Priority date	Jun 5, 2023
Publication date	Feb 3, 2026
Grant date	Feb 3, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein are system, apparatus, article of manufacture, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof, for cue point discovery for content. For example, system, apparatus, article of manufacture, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof are provided for using unsupervised machine learning to automatically classify cue points for episodic content. The cue points can be associated with an opening credits section, an end credits section, a recap section, or a behind-the-scenes section.

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: dividing, by at least one computer processor, a video associated with an episode of an episodic content into a plurality of sections; determining a representation for each of the plurality of sections; comparing a first representation of a first section of the plurality of sections of the video with a plurality of representations, wherein the plurality of representations are associated with one or more sections of one or more episodes of the episodic content; determining a plurality of similarity values for the first representation based on the comparison; determining one or more of the plurality of similarity values that satisfy a condition; determining a temporal position of the first representation in response to the one or more of the plurality of similarity values satisfying the condition; and determining a type of the first section of the plurality of sections of the video by comparing the temporal position with a temporal position threshold and further based at least one or more of first region information of a first geographical region where the episodic content is produced or second region information of a second geographical region where the episodic content is being shown, wherein the type of the first section comprises an opening credits section, an end credits section, a recap section, or a behind-the-scenes section, and wherein the temporal position threshold depends on at least one or more of the first region information or the second region information.

2. The computer-implemented method of claim 1, wherein the representation comprises an image embedding, an audio embedding, a text embedding, or a combination of two or more of the image embedding, the audio embedding, and the text embedding.

3. The computer-implemented method of claim 1, wherein determining the type of the first section further comprises using one or more temporal positions corresponding to one or more of the plurality of representations associated with the one or more of the plurality of similarity values that satisfy the condition.

4. The computer-implemented method of claim 1, wherein using the temporal position associated with the first representation comprises: determining that the type of the first section comprises the opening credits section in response to the temporal position being before the first temporal position threshold.

5. The computer-implemented method of claim 1, wherein using the temporal position associated with the first representation comprises: determining that the type of the first section comprises the end credits section in response to the temporal position being after the first temporal position threshold.

6. The computer-implemented method of claim 1, wherein determining the type of the first section further comprises: using a text detection method to determine text within the first section of the plurality of sections of the video; and using the determined text to determine that the type of the first section comprises the end credits section.

7. The computer-implemented method of claim 1, wherein determining the type of the first section further comprises: using production information associated with the episodic content to determine the type of the first section.

8. The computer-implemented method of claim 1, wherein determining one or more of the plurality of similarity values that satisfy the condition comprises: comparing the plurality of similarity values with a second threshold; and determining that the one or more of the plurality of similarity values are greater than the second threshold.

9. The computer-implemented method of claim 1, wherein determining the representation, comparing the first representation with the plurality of representations, determining the plurality of similarity values, and determining the one or more of the plurality of similarity values satisfying the condition are part of an unsupervised machine learning model.

10. The computer-implemented method of claim 1, further comprising: determining two or more sections of the plurality of sections of the video that have similarity values that satisfy the condition; determining a number of the two or more sections; and in response to the number of the two or more sections satisfying a second threshold, using temporal positions associated with the two or more sections to determine a type of the two or more sections of the plurality of sections of the video.

11. A system, comprising: one or more memories; and at least one processor each coupled to at least one of the memories and configured to perform operations comprising: dividing a video associated with an episode of an episodic content into a plurality of sections; determining a representation for each of the plurality of sections; comparing a first representation of a first section of the plurality of sections of the video with a plurality of representations, wherein the plurality of representations are associated with one or more sections of one or more episodes of the episodic content; determining a plurality of similarity values for the first representation based on the comparison; determining one or more of the plurality of similarity values that satisfy a condition; determining a temporal position of the first representation in response to the one or more of the plurality of similarity values satisfying the condition; and determining a type of the first section of the plurality of sections of the video by comparing the temporal position with a temporal position threshold and further based at least one or more of first region information of a first geographical region where the episodic content is produced or second region information of a second geographical region where the episodic content is being shown, wherein the type of the first section comprises an opening credits section, an end credits section, a recap section, or a behind-the-scenes section, wherein the temporal position threshold depends on at least one or more of the first region information or the second region information.

12. The system of claim 11, wherein: the representation comprises an image embedding, an audio embedding, a text embedding, or a combination of two or more of the image embedding, the audio embedding, and the text embedding; and determining the type of the first section further comprises using one or more temporal positions corresponding to one or more of the plurality of representations associated with the one or more of the plurality of similarity values that satisfy the condition.

13. The system of claim 11, wherein using the temporal position associated with the first representation comprises: determining that the type of the first section comprises the opening credits section in response to the temporal position being before the first temporal position threshold.

14. The system of claim 11, wherein using the temporal position associated with the first representation comprises: determining that the type of the first section comprises the end credits section in response to the temporal position being after the first temporal position threshold.

15. The system of claim 11, wherein determining the type of the first section further comprises: using a text detection method to determine text within the first section of the plurality of sections of the video; and using the determined text to determine that the type of the first section comprises the end credits section.

16. The system of clai
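
To make the flow of claim 1 easier to follow, here is a minimal Python sketch of the steps it recites: compare one section's representation against sections from other episodes, keep the similarity values above a threshold (claim 8), and then use the section's temporal position against a region-dependent threshold to pick a type (claims 4 and 5). This is an illustration only, not the patented implementation. The claims do not name a similarity measure, so cosine similarity is an assumption, as are all names (`Section`, `classify_section`, `POSITION_THRESHOLD`), the region codes, and every numeric value. Only the opening and end credits types are modeled.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional


@dataclass
class Section:
    episode: str            # hypothetical episode identifier
    start: float            # section start, in seconds from episode start
    embedding: List[float]  # representation of the section (e.g. an image embedding)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumed here, the claims do not specify a measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical region-dependent temporal position thresholds,
# expressed as a fraction of the episode's running time.
POSITION_THRESHOLD = {"US": 0.5, "JP": 0.4}


def classify_section(section: Section, duration: float, others: List[Section],
                     region: str, sim_threshold: float = 0.9) -> Optional[str]:
    """Return a section type, or None if the section does not recur elsewhere."""
    # Similarity values against sections of other episodes of the same content.
    sims = [cosine(section.embedding, o.embedding)
            for o in others if o.episode != section.episode]
    # Condition: at least one similarity value exceeds the threshold.
    if not any(s > sim_threshold for s in sims):
        return None
    # Temporal position is only evaluated once the condition is satisfied.
    position = section.start / duration
    threshold = POSITION_THRESHOLD.get(region, 0.5)
    return "opening credits" if position < threshold else "end credits"


# Made-up two-dimensional embeddings for a 30-minute (1800 s) episode.
other_episode = [Section("e2", 0.0, [1.0, 0.0]), Section("e2", 1700.0, [0.0, 1.0])]
intro = Section("e1", 10.0, [0.9, 0.1])
outro = Section("e1", 1750.0, [0.05, 1.0])
scene = Section("e1", 600.0, [0.7, 0.7])

for s in (intro, outro, scene):
    print(s.start, classify_section(s, 1800.0, other_episode, region="US"))
```

In this sketch a section that matches nothing in other episodes is left unclassified, which reflects the claim language that the temporal position is determined "in response to" the similarity condition being satisfied. The recap and behind-the-scenes types, and the text detection of claim 6, would need further signals that are not modeled here.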

Assignees

Roku Inc

Inventors

Classifications

G06V20/63
Scene text, e.g. street names · CPC title
G06V10/761
Proximity, similarity or dissimilarity measures · CPC title
G06V10/751
Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching · CPC title
G06V20/49 · Primary
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title
G06V20/41
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

Patent family

Related publications grouped by family.

View patent family 91431389

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12541977B2 cover?: Disclosed herein are system, apparatus, article of manufacture, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof, for cue point discovery for content. For example, system, apparatus, article of manufacture, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof are provided for using unsupervised m…
Who is the assignee on this patent?: Roku Inc