What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 14, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Media system with closed-captioning data and/or subtitle data generation features

US12198700B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12198700-B2
Application number	US-202318328358-A
Country	US
Kind code	B2
Filing date	Jun 2, 2023
Priority date	Jun 2, 2023
Publication date	Jan 14, 2025
Grant date	Jan 14, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one aspect, an example method includes (i) obtaining media, wherein the obtained media includes (a) audio representing speech and (b) video; (ii) using at least the audio representing speech as a basis to generate speech text; (iii) using at least the audio representing speech to determine starting and ending time points of the speech; and (iv) using at least the generated speech text and the determined starting and ending time points of the speech to (a) generate closed-captioning or subtitle data that includes closed-captioning or subtitle text based on the generated speech text and (b) associating the generated closed-captioning or subtitle data with the obtained media, such that the closed-captioning or subtitle text is time-aligned with the video based on the determined starting and ending time points of the speech.
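The last step of the abstract, turning speech text plus starting and ending time points into time-aligned caption data, can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `SpeechSegment` container and both function names are hypothetical, the speech text and time points are assumed to come from some earlier speech-to-text step, and SubRip (.srt) is chosen here only as a familiar plain-text subtitle format.

```python
from dataclasses import dataclass


@dataclass
class SpeechSegment:
    """One stretch of speech (hypothetical container, not from the patent)."""
    start: float  # seconds from the start of the media
    end: float    # seconds from the start of the media
    text: str     # speech text produced by some speech-to-text step


def format_srt_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm form used by SubRip (.srt) files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[SpeechSegment]) -> str:
    """Build SRT cue blocks so each text is shown between its start and end."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        if seg.end <= seg.start:
            raise ValueError(f"segment {index} ends before it starts")
        timing = f"{format_srt_timestamp(seg.start)} --> {format_srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

Calling `segments_to_srt` on a list of segments yields text that could be stored alongside the media as a caption track, which loosely corresponds to associating the generated data with the obtained media.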

First claim

Opening claim text (preview).

The invention claimed is:

1. A method comprising: obtaining media, wherein the obtained media includes (i) audio representing speech, (ii) video, and (iii) metadata associated with the obtained media; using at least the audio representing speech as a basis to generate speech text, wherein using at least the audio representing speech as the basis to generate the speech text comprises: (i) providing to a trained model, at least audio data for the audio representing speech and the metadata associated with the obtained media, wherein the metadata associated with the obtained media includes a rating of the obtained media; and (ii) responsive to the providing, receiving from the trained model, generated speech text generated by the trained model; using at least the audio representing speech as a basis to determine starting and ending time points of the speech; and using at least the generated speech text and the determined starting and ending time points of the speech to (i) generate closed-captioning or subtitle data that includes closed-captioning or subtitle text based on the generated speech text and (ii) associating the generated closed-captioning or subtitle data with the obtained media, such that the closed-captioning or subtitle text is time-aligned with the video based on the determined starting and ending time points of the speech.

2. The method of claim 1, further comprising: extracting from the obtained media, the audio representing speech, wherein using the audio representing speech as the basis to generate the speech text comprises using the extracted audio representing speech as the basis to generate the speech text.

3. The method of claim 1, wherein using at least the audio representing speech as the basis to generate the speech text comprises: using at least (i) the audio representing speech and (ii) mouth movement depictions of the video, as the basis to generate the speech text.

4. The method of claim 1, wherein using at least the audio representing speech to determine starting and ending time points of the speech comprises: providing to a trained model, at least audio data for the audio representing speech; and responsive to the providing, receiving from the trained model, starting and ending time points determined by the trained model.

5. The method of claim 1, wherein the obtained media further includes audio representing a sound effect, and wherein the method further comprises: using at least the audio representing the sound effect as a basis to generate sound effect description text.

6. The method of claim 5, wherein using at least the audio representing the sound effect as the basis to generate the sound effect description text comprises: providing to a trained model, at least audio data for the audio representing the sound effect; and responsive to the providing, receiving from the trained model, sound effect description text generated by the trained model.

7. The method of claim 1, wherein the generated closed-captioning or subtitle data is generated closed-captioning data, and wherein associating the generated closed-captioning data with the obtained media comprises: storing the generated closed-captioning data as metadata associated with the obtained media.

8. The method of claim 7, further comprising: transmitting to a media-presentation device, the obtained media and the generated closed-captioning data as metadata of the media, wherein the media-presentation device is configured to (i) receive the transmitted media and closed-captioning data as metadata of the media, and (ii) present the received media with closed-captioning text overlaid thereon in accordance with the received closed-captioning data.

9. The method of claim 1, wherein the generated closed-captioning or subtitle data is generated subtitle data, and wherein associating the generated subtitle data with the obtained media comprises: modifying the obtained media by overlaying on it subtitle text in accordance with the subtitle data.

10. The method of claim 9, further comprising: transmitting to a media-presentation device, the modified media, wherein the media-presentation device is configured to receive and output for presentation the modified media.

11. The method of claim 1, wherein the generated closed-captioning or subtitle data is generated closed-captioning data, and wherein the method further comprises: outputting for presentation, by a media presentation device, media with closed-captioning text overlaid thereon in accordance with the closed-captioning data.

12. The method of claim 1, wherein the generated closed-captioning or subtitle data is generated subtitle data, and wherein the method further comprises: outputting for presentation, by a media presentation device, media modified to include subtitle text in accordance with the subtitle data.

13. The method of claim 1, further comprising: determining that the speech was spoken by a character associated with a given region within the video; and based on the determining, outputting the speech text in or near that given region.

14. The method of claim 1, further comprising: determining that the speech was spoken by a given character; and based on the determining, outputting the speech text in a font color associated with the given character.

15. A computing system comprising a processor and a non-transitory computer-readable storage medium having stored thereon program instructions that upon execution by the processor, cause the computing system to perform a set of acts comprising: obtaining media, wherein the obtained media includes (i) audio representing speech, (ii) video, and (iii) metadata associated with the obtained media; using at least the audio representing speech as a basis to generate speech text, wherein using at least the audio representing speech as the basis to generate the speech text comprises: (i) providing to a trained model, at least audio data for the audio representing speech and the metadata associated with the obtained media, wherein the metadata associated with the obtained media includes a rating of the obtained media; and (ii) responsive to the providing, receiving from the trained model, generated speech text generated by the trained model; using at least the audio representing speech to determine starting and ending time points of the speech; and using at least the generated speech text and the determined starting and ending time points of the speech to (i) generate closed-captioning or subtitle data that includes closed-captioning or subtitle text based on the generated speech text and (ii) associating the generated closed-captioning or subtitle data with the obtained media, such that the closed-captioning or subtitle text is time-aligned with the video based on the determined starting and ending time points of the speech.

16. The computing system of claim 15, wherein the generated closed-captioning or subtitle data is generated closed-captioning data, and wherein associating the generated closed-captioning data with the obtained media comprises: storing the generated closed-captioning data as metadata associated with the obtained media.

17. A non-transitory computer-readable storage medium having stored thereon program instructions that upon execution by a processor, cause a computing system to perform a set of acts comprising: obtaining media, wherein the obtained media includes (i) audio representing speech, (ii) video, and (iii) metadata associated with the obtained media; using at least the audio representing speech as a basis to ge
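The claims leave open how starting and ending time points are determined; claim 4 recites a trained model. As a deliberately simpler stand-in, and not the claimed technique, the Python sketch below marks speech intervals wherever short-frame energy stays above a fixed threshold. The function names, the frame size, and the threshold value are all assumptions made for illustration.

```python
import math


def frame_rms(samples: list[float], frame_size: int) -> list[float]:
    """Root-mean-square energy of each full, non-overlapping frame."""
    energies = []
    for offset in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[offset:offset + frame_size]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return energies


def speech_intervals(samples: list[float], sample_rate: int,
                     frame_size: int = 160, threshold: float = 0.1):
    """Return (start_seconds, end_seconds) pairs for runs of loud frames.

    A plain energy threshold, used here only as a placeholder for whatever
    model would actually decide where speech starts and ends.
    """
    frame_seconds = frame_size / sample_rate
    intervals = []
    run_start = None  # index of the first frame of the current loud run
    for i, energy in enumerate(frame_rms(samples, frame_size)):
        if energy >= threshold and run_start is None:
            run_start = i
        elif energy < threshold and run_start is not None:
            intervals.append((run_start * frame_seconds, i * frame_seconds))
            run_start = None
    if run_start is not None:
        # The audio ended while a loud run was still open.
        n_frames = len(samples) // frame_size
        intervals.append((run_start * frame_seconds, n_frames * frame_seconds))
    return intervals
```

Each returned pair could serve as the starting and ending time points for one caption cue. A threshold this simple would also fire on music or sound effects, which is one reason a trained model may be preferable in practice.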

Assignees

Roku Inc

Inventors

Classifications

G10L25/78
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
G10L15/25
using position of the lips, movement of the lips or face analysis · CPC title
H04N21/4884
for displaying subtitles · CPC title
G10L25/57
for processing of video signals · CPC title
H04N21/4307
Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen · CPC title

Patent family

Related publications grouped by family.

View patent family 93652542

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12198700B2 cover?: In one aspect, an example method includes (i) obtaining media, wherein the obtained media includes (a) audio representing speech and (b) video; (ii) using at least the audio representing speech as a basis to generate speech text; (iii) using at least the audio representing speech to determine starting and ending time points of the speech; and (iv) using at least the generated speech text and th…
Who is the assignee on this patent?: Roku Inc