Language agnostic missing subtitle detection

US11538461B1 · US · B1

Patent metadata
Publication number: US-11538461-B1
Application number: US-202117249930-A
Country: US
Kind code: B1
Filing date: Mar 18, 2021
Priority date: Mar 18, 2021
Publication date: Dec 27, 2022
Grant date: Dec 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Some implementations include methods for detecting missing subtitles associated with a media presentation and may include receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a recurrent neural network and a convolutional neural network to identify refined speech segments associated with the audio sequence, the recurrent neural network trained based on a plurality of languages, the convolutional neural network trained based on a plurality of categories of sound; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.
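The final step of the abstract, comparing speech timestamps against subtitle timestamps, can be illustrated with a small interval comparison. This is a minimal sketch and not the patented implementation: the function name `uncovered_speech`, the `(start, end)` tuples in seconds, and the `min_gap` tolerance are all assumptions introduced for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds), hypothetical format


def uncovered_speech(speech: List[Interval], subtitles: List[Interval],
                     min_gap: float = 0.5) -> List[Interval]:
    """Return the parts of each speech interval that no subtitle interval covers.

    `min_gap` is an assumed tolerance: uncovered stretches shorter than this
    are ignored so that small timing offsets are not reported.
    """
    missing: List[Interval] = []
    subs = sorted(subtitles)
    for start, end in sorted(speech):
        cursor = start  # earliest point in this speech interval not yet covered
        for sub_start, sub_end in subs:
            if sub_end <= cursor:
                continue  # subtitle ends before the uncovered region begins
            if sub_start >= end:
                break  # remaining subtitles start after this speech interval
            if sub_start - cursor >= min_gap:
                missing.append((cursor, sub_start))
            cursor = max(cursor, sub_end)
        if end - cursor >= min_gap:
            missing.append((cursor, end))
    return missing


if __name__ == "__main__":
    speech = [(0.0, 4.0), (10.0, 12.0)]
    subtitles = [(0.0, 2.0)]
    print(uncovered_speech(speech, subtitles))
```

Each returned interval is a candidate location for a missing subtitle; a real pipeline would likely turn these into a notification or report.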

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: selecting a media presentation that includes a subtitle component and an audio component, the audio component including an audio sequence with associated audio timestamps, the subtitle component including subtitles with associated subtitle timestamps; dividing the audio sequence into a plurality of audio segments with an overlap between consecutive audio segments, each of the plurality of audio segments having a first duration; evaluating each of the plurality of audio segments using a voice activity detection (VAD) network to identify speech segments and non-speech segments; combining consecutive speech segments to form a plurality of combined speech segments, each of the plurality of the combined speech segments having a second duration, wherein the second duration is configured to be longer than the first duration; classifying each of the plurality of combined speech segments by an audio classification (AC) network to a category of sound from a plurality of categories of sound; based on the classifying by the AC network, identifying one or more of the combined speech segments classified to a non-speech category of sound; identifying one or more of the combined speech segments as speech based on the one or more combined speech segments classified to the non-speech category of sound; determining first audio timestamp of the audio timestamps associated with the one or more combined speech segments identified as speech; and generating a notification indicating missing subtitles from the subtitle component based on comparing the first audio timestamp associated with the one or more combined speech segments identified as speech and the subtitle timestamps associated with the subtitle component.

2. The method of claim 1, wherein the VAD network is configured to perform operations associated with a recurrent neural network, and wherein the AC network is configured to perform operations associated with a convolutional neural network.

3. The method of claim 2, wherein the determining the timestamps associated with the one or more combined speech segments identified as speech comprises determining a beginning timestamp and an ending timestamp for each of the one or more combined speech segments identified as speech, and wherein the generating of the notification indicating the missing subtitles from the subtitle component comprises comparing the beginning timestamp and the ending timestamp for each of the one or more combined speech segments identified as speech with a beginning timestamp and an ending timestamp for each subtitle in the subtitle component to identify the missing subtitles from the subtitle component.

4. A computer-implemented method comprising: receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a first neural network and a second neural network to identify refined speech segments associated with the audio sequence, the first neural network trained based on a plurality of languages, the second neural network trained based on a plurality of categories of sound, wherein the first neural network evaluates a first set of audio segments of the plurality of audio segments, each audio segment in the first set of audio segments having a first duration, and wherein the second neural network evaluates a second set of audio segments of the plurality of audio segments, each audio segment in the second set of audio segments having a second duration, the second duration being longer than the first duration; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.

5. The method of claim 4, wherein the first neural network is a recurrent neural network and the second neural network is a convolutional neural network, and wherein the evaluating of the plurality of audio segments using the combination of the first neural network and the second neural network comprises evaluating the plurality of audio segments using the first neural network in sequence with the second neural network.

6. The method of claim 5, wherein each of the plurality of audio segments has a first duration, and wherein an overlap exists between two consecutive audio segments.

7. The method of claim 6, wherein the recurrent neural network is configured to identify each of the plurality of audio segments as either a speech segment or a non-speech segment, wherein consecutive speech segments identified by the recurrent neural network are combined to generate a plurality of combined speech segments, each of the plurality of combined speech segments having a second duration, the second duration being longer than the first duration.

8. The method of claim 7, wherein each of the plurality of combined speech segments is non-overlapping with any other combined speech segment, and wherein the convolutional neural network is configured to classify each of the plurality of the combined speech segments to a category of sound to identify the refined speech segments.

9. The method of claim 4, wherein the first neural network is a recurrent neural network and the second neural network is a convolutional neural network, and wherein the first neural network is configured to evaluate the plurality of audio segments independently of the second neural network.

10. The method of claim 9, wherein consecutive audio segments in the first set of audio segments are partially overlapped, and wherein the recurrent neural network is configured to identify each audio segment in the first set of audio segments as either a speech segment or a non-speech segment.

11. The method of claim 10, wherein the convolutional neural network is configured to classify each audio segment in the second set of audio segments to a category of sound from a plurality of categories of sound and to identify whether an audio segment in the second set of audio segment is a speech segment or a non-speech segment based on a probability value associated with a category of sound.

12. The method of claim 11, wherein the speech segments associated with the first set of audio segments and the speech segments associated with the second set of audio segments are used to identify the refined speech segments associated with the audio sequence.

13. A system comprising: memory configured to store computer-executable instructions; and at least one computer processor configured to access the memory and execute the computer-executable instructions to: receive an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluate the plurality of audio segments using a combination of a first neural network and a second neural network to identify refined speech segments associated with the audio sequence, the first neural network trained based on a plurality of languages, the second neural network trained based on a plurality of categories of sound, wherein the first neural network evaluates a first set of audio segments of the plurality of audio segments, each audio segment in the first set of audio segments having a first duration, and wherein the second neural network evaluates a second set of audio segments of the plurality of audio segme
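The sequential flow described in claims 1 and 5 to 8 (overlapping windows, per-window speech decisions, merging consecutive speech windows, then filtering by sound category) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the VAD network is replaced by a plain list of boolean flags, the AC network by a caller-supplied function, and the names `split_with_overlap`, `merge_speech`, `refine`, the window sizes, and the category labels are all hypothetical.

```python
from typing import Callable, List, Set, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds), hypothetical format


def split_with_overlap(total: float, length: float, hop: float) -> List[Interval]:
    """Cut [0, total] into windows of `length` seconds starting every `hop` seconds.

    Because hop < length, consecutive windows overlap.
    """
    if not 0 < hop < length:
        raise ValueError("hop must be positive and shorter than length")
    windows: List[Interval] = []
    i = 0
    while i * hop < total:
        start = i * hop
        windows.append((start, min(start + length, total)))
        i += 1
    return windows


def merge_speech(windows: List[Interval], flags: List[bool]) -> List[Interval]:
    """Merge runs of consecutive windows flagged as speech into longer segments.

    `flags` stands in for the per-window output of a VAD network.
    """
    merged: List[Interval] = []
    current = None
    for (start, end), is_speech in zip(windows, flags):
        if is_speech:
            current = (current[0], end) if current else (start, end)
        elif current:
            merged.append(current)
            current = None
    if current:
        merged.append(current)
    return merged


def refine(segments: List[Interval], classify: Callable[[Interval], str],
           non_speech: Set[str]) -> List[Interval]:
    """Drop merged segments that the classifier assigns to a non-speech category.

    `classify` stands in for an audio classification network.
    """
    return [seg for seg in segments if classify(seg) not in non_speech]


if __name__ == "__main__":
    windows = split_with_overlap(total=4.0, length=1.0, hop=0.5)
    flags = [True, True, False, False, True, True, True, False]  # stub VAD output
    combined = merge_speech(windows, flags)
    # Stub classifier: pretend everything from 2.0 s onward is music.
    kept = refine(combined, lambda seg: "music" if seg[0] >= 2.0 else "speech",
                  non_speech={"music", "noise"})
    print(combined, kept)
```

The start and end of each segment that survives `refine` would then serve as the beginning and ending timestamps compared against the subtitle timestamps.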

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G10L15/083 · Primary

    Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

  • for displaying subtitles · CPC title

  • using artificial neural networks · CPC title

  • Discriminating between voiced and unvoiced parts of speech signals (G10L25/90 takes precedence) · CPC title

  • for comparison or discrimination · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11538461B1 cover?
Some implementations include methods for detecting missing subtitles associated with a media presentation and may include receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a rec…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/083. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 27, 2022 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).