US11538461B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11538461-B1 |
| Application number | US-202117249930-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 18, 2021 |
| Priority date | Mar 18, 2021 |
| Publication date | Dec 27, 2022 |
| Grant date | Dec 27, 2022 |
Official abstract text for this publication.
Some implementations include methods for detecting missing subtitles associated with a media presentation and may include receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a recurrent neural network and a convolutional neural network to identify refined speech segments associated with the audio sequence, the recurrent neural network trained based on a plurality of languages, the convolutional neural network trained based on a plurality of categories of sound; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.
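The final step of the abstract, comparing speech timestamps against subtitle timestamps, can be illustrated with a minimal interval-comparison sketch. This is not the patented implementation: the function name `find_uncovered_speech`, the `(start, end)` tuple representation in seconds, and the `min_gap` tolerance are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds); representation is an assumption


def find_uncovered_speech(
    speech: List[Interval],
    subtitles: List[Interval],
    min_gap: float = 0.5,  # hypothetical tolerance, not taken from the patent
) -> List[Interval]:
    """Return the parts of each speech interval that no subtitle interval covers."""
    subs = sorted(subtitles)
    missing: List[Interval] = []
    for start, end in sorted(speech):
        cursor = start  # earliest point in this speech interval not yet covered
        for s_start, s_end in subs:
            if s_end <= cursor:
                continue  # subtitle ends before the uncovered region begins
            if s_start >= end:
                break  # subtitle starts after this speech interval ends
            if s_start - cursor >= min_gap:
                missing.append((cursor, s_start))
            cursor = max(cursor, s_end)
            if cursor >= end:
                break
        if end - cursor >= min_gap:
            missing.append((cursor, end))
    return missing


speech = [(0.0, 4.0), (10.0, 12.0)]
subtitles = [(0.0, 2.0), (3.0, 4.0)]
gaps = find_uncovered_speech(speech, subtitles)
print(gaps)
```

In this example the second speech interval has no subtitle at all, and the first has a one-second stretch between two subtitles, so both are reported as candidate missing subtitles.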
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: selecting a media presentation that includes a subtitle component and an audio component, the audio component including an audio sequence with associated audio timestamps, the subtitle component including subtitles with associated subtitle timestamps; dividing the audio sequence into a plurality of audio segments with an overlap between consecutive audio segments, each of the plurality of audio segments having a first duration; evaluating each of the plurality of audio segments using a voice activity detection (VAD) network to identify speech segments and non-speech segments; combining consecutive speech segments to form a plurality of combined speech segments, each of the plurality of the combined speech segments having a second duration, wherein the second duration is configured to be longer than the first duration; classifying each of the plurality of combined speech segments by an audio classification (AC) network to a category of sound from a plurality of categories of sound; based on the classifying by the AC network, identifying one or more of the combined speech segments classified to a non-speech category of sound; identifying one or more of the combined speech segments as speech based on the one or more combined speech segments classified to the non-speech category of sound; determining first audio timestamp of the audio timestamps associated with the one or more combined speech segments identified as speech; and generating a notification indicating missing subtitles from the subtitle component based on comparing the first audio timestamp associated with the one or more combined speech segments identified as speech and the subtitle timestamps associated with the subtitle component.
2. The method of claim 1, wherein the VAD network is configured to perform operations associated with a recurrent neural network, and wherein the AC network is configured to perform operations associated with a convolutional neural network.
3. The method of claim 2, wherein the determining the timestamps associated with the one or more combined speech segments identified as speech comprises determining a beginning timestamp and an ending timestamp for each of the one or more combined speech segments identified as speech, and wherein the generating of the notification indicating the missing subtitles from the subtitle component comprises comparing the beginning timestamp and the ending timestamp for each of the one or more combined speech segments identified as speech with a beginning timestamp and an ending timestamp for each subtitle in the subtitle component to identify the missing subtitles from the subtitle component.
4. A computer-implemented method comprising: receiving an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluating the plurality of audio segments using a combination of a first neural network and a second neural network to identify refined speech segments associated with the audio sequence, the first neural network trained based on a plurality of languages, the second neural network trained based on a plurality of categories of sound, wherein the first neural network evaluates a first set of audio segments of the plurality of audio segments, each audio segment in the first set of audio segments having a first duration, and wherein the second neural network evaluates a second set of audio segments of the plurality of audio segments, each audio segment in the second set of audio segments having a second duration, the second duration being longer than the first duration; determining timestamps associated with the identified refined speech segments; and determining missing subtitles based on the timestamps associated with the identified refined speech segments and timestamps associated with subtitles included in the subtitle component.
5. The method of claim 4, wherein the first neural network is a recurrent neural network and the second neural network is a convolutional neural network, and wherein the evaluating of the plurality of audio segments using the combination of the first neural network and the second neural network comprises evaluating the plurality of audio segments using the first neural network in sequence with the second neural network.
6. The method of claim 5, wherein each of the plurality of audio segments has a first duration, and wherein an overlap exists between two consecutive audio segments.
7. The method of claim 6, wherein the recurrent neural network is configured to identify each of the plurality of audio segments as either a speech segment or a non-speech segment, wherein consecutive speech segments identified by the recurrent neural network are combined to generate a plurality of combined speech segments, each of the plurality of combined speech segments having a second duration, the second duration being longer than the first duration.
8. The method of claim 7, wherein each of the plurality of combined speech segments is non-overlapping with any other combined speech segment, and wherein the convolutional neural network is configured to classify each of the plurality of the combined speech segments to a category of sound to identify the refined speech segments.
9. The method of claim 4, wherein the first neural network is a recurrent neural network and the second neural network is a convolutional neural network, and wherein the first neural network is configured to evaluate the plurality of audio segments independently of the second neural network.
10. The method of claim 9, wherein consecutive audio segments in the first set of audio segments are partially overlapped, and wherein the recurrent neural network is configured to identify each audio segment in the first set of audio segments as either a speech segment or a non-speech segment.
11. The method of claim 10, wherein the convolutional neural network is configured to classify each audio segment in the second set of audio segments to a category of sound from a plurality of categories of sound and to identify whether an audio segment in the second set of audio segment is a speech segment or a non-speech segment based on a probability value associated with a category of sound.
12. The method of claim 11, wherein the speech segments associated with the first set of audio segments and the speech segments associated with the second set of audio segments are used to identify the refined speech segments associated with the audio sequence.
13. A system comprising: memory configured to store computer-executable instructions; and at least one computer processor configured to access the memory and execute the computer-executable instructions to: receive an audio component and a subtitle component associated with a media presentation, the audio component including an audio sequence, the audio sequence divided into a plurality of audio segments; evaluate the plurality of audio segments using a combination of a first neural network and a second neural network to identify refined speech segments associated with the audio sequence, the first neural network trained based on a plurality of languages, the second neural network trained based on a plurality of categories of sound, wherein the first neural network evaluates a first set of audio segments of the plurality of audio segments, each audio segment in the first set of audio segments having a first duration, and wherein the second neural network evaluates a second set of audio segments of the plurality of audio segme
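The sequential flow described in claims 1 and 5 through 8 (overlapping short segments, a speech/non-speech decision per segment, merging consecutive speech segments into longer ones, then filtering by sound category) can be sketched as below. This is only an illustration under stated assumptions: the function names, the millisecond integer timestamps, the 800 ms window with a 400 ms hop, and the hard-coded labels and stand-in classifier are all hypothetical; the VAD and AC networks themselves are not modeled.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int]  # (start_ms, end_ms); integer milliseconds are an assumption


def split_with_overlap(total_ms: int, seg_ms: int, hop_ms: int) -> List[Window]:
    """Cut [0, total_ms) into windows of seg_ms; hop_ms < seg_ms makes neighbors overlap."""
    windows = []
    start = 0
    while start < total_ms:
        windows.append((start, min(start + seg_ms, total_ms)))
        start += hop_ms
    return windows


def merge_speech_runs(windows: List[Window], is_speech: List[bool]) -> List[Window]:
    """Combine each run of consecutive speech windows into one longer segment."""
    merged: List[Window] = []
    run_start = None
    run_end = None
    for (start, end), speech in zip(windows, is_speech):
        if speech:
            if run_start is None:
                run_start = start
            run_end = end
        elif run_start is not None:
            merged.append((run_start, run_end))
            run_start = None
    if run_start is not None:
        merged.append((run_start, run_end))
    return merged


def refine(segments: List[Window], classify: Callable[[Window], str]) -> List[Window]:
    """Keep only combined segments that the (stand-in) classifier labels as speech."""
    return [seg for seg in segments if classify(seg) == "speech"]


windows = split_with_overlap(total_ms=2000, seg_ms=800, hop_ms=400)
labels = [True, True, False, True, True]  # placeholder for per-window VAD output
combined = merge_speech_runs(windows, labels)
# Placeholder for the AC network: pretend the second combined segment is music.
refined = refine(combined, lambda seg: "music" if seg[0] == 1200 else "speech")
print(windows, combined, refined)
```

The timestamps of the segments left in `refined` are what would then be compared against the subtitle timestamps.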
Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title
for displaying subtitles · CPC title
using artificial neural networks · CPC title
Discriminating between voiced and unvoiced parts of speech signals (G10L25/90 takes precedence) · CPC title
for comparison or discrimination · CPC title