Language agnostic automated voice activity detection

US11205445B1 · US · B1

Patent metadata
Field: Value
Publication number: US-11205445-B1
Application number: US-201916436351-A
Country: US
Kind code: B1
Filing date: Jun 10, 2019
Priority date: Jun 10, 2019
Publication date: Dec 21, 2021
Grant date: Dec 21, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and computer-readable media are disclosed for systems and methods for language agnostic automated voice activity detection. Example methods may include determining an audio file associated with video content, generating a number of audio segments using the audio file, the plurality of audio segments including a first segment and a second segment, where the first segment and the second segment are consecutive segments. Example methods may include determining, using a Gated Recurrent Unit neural network, that the first segment includes first voice activity, determining, using the Gated Recurrent Unit neural network, that the second segment includes second voice activity, and determining that voice activity is present between a first timestamp associated with the first segment and a second timestamp associated with the second segment.
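The segment-and-merge step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the fixed segment length, the function names `segment_bounds` and `merge_voiced`, and the list of per-segment decisions (standing in for the output of the Gated Recurrent Unit network) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def segment_bounds(total_seconds: float, segment_seconds: float) -> List[Interval]:
    """Split [0, total_seconds] into consecutive, non-overlapping segments.

    Assumption: fixed-length segments. The abstract only says that segments
    are generated from the audio file.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    count = math.ceil(total_seconds / segment_seconds)
    return [
        (i * segment_seconds, min((i + 1) * segment_seconds, total_seconds))
        for i in range(count)
    ]


def merge_voiced(bounds: Sequence[Interval], voiced: Sequence[bool]) -> List[Interval]:
    """Merge runs of consecutive voice-active segments into single intervals.

    `voiced[i]` is a hypothetical per-segment decision, e.g. produced by a
    trained classifier; no model is included in this sketch.
    """
    if len(bounds) != len(voiced):
        raise ValueError("bounds and voiced must have the same length")
    intervals: List[Interval] = []
    run_start = None
    run_end = None
    for (start, end), has_voice in zip(bounds, voiced):
        if has_voice:
            if run_start is None:
                run_start = start  # a new run of voice activity begins
            run_end = end          # extend the current run
        elif run_start is not None:
            intervals.append((run_start, run_end))
            run_start = None
    if run_start is not None:      # close a run that reaches the last segment
        intervals.append((run_start, run_end))
    return intervals


bounds = segment_bounds(5.0, 1.0)
decisions = [True, True, False, True, False]  # made-up classifier output
print(merge_voiced(bounds, decisions))
```

Here the first two segments are consecutive and both voice-active, so they collapse into one interval running from the first segment's start timestamp to the second segment's end timestamp.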

First claim

Opening claim text (preview).

That which is claimed is:

1. A method comprising: determining, by one or more computer processors coupled to memory, a subtitle file and a first audio file for a first movie, the subtitle file comprising subtitle data representing dialogue that occurs in the first movie, wherein the subtitle file comprises text data and corresponding timestamp data indicative of when certain text is to be presented as a subtitle; extracting the timestamp data from the subtitle file; training a gated recurrent unit neural network using the timestamp data and the audio file, wherein the gated recurrent unit neural network is configured to determine whether human speech is present in an audio segment; determining a second audio file for a second movie; generating a first audio segment and a second audio segment using the second audio file; determining, using the gated recurrent unit neural network, that human speech is present in the first audio segment; determining, using the gated recurrent unit neural network, that human speech is not present in the second audio segment; generating a speech not present label for association with the second audio segment; determining a first timestamp corresponding to a start of the first audio segment, and a second timestamp corresponding to an end of the first audio segment; generating a speech present label for association with the first timestamp and the second timestamp; determining a third timestamp corresponding to a start of the second audio segment, and a fourth timestamp corresponding to an end of the second audio segment; and generating a speech not present label for association with the third timestamp and the fourth timestamp.

2. The method of claim 1, further comprising: generating an empty subtitle file comprising an indication that speech is present between the first timestamp and the second timestamp, wherein the empty subtitle file does not include a transcription of the speech.

3. The method of claim 1, further comprising: modifying the first audio file to include random background noise; determining that the timestamp data indicates speech is present for a length of time that exceeds a threshold; determining a first portion of audio corresponding to the speech; determining that the speech is not present for the duration of the length of time; and determining adjusted timestamp data indicative of a second portion of the audio for which speech is not present.

4. The method of claim 1, further comprising: generating a first spectrogram using the first audio segment, and a second spectrogram using the second audio segment; and processing the first spectrogram and the second spectrogram using the gated recurrent unit neural network; wherein determining, using the gated recurrent unit neural network, that human speech is present in the first audio segment comprises: determining, using the gated recurrent unit neural network, a first probability value indicative of speech being present in the first audio segment; determining, using the gated recurrent unit neural network, a second probability value indicative of speech not being present in the first audio segment; and determining that the first probability value is greater than the second probability value.

5. A method comprising: determining, by one or more computer processors coupled to memory, an audio file associated with video content; generating a plurality of audio segments using the audio file, the plurality of audio segments comprising a first segment and a second segment, wherein the first segment and the second segment are consecutive segments; determining, using a recurrent neural network, that the first segment comprises first voice activity; determining, using the recurrent neural network, that the second segment comprises second voice activity; determining that voice activity is present between a first timestamp associated with the first segment and a second timestamp associated with the second segment; and generating an empty subtitle file comprising an indication that the voice activity is present between the first timestamp and the second timestamp.

6. The method of claim 5, wherein the empty subtitle file does not include a transcription of the voice activity.

7. The method of claim 5, further comprising: determining that a density of the first voice activity is equal to or greater than a threshold; and generating a high speech density notification associated with the first timestamp in the empty subtitle file.

8. The method of claim 5, wherein the video content is a first version of the video content, the audio file is a first audio file, and the empty subtitle file is a first empty subtitle file, the method further comprising: determining a second audio file associated with a second version of the video content; generating a second empty subtitle file for the second version using the second audio file; determining that there is a discrepancy between the first empty subtitle file and the second empty subtitle file; and generating a manual review notification.

9. The method of claim 5, further comprising: generating a first spectrogram using the first segment, and a second spectrogram using the second segment; processing the first spectrogram and the second spectrogram using the recurrent neural network; and associating a first voice activity present label with the first segment, and a second voice activity present label with the second segment.

10. The method of claim 5, wherein the plurality of audio segments further comprises a third segment, the method further comprising: determining, using the recurrent neural network, that the third segment does not comprise voice activity; and associating a first voice activity present label with the first segment, a second voice activity present label with the second segment, and a voice activity not present label with the third segment.

11. The method of claim 5, wherein determining, using the recurrent neural network, that the first segment comprises first voice activity comprises: determining, using the recurrent neural network, a first probability value indicative of voice activity being present in the first segment; determining, using the recurrent neural network, a second probability value indicative of voice activity not being present in the first segment; and determining that the first probability value is greater than the second probability value.

12. The method of claim 5, wherein the first voice activity is in a first language, and the second voice activity is in a second language.

13. The method of claim 5, wherein the first segment and the second segment are at least partially overlapping segments.

14. The method of claim 5, wherein the recurrent neural network is a gated recurrent unit neural network.

15. The method of claim 5, wherein the recurrent neural network is trained using data that is processed with an unsupervised filtering method to correct label noise.

16. A system comprising: memory configured to store computer-executable instructions; and at least one computer processor configured to access the memory and execute the computer-executable instructions to: determine an audio file associated with video content; generate a plurality of audio segments using the audio file, the plurality of audio segments comprising a first segment and a second segment, wherein the first segment and the second segment are consecutive segments; determine, using a recurrent neural network, that the first segment comprises first voice activity; determine, using the rec
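Claims 2, 5, and 6 describe an "empty subtitle file" that marks where speech occurs without transcribing it. The Python sketch below shows one way such a file might be written. The SRT-style layout, the `[speech]` placeholder, and the function names `format_timestamp` and `empty_subtitle_file` are assumptions for illustration; the claims do not name a file format.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format a non-negative number of seconds as HH:MM:SS,mmm (SRT style)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def empty_subtitle_file(
    intervals: Iterable[Tuple[float, float]],
    placeholder: str = "[speech]",  # hypothetical marker, not from the patent
) -> str:
    """Build subtitle text with one cue per voice-activity interval.

    Each cue carries only timestamps and a placeholder, so the result
    indicates where speech is present without including a transcription.
    """
    cues = []
    for index, (start, end) in enumerate(intervals, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{placeholder}\n"
        )
    return "\n".join(cues)


# Made-up intervals in seconds, e.g. the output of a voice activity detector.
print(empty_subtitle_file([(0.0, 2.0), (3.0, 4.0)]))
```

A file of this shape could serve as a timing template for a human subtitler or translator. The density notification in claim 7 could be carried as extra placeholder text on the relevant cue, though that too would be a design choice rather than something the claims specify.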

Assignees

Amazon Tech Inc

Inventors

Classifications

  • involving timestamps for synchronizing content

  • by decomposing the content in the time domain, e.g. in time segments

  • for displaying subtitles

  • using neural networks, e.g. processing the feedback provided by the user

  • involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (arrangements characterised by components specially adapted for monitoring, identification or recognition of audio in broadcast systems H04H60/58)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11205445B1 cover?
Systems, methods, and computer-readable media are disclosed for systems and methods for language agnostic automated voice activity detection. Example methods may include determining an audio file associated with video content, generating a number of audio segments using the audio file, the plurality of audio segments including a first segment and a second segment, where the first segment and th…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L25/84. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 21, 2021 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).