Adjusting audio and non-audio features based on noise metrics and speech intelligibility metrics
US-2023010466-A1 · Jan 12, 2023 · US
US12400676B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12400676-B2 |
| Application number | US-202217846864-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 22, 2022 |
| Priority date | Dec 23, 2019 |
| Publication date | Aug 26, 2025 |
| Grant date | Aug 26, 2025 |
Official abstract text for this publication.
A method comprises: obtaining a mixed soundtrack that includes dialogue mixed with non-dialogue sound; converting the mixed soundtrack to comparison text; obtaining reference text for the dialogue as a reference for intelligibility of the dialogue; determining a measure of intelligibility of the dialogue of the mixed soundtrack to a listener based on a comparison of the comparison text against the reference text; and reporting the measure of intelligibility of the dialogue.
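As a rough illustration of the comparison step described in the abstract, the sketch below scores intelligibility as one minus the word error rate between the reference text and the comparison text. The abstract does not name a specific metric, so the word-level edit distance, the tokenizer, and the function names (`tokenize`, `word_edit_distance`, `intelligibility`) are assumptions made for illustration. The speech-to-text conversion is assumed to have already produced the comparison text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(ref_words, cmp_words):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(cmp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, cmp_word in enumerate(cmp_words, start=1):
            cost = 0 if ref_word == cmp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a comparison word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def intelligibility(reference_text, comparison_text):
    """Hypothetical measure: 1 - word error rate, clamped to [0, 1]."""
    ref_words = tokenize(reference_text)
    cmp_words = tokenize(comparison_text)
    if not ref_words:
        # No reference dialogue: perfect only if nothing was recognized.
        return 1.0 if not cmp_words else 0.0
    error_rate = word_edit_distance(ref_words, cmp_words) / len(ref_words)
    return max(0.0, 1.0 - error_rate)


if __name__ == "__main__":
    reference = "Meet me at the station"
    comparison = "meet me the nation"  # stand-in for ASR output of the mix
    print(f"intelligibility: {intelligibility(reference, comparison):.2f}")
```

A score near 1 means the recognized text closely tracks the reference; a lower score suggests the non-dialogue sound is masking the dialogue.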
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving an original mixed soundtrack that includes dialogue mixed with non-dialogue sound; acoustically modifying, by an audio-visual device, the original mixed soundtrack with emulated sound effects to produce a mixed soundtrack, wherein the emulated sound effects emulate frequency responses of one or more room acoustics, sound reproduction system playback acoustics, or background noise; converting the mixed soundtrack to comparison text using automatic speech recognition (ASR); obtaining reference text for the dialogue as a reference for intelligibility of the dialogue; determining a measure of intelligibility of the dialogue of the mixed soundtrack to a listener based on a comparison of the comparison text against the reference text, wherein the determining the measure of intelligibility of the dialogue includes: computing individual measures of intelligibility of the dialogue for time slices of the mixed soundtrack based on the comparison by determining differences between segments of the comparison text corresponding to the time slices of the mixed soundtrack and corresponding segments of the reference text; and computing the measure of intelligibility of the dialogue based on the individual measures of intelligibility of the dialogue; determining whether the measure of intelligibility of the dialogue indicates a degraded intelligibility; and in response to the measure of intelligibility of the dialogue indicating the degraded intelligibility, producing a second mixed soundtrack.
2. The method of claim 1, wherein the reporting includes: displaying the measure of intelligibility of the dialogue and the individual measures of intelligibility of the dialogue.
3. The method of claim 1, wherein the reporting includes: displaying the measure of intelligibility of the dialogue, the individual measures of intelligibility of the dialogue, the segments of the comparison text, and the corresponding ones of the segments of the reference text.
4. The method of claim 1, further comprising: generating metadata configured for a digital reproduction device and that includes at least the individual measures of intelligibility of the dialogue.
5. The method of claim 1, wherein: the reference text includes chunks of subtitle text that span respective time intervals; and the determining the measure of intelligibility includes determining individual differences between (i) segments of the comparison text corresponding to the time slices of the mixed soundtrack, and (ii) corresponding ones of the chunks of subtitle text that convey common dialogue to the segments of the comparison text.
6. The method of claim 5, further comprising: matching the segments of the comparison text to the corresponding ones of the chunks of subtitle text using a text matching algorithm that maximizes text similarity between each of the segments of the comparison text and matching ones of the chunks of subtitle text, wherein the determining the individual differences includes determining the individual differences based on results of the matching.
7. The method of claim 1, wherein the obtaining the reference text includes converting a dialogue-only soundtrack to the reference text.
8. The method of claim 1, wherein the obtaining the reference text includes receiving text-based subtitles of the dialogue as the reference text.
9. The method of claim 1, wherein the converting includes: using a machine-learning dialogue extractor, extracting the dialogue from the mixed soundtrack to produce a predominantly dialogue soundtrack; and converting the predominantly dialogue soundtrack to the comparison text.
10. The method of claim 1, wherein the determining the measure of intelligibility of the dialogue includes computing a difference between the comparison text and the reference text, and computing the measure of intelligibility of the dialogue based on the difference.
11. The method of claim 10, wherein the computing the difference includes computing the difference as a text distance representative of differences in letters or words, or as a phonetic text distance representative of differences in sound.
12. The method of claim 10, wherein the computing the difference includes: computing a first difference between the comparison text and the reference text using a first compare algorithm; computing a second difference between the comparison text and the reference text using a second compare algorithm that is different from the first compare algorithm; and computing the difference as a weighted combination of the first difference and the second difference.
13. An apparatus comprising: a processor configured to: receive an original mixed soundtrack that includes dialogue mixed with non-dialogue sound; acoustically modify, by an audio-visual device, the original mixed soundtrack with emulated sound effects to produce a mixed soundtrack, wherein the emulated sound effects emulate frequency responses of one or more room acoustics, sound reproduction system playback acoustics, or background noise; convert the mixed soundtrack to comparison text using automatic speech recognition (ASR); obtain reference text for the dialogue as a reference for intelligibility of the dialogue to a listener; compute individual measures of intelligibility of the dialogue of the mixed soundtrack based on a comparison between the comparison text and the reference text by determining differences between segments of the comparison text corresponding to time slices of the mixed soundtrack and corresponding segments of the reference text; compute an overall measure of intelligibility of the dialogue of the mixed soundtrack based on the individual measures of intelligibility of the dialogue; generate a report including the overall measure of intelligibility of the dialogue; determine whether the measure of intelligibility of the dialogue indicates a degraded intelligibility; and in response to the measure of intelligibility of the dialogue indicating the degraded intelligibility, producing a second mixed soundtrack.
14. The apparatus of claim 13, wherein the processor is configured to obtain the reference text by receiving text-based subtitles of the dialogue as the reference text.
15. A non-transitory computer readable medium encoded with instructions that, when executed by a processor, cause the processor to: receive an original mixed soundtrack that includes dialogue mixed with non-dialogue sound; acoustically modify, by an audio-visual device, the original mixed soundtrack with emulated sound effects to produce a mixed soundtrack, wherein the emulated sound effects emulate frequency responses of one or more room acoustics, sound reproduction system playback acoustics, or background noise; convert time slices of the mixed soundtrack to comparison text using automatic speech recognition (ASR); obtain reference text for the dialogue as a reference for intelligibility of the dialogue; compute individual measures of intelligibility of the dialogue of the mixed soundtrack for the time slices based on differences between the comparison text and the reference text by determining differences between segments of the comparison text corresponding to the time slices of the mixed soundtrack and corresponding segments of the reference text; compute an overall measure of intelligibility of the dialogue of the mixed soundtrack based on the individual measures of intelligibility of the dialogue; generate a report including the overall measure of intelligibility of the dialogue and the individual measures
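The per-time-slice scoring, the weighted combination of two compare algorithms (claim 12), and the degraded-intelligibility check can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the two compare algorithms (word-level and letter-level similarity from Python's `difflib`), the equal weighting, the use of a plain mean as the overall measure, and the `0.8` threshold are all hypothetical choices, and the segments are assumed to be already aligned one-to-one per time slice.

```python
from difflib import SequenceMatcher
from statistics import mean


def word_difference(reference, comparison):
    """First compare algorithm (assumed): dissimilarity over words, in [0, 1]."""
    matcher = SequenceMatcher(None, reference.lower().split(),
                              comparison.lower().split(), autojunk=False)
    return 1.0 - matcher.ratio()


def letter_difference(reference, comparison):
    """Second compare algorithm (assumed): dissimilarity over letters, in [0, 1]."""
    matcher = SequenceMatcher(None, reference.lower(), comparison.lower(),
                              autojunk=False)
    return 1.0 - matcher.ratio()


def slice_measures(reference_segments, comparison_segments, word_weight=0.5):
    """Individual intelligibility measure for each time slice.

    Each difference is a weighted combination of the two compare algorithms;
    the measure is one minus that difference.
    """
    measures = []
    for ref, cmp_text in zip(reference_segments, comparison_segments):
        difference = (word_weight * word_difference(ref, cmp_text)
                      + (1.0 - word_weight) * letter_difference(ref, cmp_text))
        measures.append(1.0 - difference)
    return measures


def overall_measure(measures):
    """Overall measure, taken here as the mean of the per-slice measures."""
    return mean(measures)


def is_degraded(overall, threshold=0.8):
    """Flag degraded intelligibility against a hypothetical threshold."""
    return overall < threshold


if __name__ == "__main__":
    reference_segments = ["we leave at dawn", "bring the map"]
    comparison_segments = ["we leave at dawn", "ring them up"]  # stand-in ASR text
    per_slice = slice_measures(reference_segments, comparison_segments)
    overall = overall_measure(per_slice)
    print("per-slice:", [round(m, 2) for m in per_slice])
    print("overall:", round(overall, 2), "degraded:", is_degraded(overall))
```

In a fuller pipeline, a degraded result would trigger production of a second mixed soundtrack, and the per-slice values would feed the report or the metadata described in claims 2 to 4.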