US9536547B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9536547-B2 |
| Application number | US-201514875092-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 5, 2015 |
| Priority date | Oct 17, 2014 |
| Publication date | Jan 3, 2017 |
| Grant date | Jan 3, 2017 |
Official abstract text for this publication.
A speaker change detection device sets first and second analysis periods before and after each of time points in a voice signal, generates, for each of the time points, a first speaker model from a distribution of features in frames in the first analysis period, and a second speaker model from a distribution of features in frames in the second analysis period, calculates, for each of the time points, a matching score representing the likelihood of similarity of features between a group of speakers in the first analysis period and a group of speakers in the second analysis period by applying the features extracted from the second analysis period to the first speaker model and applying the features extracted from the first analysis period to the second speaker model, and detects a speaker change point on the basis of the matching scores at the plurality of time points.
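The cross-application of features and models described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a single diagonal-covariance Gaussian as the speaker model (the excerpt only says the model is built from a distribution of features) and a summed average log-likelihood as the matching score (the excerpt does not give the exact formula). The names `fit_diag_gaussian`, `avg_log_likelihood`, `matching_score`, `score_time_points`, and the `window` parameter are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(frames, var_floor=1e-3):
    """Fit a diagonal Gaussian to an (n_frames, n_dims) feature array.

    Assumption: stands in for the 'speaker model' of one analysis period.
    """
    mean = frames.mean(axis=0)
    var = np.maximum(frames.var(axis=0), var_floor)  # floor avoids division by zero
    return mean, var

def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of frames under a diagonal Gaussian."""
    mean, var = model
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (frames - mean) ** 2 / var)
    return float(ll.sum(axis=1).mean())

def matching_score(before, after):
    """Apply each period's features to the other period's model and sum.

    Higher values mean the two periods look alike; lower values suggest
    that the group of speakers changed between them.
    """
    model_before = fit_diag_gaussian(before)
    model_after = fit_diag_gaussian(after)
    return (avg_log_likelihood(after, model_before)
            + avg_log_likelihood(before, model_after))

def score_time_points(features, time_points, window):
    """Score each candidate time point (a frame index) using `window`
    frames before it and `window` frames after it."""
    scores = []
    for t in time_points:
        before = features[t - window:t]   # first analysis period
        after = features[t:t + window]    # second analysis period
        scores.append(matching_score(before, after))
    return np.array(scores)
```

With features whose statistics shift at some frame, the score computed at that frame should be lower than the scores at time points whose two analysis periods cover the same speakers.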
Opening claim text (preview).
What is claimed is:

1. A speaker change detection device comprising: a processor configured to: extract features representing features of a human voice in each frame having a predetermined time length from a voice signal including a conversation between a plurality of speakers; set, for each of a plurality of different time points in the voice signal, a first analysis period before the time point and a second analysis period after the time point; generate, for each of the plurality of time points, a first speaker model representing features of voices of a group of at least two speakers speaking in the first analysis period on the basis of a distribution of the features of a plurality of frames included in the first analysis period and a second speaker model representing features of voices of a group of at least two speakers speaking in the second analysis period on the basis of a distribution of the features in a plurality of frames included in the second analysis period; calculate, for each of the plurality of time points, a matching score representing the likelihood of similarity of features between the group of speakers in the first analysis period and the group of speakers in the second analysis period by applying the features in a plurality of frames included in the second analysis period to the first speaker model and applying the features of a plurality of frames included in the first analysis period to the second speaker model; and detect a speaker change point at which a change from a group of speakers speaking before the speaker change point to another group of speakers speaking after the speaker change point occurs in the voice signal on the basis of the matching score for each of the plurality of time points.

2. The speaker change detection device according to claim 1, wherein, the detecting the speaker change point, when a local minimum matching score in a time sequence among the matching scores for the plurality of time points is lower than or equal to a predetermined threshold, detects a time point corresponding to the local minimum matching score as the speaker change point.

3. The speaker change detection device according to claim 1, wherein, the processor is further configured to: extend, when a local minimum matching score in a time sequence among the matching scores for the plurality of time points is lower than or equal to a predetermined threshold, at least one of the first analysis period and the second analysis period for a first time point corresponding to the local minimum matching score in a direction away from the first time point; update, for each extended analysis period for the first time point, one of the first speaker model and the second speaker model that corresponds to the extended analysis period on the basis of a distribution of the features in a plurality of frames included in the extended analysis period; update the matching score of the first time point, when only one of the first analysis period or the second analysis period for the first time point is extended, by applying the features in a plurality of frames included in the extended one of the first analysis period and the second analysis period for the first time point to the speaker model of the other of the first analysis period and the second analysis period and applying the features in a plurality of frames included in the other analysis period to the updated speaker model; and update the matching score of the first time point, when both of the first analysis period and the second analysis period for the first time point are extended, by applying the features in a plurality of frames included in the extended first analysis period to the updated second speaker model and applying the features in a plurality of frames included in the extended second analysis period to the updated first speaker model, wherein the detect the speaker change point detects the first time point as the speaker change point when the updated matching score is lower than or equal to the predetermined detection threshold.

4. A speaker change detection method comprising: extracting, by a processor, features representing features of human voice in each frame having a predetermined time length from a voice signal including a conversation between a plurality of speakers; setting, by the processor, for each of a plurality of different time points in the voice signal, a first analysis period before the time point and a second analysis period after the time point; generating, by the processor, for each of the plurality of time points, a first speaker model representing features of voices of a group of at least two speakers speaking in the first analysis period on the basis of a distribution of the features of a plurality of frames included in the first analysis period and a second speaker model representing features of voices of a group of at least two speakers speaking in the second analysis period on the basis of a distribution of the features in a plurality of frames included in the second analysis period; calculating, by the processor, for each of the plurality of time points, a matching score representing the likelihood of similarity of features between the group of speakers in the first analysis period and the group of speakers in the second analysis period by applying the features in a plurality of frames included in the second analysis period to the first speaker model and applying the features of a plurality of frames included in the first analysis period to the second speaker model; and detecting, by the processor, a speaker change point at which a change from a group of speakers speaking before the speaker change point to another group of speakers speaking after the speaker change point occurs in the voice signal on the basis of the matching score for each of the plurality of time points.

5. The speaker change detection method according to claim 4, wherein, the detecting the speaker change point, when a local minimum matching score in a time sequence among the matching scores for the plurality of time points is lower than or equal to a predetermined threshold, detects a time point corresponding to the local minimum matching score as the speaker change point.

6. The speaker change detection method according to claim 4, further comprising: extending, when a local minimum matching score in a time sequence among the matching scores for the plurality of time points is lower than or equal to a predetermined threshold, at least one of the first analysis period and the second analysis period for a first time point corresponding to the local minimum matching score in a direction away from the first time point; updating, for each extended analysis period for the first time point, one of the first speaker model and the second speaker model that corresponds to the extended analysis period on the basis of a distribution of the features in a plurality of frames included in the extended analysis period; updating the matching score of the first time point, when only one of the first analysis period or the second analysis period for the first time point is extended, by applying the features in a plurality of frames included in the extended one of the first analysis period and the second analysis period for the first time point to the speaker model of the other of the first analysis period and the second analysis period and applying the features in a plurality of frames included in the other analysis period to the updated speaker model; and updating the matching score of the first time point, when both of the first analysis period and the second analysis period for the first time point are extended, by applying the features in a plurality of frames included in the extended first analysis period to the updated second speaker model a
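The detection rule in claims 2 and 5 (a local minimum of the matching score that is at or below a threshold) and the re-check with extended analysis periods in claims 3 and 6 might look like the sketch below. It is an illustration under stated assumptions: the function names, the handling of sequence endpoints (skipped here), the tie-breaking on flat minima, and the `score_fn` callable that computes a matching score from two spans of features are all hypothetical and not taken from the patent text. The sketch also extends both periods at once, whereas the claims allow extending only one.

```python
def detect_change_points(scores, time_points, threshold):
    """Return time points whose score is a local minimum at or below threshold.

    Assumption: the first and last time points are never reported because
    they have only one neighbour to compare against.
    """
    changes = []
    for i in range(1, len(scores) - 1):
        # Strict on the left, non-strict on the right, so a flat minimum
        # is reported once (an arbitrary tie-breaking choice).
        is_local_min = scores[i] < scores[i - 1] and scores[i] <= scores[i + 1]
        if is_local_min and scores[i] <= threshold:
            changes.append(time_points[i])
    return changes

def confirm_with_extended_periods(features, t, window, extension,
                                  score_fn, threshold):
    """Re-score a candidate time point with both periods lengthened.

    Each period grows by `extension` frames in the direction away from t,
    clipped to the signal boundaries. `score_fn(before, after)` is a
    hypothetical callable that rebuilds both models and returns the
    updated matching score.
    """
    start = max(0, t - window - extension)
    end = min(len(features), t + window + extension)
    updated_score = score_fn(features[start:t], features[t:end])
    return updated_score <= threshold
```

A candidate found by `detect_change_points` would then be kept only if `confirm_with_extended_periods` also returns `True` for it.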
characterised by the type of analysis window · CPC title
Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices · CPC title
for comparison or discrimination · CPC title
Decision making techniques; Pattern matching strategies · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title