Word-level blind diarization of recorded calls with arbitrary number of speakers

US9875742B2 · US · B2

Patent metadata
Publication number: US-9875742-B2
Application number: US-201615006572-A
Country: US
Kind code: B2
Filing date: Jan 26, 2016
Priority date: Jan 26, 2015
Publication date: Jan 23, 2018
Grant date: Jan 23, 2018

Abstract

Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first pass-blind diarization is on a per-frame basis and the second pass-blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
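The idea of building a signature "based only on the statistical models" can be illustrated with a small sketch. Claim 1 describes speaker models as GMMs over a common set of Gaussians that differ only in their mixture probabilities, an undirected similarity graph thresholded on model similarity, and an occupancy attached to each model. Everything else below is an assumption made for illustration: the Hellinger distance stands in for the patent's distance, the largest connected component is treated as the common speaker, and the super-model is formed as an occupancy-weighted average of mixture weights. The second super-GMM, which the claim says is trained over generated random vectors, is not sketched here.

```python
import math
from collections import deque

def hellinger(p, q):
    """Distance between two mixture-weight vectors (assumed stand-in for the patent's distance)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def similarity_graph(weights, threshold):
    """Undirected graph: two speaker models share an edge when their distance is below threshold."""
    n = len(weights)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if hellinger(weights[i], weights[j]) < threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_component(adj):
    """Vertices of the largest connected component (assumed here to be the common speaker)."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if len(comp) > len(best):
            best = comp
    return sorted(best)

def super_weights(weights, occupancies, members):
    """Occupancy-weighted average of mixture weights over the selected models (an assumption)."""
    total = sum(occupancies[i] for i in members)
    k = len(weights[0])
    return [sum(occupancies[i] * weights[i][c] for i in members) / total for c in range(k)]

# Hypothetical training set: three sessions, two speaker models each,
# all defined over the same three Gaussians (only the mixture weights differ).
weights = [
    [0.70, 0.20, 0.10],  # session 1, speaker A
    [0.10, 0.10, 0.80],  # session 1, speaker B
    [0.68, 0.22, 0.10],  # session 2, speaker A
    [0.20, 0.70, 0.10],  # session 2, speaker B
    [0.72, 0.18, 0.10],  # session 3, speaker A
    [0.34, 0.33, 0.33],  # session 3, speaker B
]
occupancies = [100, 80, 300, 50, 100, 70]  # number of feature vectors behind each model

graph = similarity_graph(weights, threshold=0.1)
common = largest_component(graph)
generic = [i for i in range(len(weights)) if i not in common]
signature_common = super_weights(weights, occupancies, common)
print(common, generic)
print([round(w, 3) for w in signature_common])
```

In this toy data the three models that recur across sessions end up connected, and the remaining models form the generic set. The threshold value of 0.1 is arbitrary and would need tuning on real data.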

Claims

What is claimed is:

1. A method of creating an acoustic signature for a speaker from multiple audio sessions and for performing diarization, the method comprising: receiving, from an audio data source, audio data at an audio communications interface of a computing system, the audio data defining a training set containing a number of recorded audio sessions, wherein the computing system is configured to construct, from each audio session, a plurality of respective speaker models, wherein each speaker model is characterized by aggregating acoustic features into respective feature vectors that define a respective occupancy which is proportional to a total number of feature vectors used to construct the speaker model, and wherein the speaker models are Gaussian mixture models (GMMs) defined over a common set of Gaussian distributions that differ only by respective mixture probabilities for the acoustic features present in the feature vectors; classifying the plurality of speaker models to identify a set of common speaker GMMs and a set of generic speaker GMMs, wherein the classifying includes constructing an undirected similarity graph having vertices corresponding to the plurality of respective speaker models of all the recorded audio sessions in the training set and classifying the plurality of speaker models according to a degree of similarity between the corresponding vertices in the undirected similarity graph in relation to at least one threshold degree of similarity; generating an acoustic signature by at least: constructing a super-GMM for the set of common speaker GMMs, and constructing a second super-GMM for the set of generic speaker GMMs by generating a set of random vectors and training a second GMM over these random vectors, wherein a respective acoustic signature for a common speaker is given as a super-model pair of the two constructed super-GMMs; storing the two constructed super-GMMs in a computing system memory; receiving additional audio data at the audio communications interface; identifying the common speaker using the super-model pair; and labeling the additional audio data with an identified common speaker label.

2. A The method according to claim 1, wherein the audio data comprises one of a .WAV format, a PCM format, and a LPCM format.

3. The method according to claim 1, further comprising, decoding at least one segment of the audio data, transcribing the segment of audio data, producing a diarized transcript, and using a voice activity detector to classify the audio data into speech and non-speech segments by assessing a dynamic energy range for each segment of audio data.

4. The method according to claim 1, wherein the audio data originates from a recording stored on a server.

5. The method according to claim 1, wherein the audio data is a real time stream of audio data.

6. The method according to claim 1, further comprising displaying the undirected graph on a graphical display connected to the computing system.

7. The method of for creating a plurality of acoustic signatures and for performing diarization, comprising: receiving audio data at a communication interface of a computing system on a frame by frame basis, creating a speech to text transcription of the audio data; clustering respective segments of the audio data according to word sequences; classifying the segments to identify a set of common speaker Gaussian mixture models (GMMs) and a set of generic speaker GMMs, wherein the classifying includes constructing an undirected similarity graph having vertices corresponding to a plurality of speaker models of previously recorded audio sessions in a training set; wherein the classifying further includes determining with a processor in the computing system a degree of similarity between the corresponding vertices in the undirected similarity graph in relation to at least one threshold degree of similarity; generating an acoustic signature by at least: constructing a super-GMM for the set of common speaker GMMs, and constructing a second super-GMM for the set of generic speaker GMMs by generating a set of random vectors and training a GMM over these random vectors, wherein the acoustic signature for respective common speakers is given as a super-model pair of the two constructed super-GMMs; and storing the two constructed super-GMMs in a computing system memory; receiving additional audio data at the communication interface; identifying a respective common speaker using the super-model pair; and labeling the additional audio data with an identified common speaker label.

8. The method according to claim 7, further comprising utilizing a diagonal Gaussian distribution for the clusters to calculate the log likelihood.

9. The method according to claim 7, wherein the prerecorded training sets reside on a server.

10. The method according to claim 7, further comprising filtering short utterances as background audio.

11. The method according to claim 7, further comprising filtering out short utterances on a time duration basis.

12. The method according to claim 7, further comprising classifying the clusters according to Mel-frequency cepstral coefficients (MFCC) for each frame.

13. The method according to claim 13, further comprising: determining a cluster of segments to be comprised of respective utterances and representing a distribution of feature vectors in the respective utterances; characterizing each feature vector in terms of its probability of being present in one of the clusters; calculating a distance metric between utterances according to the probability; identifying time between speakers in the audio stream.

14. The method according to claim 13, further comprising: using distances between utterances to construct an affinity matrix based upon respective distances; computing a stochastic matrix from the affinity matrix; computing eigenvalues and corresponding eigenvectors of the stochastic matrix; and computing an embedding of the utterances into dimensional vectors; identifying embedded utterances in a frame as an additional speaker or as additional background audio.

15. The method according to claim 7, further comprising using the processor to determine the degree of similarity by calculating a distance (δ) between the corresponding vertices of the speaker models.

16. The method according to claim 7, further comprising displaying the undirected graph on a graphical display connected to the computing system.
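The sequence in claim 14 (distances, affinity matrix, stochastic matrix, eigenvectors, embedding) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: the Gaussian kernel and its `sigma`, the toy distances, and the reduction to a one-dimensional embedding are all choices made here. The embedding coordinate is the second eigenvector of the row-stochastic matrix, found by power iteration with the constant leading eigenvector projected out.

```python
import math

def affinity_matrix(dist, sigma=1.0):
    """Gaussian-kernel affinity from pairwise distances (the kernel choice is an assumption)."""
    n = len(dist)
    return [[math.exp(-(dist[i][j] / sigma) ** 2) for j in range(n)] for i in range(n)]

def stochastic_matrix(aff):
    """Row-normalize the affinity matrix so every row sums to 1."""
    return [[a / sum(row) for a in row] for row in aff]

def embed_1d(aff, iters=200):
    """One embedding coordinate per utterance: the second eigenvector of the stochastic matrix.

    Power iteration is used; the constant eigenvector (eigenvalue 1) is removed
    at each step using the stationary distribution as the weighting.
    """
    n = len(aff)
    p = stochastic_matrix(aff)
    degree = [sum(row) for row in aff]
    total = sum(degree)
    pi = [d / total for d in degree]          # stationary distribution of p
    v = [float(i) for i in range(n)]          # deterministic starting vector
    for _ in range(iters):
        mean = sum(pi[i] * v[i] for i in range(n))
        v = [x - mean for x in v]             # project out the constant eigenvector
        v = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Hypothetical utterances placed on a line; pairwise distances stand in for
# the utterance distance metric of claim 13.
positions = [0.0, 0.1, 0.2, 3.0, 3.1, 3.2]
dist = [[abs(a - b) for b in positions] for a in positions]

aff = affinity_matrix(dist)
embedding = embed_1d(aff)
labels = [0 if x < 0 else 1 for x in embedding]  # sign split as a crude two-speaker grouping
print(labels)
```

With more than two speakers, several eigenvectors would be kept and the embedded utterances clustered in that space. A numerical library would normally replace the hand-written power iteration.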

Assignees

Verint Systems Ltd

Classifications

  • Speech to text systems (G10L15/08 takes precedence)

  • for discriminating voice from noise

  • G10L17/04 (Primary): Training, enrolment or model building

  • Hidden Markov models [HMM]

  • Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9875742B2 cover?
It covers methods of diarizing audio data in two blind passes that generate speaker statistical models: a first pass on a per-frame basis and a second pass on a per-word basis. It also covers methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.
Who is the assignee on this patent?
Verint Systems Ltd
What technology area does this patent fall under?
Primary CPC classification G10L17/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 23, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).