Blind diarization of recorded calls with arbitrary number of speakers

US10109280B2 · US · B2

Patent metadata
Publication number: US-10109280-B2
Application number: US-201715839190-A
Country: US
Kind code: B2
Filing date: Dec 12, 2017
Priority date: Jul 17, 2013
Publication date: Oct 23, 2018
Grant date: Oct 23, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
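
The last two steps of the abstract (building a hidden Markov model over the speaker models and decoding a speaker sequence) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that each utterance has already been reduced to a single summary feature vector, that each speaker model is a diagonal-covariance Gaussian, and that the HMM uses a uniform start distribution with one "stay" probability for consecutive utterances from the same speaker. The names `log_gauss`, `viterbi_speakers`, and `stay_prob` are hypothetical and chosen only for illustration.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi_speakers(utterances, speaker_models, stay_prob=0.8):
    """Return the most likely speaker index for each utterance.

    utterances: list of feature vectors, one summary vector per utterance
        (an assumption of this sketch).
    speaker_models: list of (mean, var) pairs, one HMM state per speaker.
    stay_prob: assumed probability that the next utterance has the same speaker.
    """
    if not utterances:
        return []
    n = len(speaker_models)
    if n == 1:
        return [0] * len(utterances)

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def trans(i, j):
        # Log transition probability from speaker i to speaker j.
        return log_stay if i == j else log_switch

    # Initialise with a uniform prior over speakers.
    score = [-math.log(n) + log_gauss(utterances[0], m, v)
             for m, v in speaker_models]
    backpointers = []

    for x in utterances[1:]:
        new_score, pointers = [], []
        for j, (m, v) in enumerate(speaker_models):
            # Best predecessor state for speaker j at this step.
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + log_gauss(x, m, v))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Two toy speaker models and five utterance vectors (made-up values).
models = [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])]
utts = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.2], [5.1, 4.9], [0.0, 0.2]]
print(viterbi_speakers(utts, models))
```

A "stay" probability above 0.5 discourages rapid speaker switching, which is one common reason to decode with an HMM instead of assigning each utterance to its nearest cluster independently.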

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for obtaining a speaker-identified transcription from audio data of multiple speakers, the method comprising: obtaining the audio data and an unlabeled transcription of the audio data; separating the audio data into a sequence of utterances, wherein each utterance has acoustic features; clustering utterances having similar acoustic features; generating a hidden Markov model (HMM) from the clustered utterances; decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

2. The method according to claim 1, further comprising: labeling portions of the transcription corresponding to utterances of speakers that are not identified with a tag unique for each unidentified speaker.

3. The method according to claim 1, wherein the separating the audio data into a sequence of utterances comprises: separating the audio data into frames; and detecting voice activity on a frame by frame basis; and determining utterances as consecutive frames of voice activity separated by frames of no voice activity.

4. The method according to claim 3, wherein the detecting voice activity of a frame comprises: comparing a characteristic of the audio in each frame to a range or a threshold; wherein the characteristic is one or more of a mean energy, a band energy, a peakiness, and a residual energy.

5. The method according to claim 3, wherein the determining utterances further comprises: obtaining information from the transcription of the audio data; and verifying that the determined utterances correspond to the obtained information.

6. The method according to claim 5, wherein the information comprises phonemes, words, or sentences spoken by a single speaker.

7. The method according to claim 5, wherein the information comprises metadata associated with the transcription.

8. The method according to claim 1, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC).

9. The method according to claim 1, further comprising after the determining the identity of one or more of the multiple speakers: refining the acoustic voiceprint models of known speakers using the utterances of the identified speakers.

10. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to perform a method comprising: obtaining the audio data and an unlabeled transcription of the audio data; separating the audio data into a sequence of utterances, wherein each utterance has acoustic features; clustering utterances having similar acoustic features; generating a hidden Markov model (HMM) from the clustered utterances; decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

11. A system for obtaining a speaker-identified transcription from audio data of multiple speakers, the system comprising: a database storing acoustic voiceprint models of known speakers and the audio data of multiple speakers; a speech-to-text server that generates an unlabeled transcription of the audio data; and a computing device communicatively coupled to the database and the speech-to-text server, the computing device comprising a processor, wherein the processor is configured by software to: obtain the audio data and the unlabeled transcription of the audio data; separate the audio data into a sequence of utterances, wherein each utterance has acoustic features; cluster utterances having similar acoustic features; generate a hidden Markov model (HMM) from the clustered utterances; decode the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determine the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to the acoustic voiceprint models of known speakers; and label portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

12. The system according to claim 11, wherein the processor is further configured to: label portions of the transcription corresponding to utterances of speakers that are not identified with a tag unique for each unidentified speaker.

13. The system according to claim 11, wherein to separate the audio data into a sequence of utterances, the processor is configured to: separate the audio data into frames; and detect voice activity on a frame by frame basis; and determine utterances as consecutive frames of voice activity separated by frames of no voice activity.

14. The system according to claim 13, wherein to detect voice activity, the processor is configured to: compare a characteristic of the audio in each frame to a range or a threshold; wherein the characteristic is one or more of a mean energy, a band energy, a peakiness, and a residual energy.

15. The system according to claim 15, wherein to determine utterances, the processor is further configure to: obtain information from the transcription of the audio data; and verify that the determined utterances correspond to the obtained information.

16. The system according to claim 15, wherein the information comprises phonemes, words, or sentences spoken by a single speaker.

17. The system according to claim 15, wherein the information comprises metadata associated with the transcription.

18. The system according to claim 11, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC).

19. The system according to claim 11, wherein after the determining the identity of one or more of the multiple speakers, the processor is configured to: refine the acoustic voiceprint models of known speakers using the utterances of the identified speakers.
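
The frame-based segmentation recited in claims 3 and 4 (detect voice activity per frame, then treat runs of voiced frames as utterances) can be sketched as follows. This is an illustrative assumption-laden example, not the patent's implementation: it uses only mean energy compared against a single threshold, non-overlapping frames, and the hypothetical helper names `frame_energies` and `utterances_from_energy`. The claims also allow band energy, peakiness, and residual energy, which are not shown here.

```python
def frame_energies(samples, frame_len):
    """Mean energy of each complete, non-overlapping frame of frame_len samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def utterances_from_energy(energies, threshold):
    """Group consecutive voiced frames into (start, end_exclusive) frame spans.

    A frame counts as voiced when its mean energy exceeds the threshold
    (the threshold value is an assumption that would need tuning on real audio).
    """
    spans = []
    start = None
    for idx, energy in enumerate(energies):
        voiced = energy > threshold
        if voiced and start is None:
            start = idx                 # a new utterance begins
        elif not voiced and start is not None:
            spans.append((start, idx))  # a silent frame ends the utterance
            start = None
    if start is not None:
        spans.append((start, len(energies)))  # audio ended mid-utterance
    return spans


# Toy signal: silence, loud speech, silence, quieter speech (made-up values).
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [0.5] * 4
energies = frame_energies(signal, frame_len=4)
print(utterances_from_energy(energies, threshold=0.1))
```

In practice the spans would be converted from frame indices to timestamps and, as in claim 5, could be checked against word or sentence boundaries taken from the transcription.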

Assignees

Verint Systems Ltd

Inventors

Classifications

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Marking · CPC title

  • Training, enrolment or model building · CPC title

  • using speaker recognition · CPC title

  • G10L17/06 · Primary

    Decision making techniques; Pattern matching strategies · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10109280B2 cover?
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A seq…
Who is the assignee on this patent?
Verint Systems Ltd
What technology area does this patent fall under?
Primary CPC classification G10L17/06. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 23, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).