Audio diarization system that segments audio input
US-9584946-B1 · Feb 28, 2017 · US
US10109280B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10109280-B2 |
| Application number | US-201715839190-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 12, 2017 |
| Priority date | Jul 17, 2013 |
| Publication date | Oct 23, 2018 |
| Grant date | Oct 23, 2018 |
Official abstract text for this publication.
In a method of diarization of audio data, audio data is segmented into a plurality of utterances. Each utterance is represented as an utterance model representative of a plurality of feature vectors. The utterance models are clustered. A plurality of speaker models are constructed from the clustered utterance models. A hidden Markov model is constructed of the plurality of speaker models. A sequence of identified speaker models is decoded.
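The final step of the abstract, decoding a sequence of speaker models with a hidden Markov model, can be illustrated with a small Viterbi decoder. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `viterbi_speakers`, the `stay_prob` parameter, the uniform initial distribution, and the simple "stay or switch" transition model are all hypothetical choices made for illustration. The per-utterance log-likelihoods are assumed to have been computed already from the clustered speaker models.

```python
import math


def viterbi_speakers(log_likes, stay_prob=0.8):
    """Return the most likely speaker index for each utterance.

    log_likes[t][s] is the (assumed, precomputed) log-likelihood of
    utterance t under speaker model s. stay_prob is a hypothetical
    probability that consecutive utterances share a speaker.
    """
    n_speakers = len(log_likes[0])
    if n_speakers == 1:
        return [0] * len(log_likes)

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_speakers - 1))

    def trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Assumed uniform initial distribution over speakers.
    score = [ll - math.log(n_speakers) for ll in log_likes[0]]
    backpointers = []

    for frame in log_likes[1:]:
        new_score, pointers = [], []
        for s in range(n_speakers):
            best_prev = max(range(n_speakers),
                            key=lambda p: score[p] + trans(p, s))
            new_score.append(score[best_prev] + trans(best_prev, s) + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_speakers), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Utterance 2 slightly favors speaker 1, but the transition penalty
# keeps it with speaker 0; the real change happens at utterance 4.
example = [[-1, -5], [-1, -5], [-3.0, -2.9], [-1, -5], [-5, -1], [-5, -1]]
print(viterbi_speakers(example))  # [0, 0, 0, 0, 1, 1]
```

The transition penalty is what distinguishes HMM decoding from assigning each utterance to its best-scoring speaker independently: a weak, isolated preference for another speaker is smoothed away, while a sustained change is kept.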
Claim text.
The invention claimed is:

1. A method for obtaining a speaker-identified transcription from audio data of multiple speakers, the method comprising: obtaining the audio data and an unlabeled transcription of the audio data; separating the audio data into a sequence of utterances, wherein each utterance has acoustic features; clustering utterances having similar acoustic features; generating a hidden Markov model (HMM) from the clustered utterances; decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

2. The method according to claim 1, further comprising: labeling portions of the transcription corresponding to utterances of speakers that are not identified with a tag unique for each unidentified speaker.

3. The method according to claim 1, wherein the separating the audio data into a sequence of utterances comprises: separating the audio data into frames; and detecting voice activity on a frame by frame basis; and determining utterances as consecutive frames of voice activity separated by frames of no voice activity.

4. The method according to claim 3, wherein the detecting voice activity of a frame comprises: comparing a characteristic of the audio in each frame to a range or a threshold; wherein the characteristic is one or more of a mean energy, a band energy, a peakiness, and a residual energy.

5. The method according to claim 3, wherein the determining utterances further comprises: obtaining information from the transcription of the audio data; and verifying that the determined utterances correspond to the obtained information.

6. The method according to claim 5, wherein the information comprises phonemes, words, or sentences spoken by a single speaker.

7. The method according to claim 5, wherein the information comprises metadata associated with the transcription.

8. The method according to claim 1, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC).

9. The method according to claim 1, further comprising after the determining the identity of one or more of the multiple speakers: refining the acoustic voiceprint models of known speakers using the utterances of the identified speakers.

10. A non-transitory computer readable medium containing computer readable instructions that when executed by a processor of a computing device cause the computing device to perform a method comprising: obtaining the audio data and an unlabeled transcription of the audio data; separating the audio data into a sequence of utterances, wherein each utterance has acoustic features; clustering utterances having similar acoustic features; generating a hidden Markov model (HMM) from the clustered utterances; decoding the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determining the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to acoustic voiceprint models of known speakers; and labeling portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

11. A system for obtaining a speaker-identified transcription from audio data of multiple speakers, the system comprising: a database storing acoustic voiceprint models of known speakers and the audio data of multiple speakers; a speech-to-text server that generates an unlabeled transcription of the audio data; and a computing device communicatively coupled to the database and the speech-to-text server, the computing device comprising a processor, wherein the processor is configured by software to: obtain the audio data and the unlabeled transcription of the audio data; separate the audio data into a sequence of utterances, wherein each utterance has acoustic features; cluster utterances having similar acoustic features; generate a hidden Markov model (HMM) from the clustered utterances; decode the sequence of utterances using the HMM to associate each utterance with one of the multiple speakers; determine the identity of one or more of the multiple speakers by comparing the utterances associated with each of the multiple speakers to the acoustic voiceprint models of known speakers; and label portions of the transcription corresponding to utterances of identified speakers with the speaker's identity to obtain the speaker-identified transcription.

12. The system according to claim 11, wherein the processor is further configured to: label portions of the transcription corresponding to utterances of speakers that are not identified with a tag unique for each unidentified speaker.

13. The system according to claim 11, wherein to separate the audio data into a sequence of utterances, the processor is configured to: separate the audio data into frames; and detect voice activity on a frame by frame basis; and determine utterances as consecutive frames of voice activity separated by frames of no voice activity.

14. The system according to claim 13, wherein to detect voice activity, the processor is configured to: compare a characteristic of the audio in each frame to a range or a threshold; wherein the characteristic is one or more of a mean energy, a band energy, a peakiness, and a residual energy.

15. The system according to claim 13, wherein to determine utterances, the processor is further configure to: obtain information from the transcription of the audio data; and verify that the determined utterances correspond to the obtained information.

16. The system according to claim 15, wherein the information comprises phonemes, words, or sentences spoken by a single speaker.

17. The system according to claim 15, wherein the information comprises metadata associated with the transcription.

18. The system according to claim 11, wherein the acoustic features are Mel-frequency cepstral coefficients (MFCC).

19. The system according to claim 11, wherein after the determining the identity of one or more of the multiple speakers, the processor is configured to: refine the acoustic voiceprint models of known speakers using the utterances of the identified speakers.
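The frame-based segmentation recited in the claims (split the audio into frames, detect voice activity per frame, treat runs of voiced frames as utterances) can be sketched as follows. This is an illustrative sketch only: it uses mean energy, one of the characteristics the claims list, compared against a single threshold. The function names, the `frame_len` and `energy_threshold` parameters, and their default values are hypothetical and are not taken from the patent.

```python
def frame_energies(samples, frame_len):
    """Mean energy (average squared amplitude) of each full frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_utterances(samples, frame_len=4, energy_threshold=0.01):
    """Group consecutive voiced frames into utterances.

    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    A frame counts as voiced when its mean energy exceeds the
    (hypothetical) energy_threshold.
    """
    utterances = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        voiced = energy > energy_threshold
        if voiced and start is None:
            start = idx                      # an utterance begins
        elif not voiced and start is not None:
            utterances.append((start, idx))  # silence closes it
            start = None
    if start is not None:                    # audio ended mid-utterance
        utterances.append((start, len(energies)))
    return utterances


# Toy signal: silence, two voiced frames, silence, one voiced frame.
signal = ([0.0] * 4 + [0.5, -0.5, 0.5, -0.5] * 2
          + [0.0] * 4 + [0.3, -0.3, 0.3, -0.3])
print(segment_utterances(signal))  # [(1, 3), (4, 5)]
```

Real audio would use much longer frames (typically tens of milliseconds of samples) and, as the claims allow, could combine several characteristics or compare against a range instead of a single threshold.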
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Marking · CPC title
Training, enrolment or model building · CPC title
using speaker recognition · CPC title
Decision making techniques; Pattern matching strategies · CPC title