Diarization driven by the ASR based segmentation

US11120802B2 · US · B2

Patent metadata
Publication number: US-11120802-B2
Application number: US-201715819127-A
Country: US
Kind code: B2
Filing date: Nov 21, 2017
Priority date: Nov 21, 2017
Publication date: Sep 14, 2021
Grant date: Sep 14, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An approach is provided that receives an audio stream and utilizes a voice activation detection (VAD) process to create a digital audio stream of voices from at least two different speakers. An automatic speech recognition (ASR) process is applied to the digital stream with the ASR process resulting in the spoken words to which a speaker turn detection (STD) process is applied to identify a number of speaker segments with each speaker segment ending at a word boundary. A speaker clustering algorithm is then applied to the speaker segments to associate one of the speakers with each of the speaker segments.
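The last two stages described in the abstract (speaker turn detection at word boundaries, then speaker clustering) can be illustrated with a minimal sketch. It assumes the VAD and ASR stages have already run and produced timed words. The `Word` and `Segment` types, the single `pitch` value standing in for vocal qualities, the thresholds, and the greedy clustering rule are all hypothetical choices made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds
    pitch: float  # hypothetical scalar standing in for "vocal qualities"


@dataclass
class Segment:
    words: List[Word]
    speaker: str = ""

    @property
    def mean_pitch(self) -> float:
        return sum(w.pitch for w in self.words) / len(self.words)


def speaker_turn_detection(words: List[Word], gap: float = 30.0) -> List[Segment]:
    """Split the ASR word sequence into segments; cuts fall only between words."""
    segments: List[Segment] = []
    for w in words:
        # Continue the current segment while the voice feature stays close
        # to that of the previous word; otherwise start a new segment.
        if segments and abs(w.pitch - segments[-1].words[-1].pitch) < gap:
            segments[-1].words.append(w)
        else:
            segments.append(Segment([w]))
    return segments


def cluster_speakers(segments: List[Segment], gap: float = 30.0) -> List[Segment]:
    """Greedy clustering (an assumption): reuse the nearest centroid within
    `gap`, otherwise open a new speaker."""
    centroids: List[float] = []
    for seg in segments:
        near = [i for i, c in enumerate(centroids) if abs(seg.mean_pitch - c) < gap]
        if near:
            best = min(near, key=lambda i: abs(seg.mean_pitch - centroids[i]))
        else:
            centroids.append(seg.mean_pitch)
            best = len(centroids) - 1
        seg.speaker = f"Speaker {best + 1}"
    return segments


def transcript(segments: List[Segment]) -> List[str]:
    """Emit one line per speaker segment, prefixed with the speaker identifier."""
    return [f"{s.speaker}: {' '.join(w.text for w in s.words)}" for s in segments]


# Toy ASR output: two voices with clearly different pitch values.
words = [
    Word("how", 0.0, 0.2, 120.0), Word("are", 0.2, 0.4, 118.0),
    Word("you", 0.4, 0.6, 122.0), Word("fine", 0.9, 1.2, 210.0),
    Word("thanks", 1.2, 1.6, 205.0), Word("good", 2.0, 2.3, 119.0),
]
lines = transcript(cluster_speakers(speaker_turn_detection(words)))
```

Because segmentation operates on words rather than raw audio frames, a speaker change in this sketch can never split a word, which is the property the abstract emphasizes.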

First claim

Opening claim text (preview).

The invention claimed is:

1. A method implemented by an information handling system that includes a memory and a processor, the method comprising: receiving an audio stream that comprises both a plurality of speech segments corresponding to a plurality of human speakers and a plurality of non-verbal segments; utilizing a voice activation detection (VAD) process on the audio stream, wherein an output of the VAD process is a digital audio stream of voices corresponding to the plurality of speech segments; inputting the VAD process output into an automatic speech recognition (ASR) process, wherein an output of the ASR process comprises a plurality of spoken words corresponding to the plurality of speech segments and is devoid of the plurality of non-verbal segments; inputting the ASR process output to a speaker turn detection (STD) process, wherein the STD process generates a plurality of speaker segments that each end at a word boundary of one of the plurality of spoken words; and applying a speaker clustering algorithm to the plurality of speaker segments, wherein the speaker clustering algorithm associates an identifier of one of the human speakers with each of the speaker segments.

2. The method of claim 1 further comprising: generating a textual transcript of the audio stream by outputting each of the speaker segments and the identifier of the associated human speaker.

3. The method of claim 1 further comprising: ingesting the textual transcript into a question answering (QA) system corpus.

4. The method of claim 1 further comprising: identifying a plurality of sets of vocal qualities from the audio stream, wherein each of the sets of vocal qualities corresponds to a different one of the plurality of human speakers; comparing the plurality of sets of vocal qualities to each of the plurality of spoken words; and associating one of the human speakers to each of the words based on the comparison.

5. The method of claim 4 wherein a change from a first of the plurality of human speaker to a second of the plurality of human speakers is limited to word boundaries found in the plurality of spoken words.

6. The method of claim 1 wherein the speaker detection process further comprises: associating a first word from the plurality of spoken words to a first set of vocal qualities; identifying a second word from the plurality of spoken words that is successive to the first word and corresponds to a second set of vocal qualities; inserting a speaker change mark between the first word and the second word in response to determining that the first set of vocal qualities is different from the second set of vocal qualities; adjusting a speaker change probability value in response to determining that the first word is at an end of a question; and maintaining the speaker change mark between the first word and the second word based on the adjusted speaker change probability value.

7. The method of claim 6 further comprising: analyzing a selected one of the speaker segments corresponding to the first word using a language model, wherein the analysis: increases the speaker change probability value in response to the selected speaker segment indicating a statement; increases the speaker change probability value in response to the selected speaker segment indicating a reply; and decreases the speaker change probability value in response to the selected speaker segment indicating a continuation of a previous speaker segment; and identifying the second word based on the speaker change probability value and the comparison of the second word to the first set of vocal qualities.

8. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; and a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions of: receiving an audio stream that comprises both a plurality of speech segments corresponding to a plurality of human speakers and a plurality of non-verbal segments; utilizing a voice activation detection (VAD) process on the audio stream, wherein an output of the VAD process is a digital audio stream of voices corresponding to the plurality of speech segments; inputting the VAD process output into an automatic speech recognition (ASR) process, wherein an output of the ASR process comprises a plurality of spoken words corresponding to the plurality of speech segments and is devoid of the plurality of non-verbal segments; inputting the ASR process output to a speaker turn detection (STD) process, wherein the STD process generates a plurality of speaker segment that each end at a word boundary of one of the plurality of spoken words; and applying a speaker clustering algorithm to the plurality of speaker segments, wherein the speaker clustering algorithm associates an identifier of one of the human speakers with each of the speaker segments.

9. The information handling system of claim 8 wherein the actions further comprise: generating a textual transcript of the audio stream by outputting each of the speaker segments and the identifier of the associated human speaker.

10. The information handling system of claim 8 wherein the actions further comprise: ingesting the textual transcript into a question answering (QA) system corpus.

11. The information handling system of claim 8 wherein the actions further comprise: identifying a plurality of sets of vocal qualities from the audio stream, wherein each of the sets of vocal qualities corresponds to a different one of the plurality of human speakers; comparing the plurality of sets of vocal qualities to each of the plurality of spoken words; and associating one of the human speakers to each of the words based on the comparison.

12. The information handling system of claim 11 wherein a change from a first of the plurality of human speaker to a second of the plurality of human speakers is limited to word boundaries found in the plurality of spoken words.

13. The information handling system of claim 8 wherein the actions further comprise: associating a first word from the plurality of spoken words to a first set of vocal qualities; identifying a second word from the plurality of spoken words that is successive to the first word and corresponds to a second set of vocal qualities; inserting a speaker change mark between the first word and the second word in response to determining that the first set of vocal qualities is different from the second set of vocal qualities; adjusting a speaker change probability value in response to determining that the first word is at an end of a question; and maintaining the speaker change mark between the first word and the second word based on the adjusted speaker change probability value.

14. The information handling system of claim 13 wherein the actions further comprise: analyzing a selected one of the speaker segments corresponding to the first word using a language model, wherein the analysis: increases the speaker change probability value in response to the selected speaker segment indicating a statement; increases the speaker change probability value in response to the selected speaker segment indicating a reply; and decreases the speaker change probability value in response to the selected speaker segment indicating a continuation of a previous speaker segment; and identifying the second word based on the speaker change probability value and the comparison of the second word to the first set of vocal qualities.

15. A computer program product stored in a
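The speaker change logic in claims 6 and 7 (and their system counterparts, claims 13 and 14) can be pictured with a small sketch. The claims do not give numeric values or state the direction of the question adjustment, so everything numeric here is an assumption: the base probability, the step sizes, the threshold, and the choice to raise the probability after a question. The `ScoredWord` type, the scalar `voice` feature, and the `cue` labels are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredWord:
    text: str
    voice: float                 # hypothetical scalar for a set of vocal qualities
    ends_question: bool = False  # True if this word closes a question


def keep_change_mark(
    first: ScoredWord,
    second: ScoredWord,
    cue: Optional[str] = None,   # "statement", "reply", "continuation", or None
    voice_gap: float = 30.0,     # assumed threshold for "different" vocal qualities
    threshold: float = 0.5,      # assumed cut-off for maintaining the mark
) -> bool:
    """Decide whether a speaker change mark between two successive words survives."""
    # A mark is inserted only when the two sets of vocal qualities differ.
    if abs(first.voice - second.voice) < voice_gap:
        return False

    p = 0.6  # assumed starting speaker change probability for a fresh mark

    # Claim 6: adjust when the first word ends a question (direction assumed:
    # a question makes a change of speaker more likely).
    if first.ends_question:
        p += 0.2

    # Claim 7: language-model cues about the segment containing the first word.
    if cue in ("statement", "reply"):
        p += 0.1
    elif cue == "continuation":
        p -= 0.3

    p = min(1.0, max(0.0, p))
    # The mark is maintained only if the adjusted probability is high enough.
    return p >= threshold


asked = ScoredWord("ready", voice=120.0, ends_question=True)
answer = ScoredWord("yes", voice=210.0)
same_voice = ScoredWord("now", voice=125.0)
plain = ScoredWord("and", voice=120.0)
```

With these assumed values, `keep_change_mark(asked, answer)` keeps the mark, while `keep_change_mark(plain, answer, cue="continuation")` drops it because the continuation cue pulls the probability below the threshold.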

Assignees

Inventors

Classifications

  • Speaker identification or verification techniques · CPC title

  • G10L15/26 (Primary)

    Speech to text systems (G10L15/08 takes precedence) · CPC title

  • for comparison or discrimination · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • Training, enrolment or model building · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11120802B2 cover?
An approach is provided that receives an audio stream and utilizes a voice activation detection (VAD) process to create a digital audio stream of voices from at least two different speakers. An automatic speech recognition (ASR) process is applied to the digital stream with the ASR process resulting in the spoken words to which a speaker turn detection (STD) process is applied to identify a num…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?
Published Sep 14, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).