What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 23, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Diarization Driven by Meta-Information Identified in Discussion Content

US2019156835A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2019156835-A1
Application number	US-201715819158-A
Country	US
Kind code	A1
Filing date	Nov 21, 2017
Priority date	Nov 21, 2017
Publication date	May 23, 2019
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An approach is provided that receives an audio stream and utilizes a voice activation detection (VAD) process to create a digital audio stream of voices from at least two different speakers. An automatic speech recognition (ASR) process is applied to the digital stream with the ASR process resulting in the spoken words to which a speaker turn detection (STD) process is applied to identify a number of speaker segments with each speaker segment ending at a word boundary. The STD process analyzes a number of speaker segments using a language model that determines when speaker changes occur. A speaker clustering algorithm is then applied to the speaker segments to associate one of the speakers with each of the speaker segments.
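The four stages named in the abstract (VAD, ASR, STD, speaker clustering) can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: every function name, the per-word `pitch` feature, the punctuation rule standing in for the language model, and the clustering threshold are hypothetical choices made only to keep the example runnable.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    pitch: float  # hypothetical per-word vocal feature (assumption)


def vad(frames):
    """Stand-in VAD: keep only frames already flagged as speech."""
    return [f for f in frames if f["speech"]]


def asr(speech_frames):
    """Stand-in ASR: each speech frame already carries its recognized word."""
    return [Word(f["word"], f["pitch"]) for f in speech_frames]


def speaker_turn_detection(words):
    """Split words into segments that end at word boundaries.

    Toy substitute for the language model: a word ending in '?' or '.'
    closes the current segment.
    """
    segments, current = [], []
    for w in words:
        current.append(w)
        if w.text.endswith(("?", ".")):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def cluster_speakers(segments, threshold=30.0):
    """Greedy clustering on mean pitch (assumed feature and threshold).

    A segment joins the nearest existing cluster if it is within
    `threshold`; otherwise it starts a new cluster.
    """
    centroids, labels = [], []
    for seg in segments:
        mean = sum(w.pitch for w in seg) / len(seg)
        best = min(range(len(centroids)),
                   key=lambda i: abs(centroids[i] - mean),
                   default=None)
        if best is not None and abs(centroids[best] - mean) <= threshold:
            labels.append(best)
        else:
            centroids.append(mean)
            labels.append(len(centroids) - 1)
    return labels


def diarize(frames):
    """Run VAD -> ASR -> STD -> clustering and return labeled segments."""
    words = asr(vad(frames))
    segments = speaker_turn_detection(words)
    labels = cluster_speakers(segments)
    return [(f"speaker_{label}", " ".join(w.text for w in seg))
            for seg, label in zip(segments, labels)]


# Made-up input: one non-verbal frame between two speakers.
example_frames = [
    {"speech": True, "word": "How", "pitch": 120.0},
    {"speech": True, "word": "are", "pitch": 122.0},
    {"speech": True, "word": "you?", "pitch": 118.0},
    {"speech": False, "word": None, "pitch": 0.0},
    {"speech": True, "word": "Fine", "pitch": 210.0},
    {"speech": True, "word": "thanks.", "pitch": 205.0},
    {"speech": True, "word": "Really?", "pitch": 119.0},
]
```

Calling `diarize(example_frames)` drops the non-verbal frame, splits the remaining words into three segments, and assigns the first and third segments to the same speaker label because their mean pitch values are close.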

First claim

Opening claim text (preview).

1. A method implemented by an information handling system that includes a memory and a processor, the method comprising: receiving an audio stream that comprises both a plurality of speech segments corresponding to a plurality of human speakers and a plurality of non-verbal segments; utilizing a voice activation detection (VAD) process on the audio stream, wherein an output of the VAD process is a digital audio stream of voices corresponding to the plurality of speech segments; applying an automatic speech recognition (ASR) process to the digital stream, wherein the ASR process results in a plurality of spoken words; inputting the VAD process output into an automatic speech recognition (ASR) process, wherein an output of the ASR process comprises a plurality of spoken words corresponding to the plurality of speech segments and is devoid of the plurality of non-verbal segments inputting the ASR process output to a speaker turn detection (STD) process to the plurality of spoken words, wherein a plurality of speaker segments of contiguous words are selected from the plurality of spoken words and analyzed by a language model that determines when a plurality of speaker changes occur; and applying a speaker clustering algorithm to the plurality of speaker segments, wherein the speaker clustering algorithm associates an identifier of one of the human speakers with each of the speaker segments.

2. The method of claim 1 further comprising: associating a first word from the plurality of spoken words to a first set of vocal qualities; identifying a second word from the plurality of spoken words that is successive to the first word and corresponds to a second set of vocal qualities; inserting a speaker change mark between the first word and the second word in response to determining that the first set of vocal qualities is different from the second set of vocal qualities; analyzing, during the STD process, a selected one of the speaker segments corresponding to the first word based on a set of previous speaker segments, wherein the selected speaker segment is a question; calculating a speaker change value based on the language model analysis; and in response to determining that a speaker change occurs based on the speaker change value, maintaining the speaker change mark between the first word and the second word.

3. The method of claim 2 further comprising: increasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a statement; increasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a reply; and decreasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a continuation of one or more of the previous speaker segments.

4. The method of claim 1 further comprising: appending the selected speaker segment to the set of previous speaker segments; selecting a next one of the speaker segments; analyzing the selected next speaker segment during the STD process based on the set of previous speaker segment that now includes the selected speaker segment; analyzing a second speaker change value based on the language model analysis; and determining whether a second speaker change occurs based on the second speaker change value.

5. The method of claim 1 further comprising: identifying a plurality of sets of vocal qualities from the audio stream, wherein each of the sets of vocal qualities corresponds to a different one of the plurality of human speakers; comparing the plurality of sets of vocal qualities to each of the plurality of spoken words; and associating one of the human speakers to each of the words based on the comparison.

6. The method of claim 5 further comprising: identifying the speaker change occurs; failing to detect a corresponding speaker change based on the sets of vocal qualities from the audio stream; and reanalyzing a location of the identified speaker change occurrence with respect to a vocal quality change found between two successive words at the location; and determining whether the speaker change occurred at the location based on the reanalysis.

7. The method of claim 1 further comprising: generating a transcript of the audio stream that includes the plurality of speaker segments and an association of each of the speaker segments to one of the human speakers; and ingesting the transcript into a corpus utilized by a question answering (QA) system.

8. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; and a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions of: receiving an audio stream that comprises both a plurality of speech segments corresponding to a plurality of human speakers and a plurality of non-verbal segments; utilizing a voice activation detection (VAD) process on the audio stream, wherein an output of the VAD process is a digital audio stream of voices corresponding to the plurality of speech segments; applying an automatic speech recognition (ASR) process to the digital stream, wherein the ASR process results in a plurality of spoken words; inputting the VAD process output into an automatic speech recognition (ASR) process, wherein an output of the ASR process comprises a plurality of spoken words corresponding to the plurality of speech segments and is devoid of the plurality of non-verbal segments inputting the ASR process output to a speaker turn detection (STD) process to the plurality of spoken words, wherein a plurality of speaker segments of contiguous words are selected from the plurality of spoken words and analyzed by a language model that determines when a plurality of speaker changes occur; and applying a speaker clustering algorithm to the plurality of speaker segments, wherein the speaker clustering algorithm associates an identifier of one of the human speakers with each of the speaker segments.

9. The information handling system of claim 8 wherein the actions further comprise: associating a first word from the plurality of spoken words to a first set of vocal qualities; identifying a second word from the plurality of spoken words that is successive to the first word and corresponds to a second set of vocal qualities; inserting a speaker change mark between the first word and the second word in response to determining that the first set of vocal qualities is different from the second set of vocal qualities; analyzing, during the STD process, a selected one of the speaker segments corresponding to the first word based on a set of previous speaker segments, wherein the selected speaker segment is a question; calculating a speaker change value based on the language model analysis; and in response to determining that a speaker change occurs based on the speaker change value, maintaining the speaker change mark between the first word and the second word.

10. The information handling system of claim 9 wherein the actions further comprise: increasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a statement; increasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a reply; and decreasing the speaker change value in response to the language model analysis revealing that the selected speaker segment is a continuation of one or more of the previous speaker segments.

11. The information handling system of
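The speaker change value described in claims 2 and 3 can be illustrated with a small scoring sketch. The claims state only the direction of each adjustment (up for a statement or a reply, down for a continuation); the base value, the weights, the threshold, the keyword-based classifier standing in for the language model, and the choice to score the boundary in front of each segment are all assumptions made for this example.

```python
# Assumed adjustment sizes; the claims give only the sign of each change.
ADJUSTMENTS = {"statement": 0.2, "reply": 0.4, "continuation": -0.4}

# Hypothetical cue words suggesting the same speaker is carrying on.
CONTINUATION_CUES = {"and", "but", "so", "because"}


def classify_segment(segment, previous):
    """Toy stand-in for the language model analysis of one segment."""
    words = segment.strip().lower().split()
    if not words:
        return "statement"
    if previous and words[0] in CONTINUATION_CUES:
        return "continuation"
    if previous and previous[-1].strip().endswith("?"):
        return "reply"
    if words[-1].endswith("?"):
        return "question"
    return "statement"


def speaker_change_value(segment, previous, base=0.5):
    """Start from an assumed base value and adjust it by segment type."""
    kind = classify_segment(segment, previous)
    return base + ADJUSTMENTS.get(kind, 0.0)


def keep_change_marks(segments, threshold=0.5):
    """Decide, per segment, whether a tentative speaker change mark is kept.

    Each analyzed segment is appended to the set of previous segments
    before the next one is scored, as in claim 4.
    """
    previous, decisions = [], []
    for seg in segments:
        decisions.append(speaker_change_value(seg, previous) > threshold)
        previous.append(seg)
    return decisions


example_segments = [
    "Where is the report?",
    "It is on the shared drive.",
    "and the slides are there too",
    "Thanks for checking.",
]
```

With these assumed weights, `keep_change_marks(example_segments)` keeps the marks in front of the reply and the final statement and drops the one in front of the continuation.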

Assignees

Inventors

Classifications

G10L17/06
Decision making techniques; Pattern matching strategies · CPC title
G06F40/35
Discourse or dialogue representation · CPC title
G10L17/04
Training, enrolment or model building · CPC title
G10L15/26 · Primary
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L15/30
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

Patent family

Related publications grouped by family.

View patent family 66532452

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019156835A1 cover?: An approach is provided that receives an audio stream and utilizes a voice activation detection (VAD) process to create a digital audio stream of voices from at least two different speakers. An automatic speech recognition (ASR) process is applied to the digital stream with the ASR process resulting in the spoken words to which a speaker turn detection (STD) process is applied to identify a num…
Who is the assignee on this patent?: IBM

Related patents

Speaker identification assisted by categorical cues

Blind diarization of recorded calls with arbitrary number of speakers

Automatic Question Generation and Answering Based on Monitored Messaging Sessions

Word-level blind diarization of recorded calls with arbitrary number of speakers

Call flow and discourse analysis