Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier

US10902843B2 · US · B2

Patent metadata
Publication number: US-10902843-B2
Application number: US-201916684970-A
Country: US
Kind code: B2
Filing date: Nov 15, 2019
Priority date: Dec 14, 2016
Publication date: Jan 26, 2021
Grant date: Jan 26, 2021

Abstract

Official abstract text for this publication.

Audio features, such as perceptual linear prediction (PLP) features and time derivatives thereof, are extracted from frames of training audio data including speech by multiple speakers, and silence, such as by using linear discriminant analysis (LDA). The frames are clustered into k-means clusters using distance measures, such as Mahalanobis distance measures, of means and variances of the extracted audio features of the frames. A recurrent neural network (RNN) is trained on the extracted audio features of the frames and cluster identifiers of the k-means clusters into which the frames have been clustered. The RNN is applied to audio data to segment audio data into segments that each correspond to one of the cluster identifiers. Each segment can be assigned a label corresponding to one of the cluster identifiers. Speech recognition can be performed on the segments.
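The clustering step in the abstract can be illustrated with a small sketch. This is not the patented implementation. It assumes a diagonal covariance per cluster, so the Mahalanobis distance reduces to a variance-scaled Euclidean distance, and it uses a deterministic, evenly spaced initialization. The function names `diag_mahalanobis` and `kmeans_mahalanobis` are hypothetical, and plain lists of floats stand in for the PLP-plus-derivative feature vectors.

```python
import math


def diag_mahalanobis(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def kmeans_mahalanobis(frames, k, iterations=20, var_floor=1e-6):
    """Cluster frame feature vectors into k clusters.

    Returns (labels, means, variances). Each cluster keeps a mean and a
    per-dimension variance; both are assumptions made for illustration.
    """
    n, dim = len(frames), len(frames[0])
    # Assumed initialization: pick k evenly spaced frames as starting means.
    means = [list(frames[(i * n) // k]) for i in range(k)]
    variances = [[1.0] * dim for _ in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest cluster by Mahalanobis distance.
        new_labels = [
            min(range(k), key=lambda c: diag_mahalanobis(f, means[c], variances[c]))
            for f in frames
        ]
        if new_labels == labels:
            break  # assignments are stable, so the statistics are too
        labels = new_labels
        # Update step: recompute mean and variance of each non-empty cluster.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if not members:
                continue
            cols = list(zip(*members))
            means[c] = [sum(col) / len(col) for col in cols]
            variances[c] = [
                max(sum((v - m) ** 2 for v in col) / len(col), var_floor)
                for col, m in zip(cols, means[c])
            ]
    return labels, means, variances
```

The resulting cluster identifiers are what the abstract describes as training targets for the RNN, paired frame by frame with the extracted features. The variance floor is a safeguard added here so that a cluster with identical members does not cause a division by zero.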

First claim

Opening claim text (preview).

We claim:

1. A computing system comprising: a processor; and a storage device to store audio data including speech by a plurality of speakers, and silence, the storage device storing computer-executable code that the processor is to execute to: segment the audio data using a recurrent neural network (RNN) to identify a plurality of change points of the audio data that divide the audio data into a plurality of segments, each change point being a transition from one of a plurality of speech feature cluster identifiers to a different one of the speech feature cluster identifiers.

2. The computing system of claim 1, wherein the speech feature cluster identifiers correspond to a plurality of k-means clusters into which a plurality of frames of training audio data has been clustered.

3. The computing system of claim 2, wherein the frames of the training audio data have been clustered into a predetermined number of the k-means clusters using Mahalanobis distance measures based on extracted audio features of the frames.

4. The computing system of claim 3, wherein the extracted audio features comprise perceptual linear prediction (PLP) features of the frames and time derivatives of the PLP features.

5. The computing system of claim 4, wherein the PLP features have been extracted from the frames via a linear discriminant analysis (LDA) of the frames.

6. The computing system of claim 3, wherein the frames have been clustered into the predetermined number of the k-means clusters using the Mahalanobis distance measures of means and variances of the extracted audio features of the frames.

7. The computing system of claim 1, wherein the processor is to execute the computer-executable code to further: assign a label selected from a group of labels to each segment of the audio data using the RNN, the group of labels comprising labels corresponding to the speech feature cluster identifiers.

8. The computing system of claim 1, wherein the audio data further includes music.

9. The computing system of claim 1, wherein the audio data comprises a plurality of frames, and wherein the processor is to segment the audio data by: while sequentially proceeding through the frames of the audio data, assigning a label selected from a group of labels to each frame of the audio data using the RNN, the group of labels comprising labels corresponding to the speech feature cluster identifiers; in response to assigning the label to a current frame of the audio data that is different than the label assigned to a preceding frame of the audio data, identifying a current change point.

10. The computing system of claim 9, wherein the processor is to segment the audio data by further: in response to assigning the label to the current frame that is different than the label assigned to the preceding frame, demarcating an end of a preceding segment of the audio data at the current change point, the preceding segment having a start that a preceding change point demarcates.

11. The computing system of claim 1, wherein speaker diarization in which the segmentation of the audio data occurs is performed separately from the speech recognition, improving performance of the computing system by permitting the speech recognition to be performed on an already identified segment of the audio data while the speaker diarization is identifying a next segment of the audio data.

12. The computing system of claim 11, wherein the speaker diarization is performed concurrently or simultaneously with the speech recognition.

13. A computer program product comprising a computer-readable storage medium having program instructions embodied therewith, wherein the computer-readable storage medium is not a transitory signal per se, the program instructions executed by a computing device to: apply a recurrent neural network (RNN) model to audio data including speech by a plurality of speakers, and silence, application of the RNN model to the audio data segmenting the audio data into a plurality of segments, each segment corresponding to one of a plurality of speech feature cluster identifiers.

14. The computer program product of claim 13, wherein the speech feature cluster identifiers correspond to a plurality of k-means clusters into which a plurality of frames of training audio data has been clustered via a linear discriminant analysis (LDA) of the frames.

15. The computer program product of claim 14, wherein the frames of the training audio data have been clustered into a predetermined number of the k-means clusters using Mahalanobis distance measures of means and variances of the extracted audio features of the frames.

16. The computer program product of claim 15, wherein the extracted audio features comprise perceptual linear prediction (PLP) features of the frames and time derivatives of the PLP features.

17. The computer program product of claim 13, wherein the application of the RNN model to the audio data assigns a label selected from a group of labels to each segment of the audio data, the group of labels corresponding to the speech feature cluster identifiers.

18. The computer program product of claim 13, wherein the audio data further includes music.

19. A method comprising: training, by a computing system, a recurrent neural network (RNN) on a plurality of extracted audio features of a plurality of frames and cluster identifiers of k-means clusters into which the frames have been clustered; and applying, by the computing system, the RNN to audio data to segment the audio data into a plurality of segments, each segment corresponding to one of the cluster identifiers.

20. The method of claim 19, wherein extracting the audio features from the frames of the training audio data comprises extracting perceptual linear prediction (PLP) features of the frames and time derivatives of the PLP features using, linear discriminant analysis (LDA) of the frames, and wherein clustering the frames into the k-means clusters using the distance measures comprises clustering the frames into a predetermined number of the k-means clusters using Mahalanobis distance measures of the means and the variances of the extracted audio features of the frames.
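The frame-by-frame segmentation recited in claims 9 and 10 can be sketched as follows. This is an illustration only, not the claimed system: the per-frame labels would come from the RNN, but here a precomputed list of cluster identifiers stands in for its output. The function names `find_change_points` and `segments_from_labels`, and the choice of half-open `(start, end, label)` frame ranges, are assumptions.

```python
from typing import List, Sequence, Tuple


def find_change_points(frame_labels: Sequence[int]) -> List[int]:
    """Indices of frames whose label differs from the preceding frame's label."""
    return [
        i for i in range(1, len(frame_labels))
        if frame_labels[i] != frame_labels[i - 1]
    ]


def segments_from_labels(frame_labels: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Split frames into (start, end, label) segments, with end exclusive.

    Each change point closes the preceding segment and opens the next one.
    """
    segments: List[Tuple[int, int, int]] = []
    if not frame_labels:
        return segments
    start = 0
    for i in range(1, len(frame_labels)):
        if frame_labels[i] != frame_labels[i - 1]:
            # The label changed: the preceding segment ends at this change point.
            segments.append((start, i, frame_labels[start]))
            start = i
    # Close the final segment at the end of the audio.
    segments.append((start, len(frame_labels), frame_labels[start]))
    return segments


# Example with made-up per-frame cluster identifiers.
example_labels = [2, 2, 2, 0, 0, 1, 1, 1, 1]
example_change_points = find_change_points(example_labels)   # [3, 5]
example_segments = segments_from_labels(example_labels)      # [(0, 3, 2), (3, 5, 0), (5, 9, 1)]
```

Because each segment is complete as soon as its closing change point is found, a downstream speech recognizer could start on an earlier segment while later frames are still being labeled, which is the separation claims 11 and 12 describe.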

Assignees

Inventors

Classifications

  • G10L15/04 (Primary)

    Segmentation; Word boundary detection · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • using neural networks · CPC title

  • Artificial neural networks; Connectionist approaches · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10902843B2 cover?
Audio features, such as perceptual linear prediction (PLP) features and time derivatives thereof, are extracted from frames of training audio data including speech by multiple speakers, and silence, such as by using linear discriminant analysis (LDA). The frames are clustered into k-means clusters using distance measures, such as Mahalanobis distance measures, of means and variances of the extr…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G10L15/04. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 26, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).