Systems and methods for generating a singular voice audio stream

US11328722B2 · US · B2

Patent metadata
Publication number: US-11328722-B2
Application number: US-202016788067-A
Country: US
Kind code: B2
Filing date: Feb 11, 2020
Priority date: Feb 11, 2020
Publication date: May 10, 2022
Grant date: May 10, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An electronic device associated with a media-providing service receives a first set of audio streams corresponding to a plurality of microphones. The electronic device generates a second set of audio streams from the first set of audio streams. The second set of audio streams corresponds to a plurality of independent voices and in some cases, ambient noise. The electronic device detects a beginning of a voice command to play media content from the media-providing service in a first audio stream. The electronic device also detects an end of the voice command in the first audio stream. The end of the voice command overlaps with speech in a second audio stream in the second set of audio streams. In response to detecting the voice command, the electronic device plays the media content from the media-providing service.
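
The abstract's central point is that the end of the command is found in the speaker's own separated stream, even while another separated stream still contains speech. A minimal sketch of that idea follows. It assumes per-frame energy values have already been computed for each separated stream; the threshold, the minimum pause length, the function name, and the sample values are all hypothetical and do not come from the patent.

```python
from typing import List, Optional

SPEECH_THRESHOLD = 0.1  # hypothetical per-frame energy threshold
MIN_PAUSE_FRAMES = 3    # hypothetical pause length that ends a command


def find_command_end(energies: List[float], begin: int,
                     threshold: float = SPEECH_THRESHOLD,
                     min_pause: int = MIN_PAUSE_FRAMES) -> Optional[int]:
    """Return the index of the first frame of the pause that ends the command.

    Only the command speaker's separated stream is inspected, so speech in
    other streams cannot postpone the end point. Returns None if no pause of
    at least `min_pause` frames is found.
    """
    quiet = 0
    for i in range(begin, len(energies)):
        if energies[i] > threshold:
            quiet = 0
        else:
            quiet += 1
            if quiet == min_pause:
                return i - min_pause + 1
    return None


# Per-frame energies for two separated voices (illustrative values only).
command_voice = [0.0, 0.8, 0.9, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0]
other_voice = [0.0, 0.0, 0.0, 0.5, 0.7, 0.8, 0.9, 0.8, 0.7, 0.6]

# End point found in the separated command stream.
end = find_command_end(command_voice, begin=1)
overlaps_other_speech = end is not None and other_voice[end] > SPEECH_THRESHOLD

# For comparison: the same detector on the unseparated mixture never sees a pause.
mixed = [a + b for a, b in zip(command_voice, other_voice)]
end_if_mixed = find_command_end(mixed, begin=1)

print(end, overlaps_other_speech, end_if_mixed)
```

In this toy data the second voice keeps talking through the pause, so a detector running on the mixture finds no end point, while the detector on the separated stream does.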

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: at an electronic device associated with a media-providing service, the electronic device having one or more processors, and memory storing instructions for execution by the one or more processors: receiving a first set of audio streams, each audio stream in the first set of audio streams corresponding to a respective microphone of a plurality of microphones, wherein each audio stream of the first set of audio streams includes audio from a plurality of independent voices; generating a second set of audio streams by performing blind source separation on a rolling window applied to the first set of audio streams, the rolling window having a length of time that captures a wake word, wherein each audio stream in the second set of audio streams corresponds to a respective independent voice of the plurality of independent voices; detecting, in a first audio stream of the second set of audio streams: the wake word; a beginning of a voice command to play media content from the media-providing service; and an end of the voice command, wherein the end of the voice command overlaps in time with speech in a second audio stream of the second set of audio streams; and in response to detecting the voice command, playing the media content from the media-providing service.

2. The method of claim 1, wherein: the plurality of microphones is dynamically adjusted to produce distinct beamforming arrays.

3. The method of claim 2, wherein the first set of audio streams is received from the beamforming arrays of the dynamically adjusted microphones.

4. The method of claim 1, wherein generating the second set of audio streams comprises identifying statistically independent signals in the first set of audio streams.

5. The method of claim 4, wherein generating the second set of audio streams further comprises performing independent component analysis (ICA) on the first set of audio streams.

6. The method of claim 1, wherein generating the second set of audio streams comprises performing independent component analysis (ICA) on each audio stream of the first set of audio streams in real-time.

7. The method of claim 1, wherein: detecting the end of the voice command includes, while detecting speech content in a second audio stream of the second set of audio streams, detecting a pause in speech in the first audio stream of the second set of audio streams.

8. The method of claim 1, further comprising: storing, as a training set, the second set of audio streams corresponding to respective independent voices.

9. The method of claim 8, further comprising: generating a voice-specific filter for each of the respective independent voices of the second set of audio streams.

10. The method of claim 9, further comprising: applying the generated voice specific filter for a respective audio stream of the second set of audio streams corresponding to the respective independent voice to determine the voice command.

11. The method of claim 1, wherein the electronic device is a first electronic device, and wherein the playing the media content from the media-providing service further comprises playing the media content at a second electronic device distinct from the first electronic device.

12. A non-transitory computer-readable storage medium storing one or more programs configured for execution by an electronic device associated with a media-providing service, the electronic device having one or more processors, the one or more programs including instructions, which when executed by the one or more processors, cause the electronic device to: receive a first set of audio streams, each audio stream in the first set of audio streams corresponding to a respective microphone of a plurality of microphones, wherein each audio stream of the first set of audio streams includes audio from a plurality of independent voices; generate a second set of audio streams by performing blind source separation on a rolling window applied to the first set of audio streams, the rolling window having a length of time that captures a wake word, wherein each audio stream in the second set of audio streams corresponds to a respective independent voice of the plurality of independent voices; detect, in a first audio stream of the second set of audio streams: the wake word; a beginning of a voice command to play media content from the media-providing service; and an end of the voice command, wherein the end of the voice command overlaps in time with speech in a second audio stream of the second set of audio streams; and in response to detecting the voice command, play the media content from the media-providing service.

13. An electronic device associated with a media-providing service, comprising: one or more processors; and memory storing one or more programs, the one or more programs including instructions, which when executed by the one or more processors, cause the electronic device to: receive a first set of audio streams, each audio stream in the first set of audio streams corresponding to a respective microphone of a plurality of microphones, wherein each audio stream of the first set of audio streams includes audio from a plurality of independent voices; generate a second set of audio streams by performing blind source separation on a rolling window applied to the first set of audio streams, the rolling window having a length of time that captures a wake word, wherein each audio stream in the second set of audio streams corresponds to a respective independent voice of the plurality of independent voices; detect, in a first audio stream of the second set of audio streams: the wake word; a beginning of a voice command to play media content from the media-providing service; and an end of the voice command, wherein the end of the voice command overlaps in time with speech in a second audio stream of the second set of audio streams; and in response to detecting the voice command, play the media content from the media-providing service.

14. The electronic device of claim 13, wherein: the plurality of microphones is dynamically adjusted to produce distinct beamforming arrays.

15. The electronic device of claim 14, wherein the first set of audio streams is received from the beamforming arrays of the dynamically adjusted microphones.

16. The electronic device of claim 13, wherein generating the second set of audio streams comprises identifying statistically independent signals in the first set of audio streams.

17. The electronic device of claim 16, wherein generating the second set of audio streams further comprises performing independent component analysis (ICA) on the first set of audio streams.

18. The electronic device of claim 13, wherein generating the second set of audio streams comprises performing independent component analysis (ICA) on each audio stream of the first set of audio streams in real-time.

19. The electronic device of claim 13, wherein: detecting the end of the voice command includes, while detecting speech content in a second audio stream of the second set of audio streams, detecting a pause in speech in the first audio stream of the second set of audio streams.

20. The electronic device of claim 13, the one or more programs further including instructions, which when executed by the one or more processors, cause the device to: store, as a training set, the second set of audio streams corresponding to respective independent voices.
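
The independent claims recite blind source separation on a rolling window, and several dependent claims name independent component analysis (ICA) as one way to do it. The sketch below illustrates that combination for the simplest case of two microphones and two voices. It is a textbook simplification (whitening followed by a grid search for the rotation whose outputs are most non-Gaussian by excess kurtosis), not the patent's implementation and not a production algorithm such as FastICA. The window length, hop size, mixing weights, and synthetic signals are all hypothetical stand-ins.

```python
import math
import random
from typing import Iterator, List, Sequence, Tuple

Signal = List[float]


def center(x: Sequence[float]) -> Signal:
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def rotate(y1: Signal, y2: Signal, phi: float) -> Tuple[Signal, Signal]:
    c, s = math.cos(phi), math.sin(phi)
    return ([c * p + s * q for p, q in zip(y1, y2)],
            [-s * p + c * q for p, q in zip(y1, y2)])


def whiten(x1: Signal, x2: Signal) -> Tuple[Signal, Signal]:
    """Decorrelate two centered channels and scale each to unit variance."""
    n = len(x1)
    a = sum(v * v for v in x1) / n
    d = sum(v * v for v in x2) / n
    b = sum(p * q for p, q in zip(x1, x2)) / n
    theta = 0.5 * math.atan2(2 * b, a - d)  # principal axes of the 2x2 covariance
    y1, y2 = rotate(x1, x2, theta)
    s1 = math.sqrt(sum(v * v for v in y1) / n) or 1.0  # guard against a silent channel
    s2 = math.sqrt(sum(v * v for v in y2) / n) or 1.0
    return [v / s1 for v in y1], [v / s2 for v in y2]


def excess_kurtosis(z: Signal) -> float:
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return sum(v ** 4 for v in z) / len(z) - 3.0


def separate_window(mic1: Signal, mic2: Signal, steps: int = 90) -> Tuple[Signal, Signal]:
    """Two-channel ICA: whiten, then keep the rotation with the most non-Gaussian outputs."""
    y1, y2 = whiten(center(mic1), center(mic2))
    angles = (k * math.pi / (2 * steps) for k in range(steps))
    best = max(angles, key=lambda phi: sum(abs(excess_kurtosis(z))
                                           for z in rotate(y1, y2, phi)))
    return rotate(y1, y2, best)


def rolling_windows(total: int, window: int, hop: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of overlapping windows."""
    for start in range(0, total - window + 1, hop):
        yield start, start + window


def correlation(u: Signal, v: Signal) -> float:
    u, v = center(u), center(v)
    num = sum(p * q for p, q in zip(u, v))
    return num / math.sqrt(sum(p * p for p in u) * sum(q * q for q in v))


# Synthetic stand-ins for two independent voices (not real speech).
rng = random.Random(0)
n = 6000
voice_a = [math.sin(0.07 * t) for t in range(n)]
voice_b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Each microphone hears both voices with different (hypothetical) weights.
mic1 = [1.0 * a + 0.6 * b for a, b in zip(voice_a, voice_b)]
mic2 = [0.4 * a + 1.0 * b for a, b in zip(voice_a, voice_b)]

WINDOW = 3000  # hypothetical: long enough to contain a wake word
HOP = 1500     # hypothetical hop between successive windows
for start, end in rolling_windows(n, WINDOW, HOP):
    out1, out2 = separate_window(mic1[start:end], mic2[start:end])
    reference = voice_a[start:end]
    match = max(abs(correlation(out1, reference)), abs(correlation(out2, reference)))
    print(f"samples {start}-{end}: best |correlation| with voice A = {match:.3f}")
```

ICA recovers sources only up to order and sign, which is why the check takes the better of the two outputs and uses the absolute correlation. A downstream wake-word detector would presumably have to examine each separated stream rather than assume a fixed channel for a given speaker.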

Assignees

Spotify AB

Inventors

Classifications

  • Voice signal separating · CPC title

  • Execution procedure of a spoken command · CPC title

  • G10L25/87 (Primary)

    Detection of discrete points within a voice signal · CPC title

  • Management of the audio stream, e.g. setting of volume, audio stream path · CPC title

  • Word spotting · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11328722B2 cover?
An electronic device associated with a media-providing service receives a first set of audio streams corresponding to a plurality of microphones. The electronic device generates a second set of audio streams from the first set of audio streams. The second set of audio streams corresponds to a plurality of independent voices and in some cases, ambient noise. The electronic device detects a begin…
Who is the assignee on this patent?
Spotify AB
What technology area does this patent fall under?
Primary CPC classification G10L25/87. Mapped technology areas include Physics.
When was this patent published?
Publication date May 10, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).