Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G10L15/26. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 16, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Audio-visual diarization to identify meeting attendees

US11875796B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11875796-B2
Application number	US-201916399081-A
Country	US
Kind code	B2
Filing date	Apr 30, 2019
Priority date	Apr 30, 2019
Publication date	Jan 16, 2024
Grant date	Jan 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method includes receiving information streams on a meeting server from a set of multiple distributed devices included in a meeting, receiving audio signals representative of speech by at least two users in at least two of the information streams, receiving at least one video signal of at least one user in the information streams, associating a specific user with speech in the received audio signals as a function of the received audio and video signals, and generating a transcript of the meeting with an indication of the specific user associated with the speech.
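The abstract describes associating a user with speech "as a function of the received audio and video signals". As a rough illustration only, the sketch below combines a per-user audio score and a per-user video score with a fixed weight and labels each transcript line with the best-scoring user. The `Segment` type, the score dictionaries, the `audio_weight` parameter, and the weighted-sum rule are all assumptions made for this example; the patent does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One recognized stretch of speech (hypothetical structure)."""
    start: float   # seconds from meeting start
    end: float
    text: str      # recognized words for this segment
    # Assumed per-user evidence in the range 0..1, e.g. voice similarity
    # (audio) and visible lip activity (video). Not defined by the patent.
    audio_scores: dict = field(default_factory=dict)
    video_scores: dict = field(default_factory=dict)


def attribute_speaker(seg, audio_weight=0.6):
    """Return the user with the highest weighted audio + video score."""
    users = set(seg.audio_scores) | set(seg.video_scores)
    if not users:
        return "unknown"

    def combined(user):
        audio = seg.audio_scores.get(user, 0.0)
        video = seg.video_scores.get(user, 0.0)
        return audio_weight * audio + (1.0 - audio_weight) * video

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(users), key=combined)


def build_transcript(segments):
    """Format segments in time order, each prefixed with its speaker."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        speaker = attribute_speaker(seg)
        lines.append(f"[{seg.start:.1f}-{seg.end:.1f}] {speaker}: {seg.text}")
    return lines


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "hello everyone",
                {"alice": 0.5, "bob": 0.5}, {"alice": 0.9, "bob": 0.1}),
        Segment(3.0, 4.0, "agreed", {"alice": 0.2, "bob": 0.8}, {}),
    ]
    for line in build_transcript(demo):
        print(line)
```

In the first demo segment the audio evidence alone is ambiguous, and the video score decides the attribution. That is the kind of case where a second modality helps.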

First claim

Opening claim text (preview).

The invention claimed is:
1. A computer implemented method comprising: receiving information streams on a meeting server from a set of multiple distributed devices included in a meeting; receiving audio signals representative of overlapped speech by at least two users in at least two of the information streams wherein at least one of the received audio signals received from the set of multiple distributed devices is from a first mobile device of a first user; receiving at least one video signal of at least one user in the information streams; associating specific users with their respective speech in the received audio signals as a function of the received audio and video signals by providing a fusion of the audio signals and video signal to a model to generate speaker and word hypotheses for each audio signal and use such hypotheses to provide separate audio streams for each user, each separate audio stream containing only speech from one of the users with a user ID; and generating a transcript of the meeting from the separate audio streams with an indication of the specific users associated with the overlapped speech.
2. The method of claim 1 wherein the multiple distributed devices comprise wireless devices associated with users in the meeting and wherein the model comprises a fusion model.
3. The method of claim 2 wherein the first mobile device includes a camera and provides the at least one video signal.
4. The method of claim 3 wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking.
5. The method of claim 3 wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user associated with the first mobile device as speaking.
6. The method of claim 1 wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone providing one of the received audio signals.
7. The method of claim 6 wherein one of the at least one video signal that is received having a field of view configured to include multiple users in the meeting and provide the at least one video signal.
8. The method of claim 1 wherein the multiple distributed devices include a fixed camera having a view of one or more users in the meeting.
9. A machine-readable storage device having instructions for execution by a processor of a machine to cause the processor to perform operations to perform a method, the operations comprising: receiving information streams on a meeting server from a set of multiple distributed devices included in a meeting; receiving audio signals representative of overlapped speech by at least two users in at least two of the information streams wherein at least one of the received audio signals received from the set of multiple distributed devices is from a first mobile device of a first user; receiving at least one video signal of at least one user in the information streams; associating specific users with their respective speech in the received audio signals as a function of the received audio and video signals by providing a fusion of the audio signals and video signal to a model to generate speaker and word hypotheses for each audio signal and use such hypotheses to provide separate audio streams for each user, each separate audio stream containing only speech from one of the users with a user ID; and generating a transcript of the meeting from the separate audio streams with an indication of the specific users associated with the overlapped speech.
10. The device of claim 9 wherein a fusion model is used on the received audio and video signals to associate the specific user with the speech.
11. The device of claim 9 wherein the multiple distributed devices comprise wireless devices associated with users in the meeting.
12. The device of claim 11 wherein the first mobile device includes a camera and provides the at least one video signal.
13. The device of claim 12 wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking.
14. The device of claim 13 wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user associated with the first mobile device as speaking.
15. The device of claim 9 wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone providing one of the received audio signals.
16. The device of claim 15 wherein one of the at least one video signal that is received having a field of view configured to include multiple users in the meeting and provide the at least one video signal.
17. A device comprising: a processor; and a memory device coupled to the processor and having a program stored thereon for execution by the processor to perform operations comprising: receiving information streams on a meeting server from a set of multiple distributed devices included in a meeting; receiving audio signals representative of overlapped speech by at least two users in at least two of the information streams wherein at least one of the received audio signals received from the set of multiple distributed devices is from a first mobile device of a first user; receiving at least one video signal of at least one user in the information streams; associating specific users with their respective speech in the received audio signals as a function of the received audio and video signals by providing a fusion of the audio signals and video signal to a model to generate speaker and word hypotheses for each audio signal and use such hypotheses to provide separate audio streams for each user, each separate audio stream containing only speech from one of the users with a user ID; and generating a transcript of the meeting from the separate audio streams with an indication of the specific users associated with the overlapped speech.
18. The device of claim 17 wherein a fusion model is used on the received audio and video signals to associate the specific user with the speech and wherein the multiple distributed devices comprise wireless devices associated with users in the meeting, wherein the first mobile device includes a camera and provides the at least one video signal, and wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking.
19. The device of claim 18 wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user associated with the first mobile device as speaking.
20. The device of claim 17 wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone providing one of the received audio signals.
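Claim 1 ends with two steps that are easy to lose in the legal wording: word hypotheses tagged with a user ID are split into one stream per user, and the transcript is then built from those streams while indicating who spoke during overlapped speech. The sketch below illustrates only those two steps. It assumes the fusion model has already produced timed, user-tagged words; the `WordHypothesis` type, the `max_gap` pause threshold, and the "(overlapping)" marker are hypothetical choices for this example and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    """A timed word already attributed to a user (assumed model output)."""
    user_id: str
    word: str
    start: float   # seconds
    end: float


def split_by_user(hypotheses):
    """Build one time-ordered word stream per user ID."""
    streams = defaultdict(list)
    for hyp in sorted(hypotheses, key=lambda h: h.start):
        streams[hyp.user_id].append(hyp)
    return dict(streams)


def to_utterances(stream, max_gap=0.5):
    """Group one user's words into utterances split at pauses > max_gap."""
    utterances = []
    for hyp in stream:
        if utterances and hyp.start - utterances[-1]["end"] <= max_gap:
            utterances[-1]["words"].append(hyp.word)
            utterances[-1]["end"] = max(utterances[-1]["end"], hyp.end)
        else:
            utterances.append({"user_id": hyp.user_id, "start": hyp.start,
                               "end": hyp.end, "words": [hyp.word]})
    return utterances


def transcript(hypotheses):
    """Merge per-user utterances in time order and flag overlapped speech."""
    utterances = []
    for stream in split_by_user(hypotheses).values():
        utterances.extend(to_utterances(stream))
    utterances.sort(key=lambda u: (u["start"], u["user_id"]))

    lines = []
    for utt in utterances:
        # Two utterances overlap when their time intervals intersect
        # and they belong to different users.
        overlapped = any(
            other["user_id"] != utt["user_id"]
            and other["start"] < utt["end"]
            and utt["start"] < other["end"]
            for other in utterances
        )
        mark = " (overlapping)" if overlapped else ""
        lines.append(f"{utt['user_id']}{mark}: {' '.join(utt['words'])}")
    return lines


if __name__ == "__main__":
    demo = [
        WordHypothesis("alice", "shall", 0.0, 0.3),
        WordHypothesis("alice", "we", 0.3, 0.5),
        WordHypothesis("alice", "start", 0.5, 0.9),
        WordHypothesis("bob", "yes", 0.7, 1.0),
        WordHypothesis("alice", "great", 3.0, 3.4),
    ]
    print("\n".join(transcript(demo)))
```

Because each user's words are kept in a separate stream before merging, two people talking at once produce two readable lines instead of one interleaved line.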

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

G10L15/26 · Primary
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
H04L65/403
Arrangements for multi-party communication, e.g. for conferences (data switching systems for conference H04L12/18; arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities H04M3/56; television conferencing systems H04N7/15) · CPC title
H04N7/15
Conference systems · CPC title
H04R1/406
microphones · CPC title

Patent family

Related publications grouped by family.

View patent family 70293061

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11875796B2 cover?: A computer implemented method includes receiving information streams on a meeting server from a set of multiple distributed devices included in a meeting, receiving audio signals representative of speech by at least two users in at least two of the information streams, receiving at least one video signal of at least one user in the information streams, associating a specific user with speech in…