User Interface Tile Arrangement Based On Relative Locations Of Conference Participants
US-2023081717-A1 · Mar 16, 2023 · US
US11800057B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11800057-B2 |
| Application number | US-202117646704-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 31, 2021 |
| Priority date | Dec 31, 2021 |
| Publication date | Oct 24, 2023 |
| Grant date | Oct 24, 2023 |
Official abstract text for this publication.
In a multi-camera videoconferencing configuration, the locations of each camera are known. By referencing a known object visible to each camera, a 3D coordinate system is developed, with the position and angle of each camera being associated with that 3D coordinate system. The locations of the conference participants in the 3D coordinate system are determined for each camera. Sound source localization (SSL) from one camera, generally a central camera, is used to determine the speaker. The pose of the speaker is then determined. From the pose and the known locations of the cameras, the camera with the best frontal view of the speaker is determined. The 3D coordinates of the speaker are then used to direct the determined camera to frame the speaker. If the face of the speaker is not sufficiently visible, the next best camera view is determined, and the speaker framed from that camera view.
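The camera-selection step in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation. It assumes a simplified 2D floor plan in world coordinates and assumes the speaker's facial pose has already been converted to a world-frame yaw angle. The function names (`facing_vector`, `frontal_score`, `rank_cameras`), the camera names, and the cosine-based score are all hypothetical choices made for illustration. Ranking the cameras, rather than picking a single one, leaves room for the "next best camera view" fallback the abstract mentions.

```python
import math


def facing_vector(yaw_deg):
    """Unit vector in the floor plane for a head yaw in degrees,
    measured counterclockwise from the world +x axis (assumed convention)."""
    yaw = math.radians(yaw_deg)
    return (math.cos(yaw), math.sin(yaw))


def frontal_score(speaker_xy, facing, camera_xy):
    """Cosine of the angle between the speaker's facing direction and the
    direction from the speaker to the camera. 1.0 means the speaker looks
    straight at the camera; -1.0 means the camera sees the back of the head."""
    dx = camera_xy[0] - speaker_xy[0]
    dy = camera_xy[1] - speaker_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return -1.0  # degenerate case: camera at the speaker's position
    return (facing[0] * dx + facing[1] * dy) / dist


def rank_cameras(speaker_xy, yaw_deg, cameras):
    """Return camera names ordered from most frontal to least frontal view."""
    facing = facing_vector(yaw_deg)
    return sorted(
        cameras,
        key=lambda name: frontal_score(speaker_xy, facing, cameras[name]),
        reverse=True,
    )


# Hypothetical room: three cameras along the front wall, speaker 3 m back,
# head turned toward the right-hand camera.
cameras = {"left": (-2.0, 0.0), "center": (0.0, 0.0), "right": (2.0, 0.0)}
ranking = rank_cameras((0.0, 3.0), -45.0, cameras)
print(ranking)  # ['right', 'center', 'left']
```

The first entry is the candidate for the best frontal view. If the face turns out not to be sufficiently visible from that camera, the next entry in the ranking is the natural fallback.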
Opening claim text (preview).
The invention claimed is:
1. A method for selecting a camera of a plurality of cameras, each with a different view of a group of participants in an environment and providing a video stream, one camera of the plurality of cameras having a microphone array, to provide a video stream for provision to a far end, the method comprising: determining world coordinates of each participant for each camera of the plurality of cameras; utilizing sound source localization using the microphone array on the one camera to determine speaker direction information; identifying a speaker in the group of participants using the speaker direction information and an image from the video stream of the one camera; determining world coordinates of the speaker based on the identification; determining facial pose of the speaker in the image from the video stream of the one camera; selecting a camera from the plurality of cameras to provide a video stream for provision to the far end based on the locations of the plurality of cameras other than the one camera and the facial pose of the speaker; and utilizing the determined speaker world coordinates to frame the speaker in the video stream of the selected camera.
2. The method of claim 1, further comprising: determining the rotation and translation of a coordinate system of each of the plurality of cameras to the world coordinate system.
3. The method of claim 1, further comprising: selecting the camera of the plurality of cameras providing the most frontal views of participants when there is not a speaker and there are participants; and selecting a default camera when there are no participants.
4. The method of claim 1, wherein determining the world coordinates of each participant includes storing the determined world coordinates of each participant in a table of cameras and individuals from the perspective of the camera, and wherein utilizing the determined speaker world coordinates to frame the speaker includes using the determined speaker world coordinates to find the appropriate individual for the selected camera from the table.
5. The method of claim 1, further comprising: determining if the frontal view of the speaker provided from the selected camera is satisfactory; and providing a framed view of the speaker from the selected camera when the frontal view of the speaker provided from the selected camera is satisfactory.
6. The method of claim 5, further comprising: utilizing the determined speaker world coordinates to evaluate the facial view of the speaker from each camera of the plurality of cameras other than the selected camera when the frontal view of the speaker provided from the selected camera is not satisfactory; and providing a framed view of the speaker from the camera of the plurality of cameras that has the best frontal view of the speaker when the frontal view of the speaker provided from the selected camera is not satisfactory.
7. The method of claim 1, wherein the one camera is the central camera of the plurality of cameras.
8. A non-transitory processor readable memory containing instructions that when executed cause a processor or processors to perform the following method of selecting a camera of a plurality of cameras, each with a different view of a group of participants in an environment and providing a video stream, one camera of the plurality of cameras having a microphone array, to provide a video stream for provision to a far end, the method comprising: determining the world coordinates of each participant for each camera of the plurality of cameras; utilizing sound source localization using the microphone array on the one camera to determine speaker direction information; identifying a speaker in the group of participants using the speaker direction information and an image from the video stream of the one camera; determining world coordinates of the speaker based on the identification; determining facial pose of the speaker in the image from the video stream of the one camera; selecting a camera from the plurality of cameras to provide a video stream for provision to the far end based on the locations of the plurality of cameras other than the one camera and the facial pose of the speaker; and utilizing the determined speaker world coordinates to frame the speaker in the video stream of the selected camera.
9. The non-transitory processor readable memory of claim 8, the method further comprising: determining the rotation and translation of a coordinate system of each of the plurality of cameras to a world coordinate system.
10. The non-transitory processor readable memory of claim 9, the method further comprising: selecting the camera providing the most frontal views of participants when there is not a speaker and there are participants; and selecting a default camera when there are no participants.
11. The non-transitory processor readable memory of claim 8, wherein determining the world coordinates of each participant includes storing the determined world coordinates of each participant in a table of cameras and individuals from the perspective of the camera, and wherein utilizing the determined speaker world coordinates to frame the speaker includes using the determined speaker world coordinates to find the appropriate individual for the selected camera from the table.
12. The non-transitory processor readable memory of claim 8, the method further comprising: determining if the frontal view of the speaker provided from the selected camera is satisfactory; and providing a framed view of the speaker from the selected camera when the frontal view of the speaker provided from the selected camera is satisfactory.
13. The non-transitory processor readable memory of claim 12, the method further comprising: utilizing the determined speaker world coordinates to evaluate the frontal view of the speaker from each camera of the plurality of cameras other than the selected camera when the frontal view of the speaker provided from the selected camera is not satisfactory; and providing a framed view of the speaker from the camera of the plurality of cameras that has the best frontal view of the speaker when the frontal view of the speaker provided from the selected camera is not satisfactory.
14. The non-transitory processor readable memory of claim 8, wherein the one camera is the central camera of the plurality of cameras.
15. A system for selecting a camera of a plurality of cameras, each with a different view of a group of participants in an environment, to provide a video stream for provision to a far end, the system comprising: a plurality of cameras, each camera including: an imager; a camera output interface for providing data and a video stream; camera random access memory (RAM); a camera processor coupled to the imager, the camera output interface and the camera RAM for executing instructions; and camera memory coupled to the camera processor for storing instructions executed by the processor, the camera memory storing instructions executed by the camera processor to perform the operation of providing a video stream from the camera, one camera of the plurality of cameras further including a microphone array and the camera memory of the one camera further storing instructions to utilize sound source localization using the microphone array to determine direction information and provide the direction information; and a codec coupled to the plurality of cameras, the codec including: a codec input interface for coupling to the plurality of cameras to receive data and video streams; a network interfa
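The per-camera table of claim 4 and the fallback decisions of claims 3, 5 and 6 can be sketched as follows. This is a rough illustration under stated assumptions, not the claimed implementation. The table is assumed to be a dictionary mapping each camera to its own individual ids and their world coordinates, the match is assumed to be a nearest-neighbor search within a tolerance, and "satisfactory" is assumed to mean a frontal score at or above a threshold. The names `find_individual` and `choose_camera`, the parameters `max_dist` and `threshold`, and all sample values are hypothetical.

```python
import math


def find_individual(table, camera, speaker_xyz, max_dist=0.5):
    """Look up, in the given camera's row of the table, the individual whose
    stored world coordinates lie closest to the speaker's world coordinates.
    Returns None when nobody is within max_dist (assumed tolerance, meters)."""
    best_id, best_d = None, max_dist
    for person_id, xyz in table.get(camera, {}).items():
        d = math.dist(xyz, speaker_xyz)
        if d <= best_d:
            best_id, best_d = person_id, d
    return best_id


def choose_camera(frontal_counts, speaker_scores, selected, default_camera,
                  threshold=0.5):
    """Decide which camera's stream to send to the far end.

    frontal_counts: camera -> number of participants seen frontally
    speaker_scores: camera -> frontal score of the speaker, or None when
                    nobody is speaking
    selected:       camera initially chosen from pose and camera locations
    """
    # No participants at all: fall back to the default camera (claim 3).
    if not any(frontal_counts.values()):
        return default_camera
    # Participants but no speaker: most frontal views wins (claim 3).
    if speaker_scores is None:
        return max(frontal_counts, key=frontal_counts.get)
    # Speaker present and the selected view is satisfactory (claim 5).
    if speaker_scores[selected] >= threshold:
        return selected
    # Otherwise evaluate the remaining cameras and take the best (claim 6).
    others = {c: s for c, s in speaker_scores.items() if c != selected}
    if not others:
        return selected
    return max(others, key=others.get)


# Hypothetical table: each camera indexes the same people under its own ids.
table = {
    "center": {"p1": (0.0, 3.0, 1.2), "p2": (1.5, 2.5, 1.2)},
    "right": {"a": (0.05, 2.95, 1.2), "b": (1.5, 2.5, 1.25)},
}
speaker_xyz = table["center"]["p1"]
print(find_individual(table, "right", speaker_xyz))  # a
```

Keying the lookup on world coordinates rather than on per-camera ids reflects the claim language: the same person may carry a different index in each camera's view, so the shared coordinate system is what ties the rows together.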
Conference systems · CPC title
Detection; Localisation; Normalisation · CPC title
Decision making techniques; Pattern matching strategies · CPC title
Conducting the conference, e.g. admission, detection, selection or grouping of participants, correlating users to one or more conference sessions, prioritising transmission · CPC title
for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems · CPC title