Assignment of Unique Identifications to People in Multi-Camera Field of View
US-2024135748-A1 · Apr 25, 2024 · US
US12412421B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12412421-B2 |
| Application number | US-202317971243-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 5, 2023 |
| Priority date | Feb 5, 2023 |
| Publication date | Sep 9, 2025 |
| Grant date | Sep 9, 2025 |
Official abstract text for this publication.
A multi-camera video conference call system is provided with a plurality of cameras connected together over a communication network to generate a corresponding plurality of input frame images taken from different perspectives of a video conference room, where the multi-camera video conference call system detects one or more human heads for any meeting participants captured in the input frame images, generates a head bounding box which surrounds each detected human head, extracts a body bounding box which surrounds the detected human head and at least an upper body portion of a meeting participant belonging to the detected human head, generates a participant identification feature embedding from each body bounding box, and performs person re-identification processing on all generated participant identification feature embeddings to determine a count of the meeting participants in the video conference room.
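The step of deriving a body bounding box from a head bounding box can be illustrated with a short sketch. The claims describe extending the head box by predetermined distances in the vertical and horizontal directions but give no values in this excerpt, so the `Box` type, the function name `head_to_body_box`, and the pixel margins `pad_x`, `pad_top`, and `pad_bottom` below are all hypothetical. This is one possible reading under those assumptions, not the patented implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: int
    y1: int
    x2: int
    y2: int


def head_to_body_box(head, frame_w, frame_h, pad_x=40, pad_top=10, pad_bottom=120):
    """Extend a head box into a head-plus-upper-body box.

    The margins are illustrative assumptions, not values from the patent.
    The result is clamped so it never leaves the frame.
    """
    return Box(
        x1=max(0, head.x1 - pad_x),             # widen to the left
        y1=max(0, head.y1 - pad_top),           # small margin above the head
        x2=min(frame_w, head.x2 + pad_x),       # widen to the right
        y2=min(frame_h, head.y2 + pad_bottom),  # extend down over the upper body
    )


# Hypothetical head detection in a 640x360 frame.
body = head_to_body_box(Box(100, 50, 140, 90), frame_w=640, frame_h=360)
```

The clamping matters for participants seated near the edge of a camera's field of view, where an unclamped extension would index outside the frame when the image portion is cropped for the embedding model.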
Opening claim text (preview).
What is claimed is:

1. A method for identifying meeting participants in a multi-camera video conference room, comprising:
generating a plurality of input frame images taken from different perspectives of a video conference room by a corresponding plurality of cameras connected together;
detecting, from an input frame image associated with each camera, one or more human heads for any meeting participants captured in the input frame image by applying a machine learning human head detector model to said input image frame;
generating, from each detected human head, a head bounding box which surrounds the detected human head;
extracting, from each head bounding box, a body bounding box which surrounds the detected human head and at least an upper body portion of a meeting participant belonging to the detected human head, thereby generating a plurality of body bounding boxes from the plurality of input frame images;
generating, from each input frame image portion contained within the body bounding box, a participant identification feature embedding which uniquely identifies the meeting participant captured in the body bounding box, thereby generating a plurality of participant identification feature embeddings from the plurality of body bounding boxes; and
performing person re-identification processing on the plurality of participant identification feature embeddings to determine a count of the meeting participants in the video conference room, wherein performing person re-identification processing comprises:
dividing the plurality of participant identification feature embeddings into a query set and a gallery set, and
comparing the query set to the gallery set to identify k top feature embedding matches so that matching feature embeddings are assigned to the same meeting participant,
wherein the query set contains participant identification feature embeddings extracted from body bounding boxes generated from a first input frame captured at a primary camera, and
wherein the gallery set contains participant identification feature embeddings extracted from body bounding boxes generated from one or more additional input frames captured at one or more secondary cameras.

2. The method of claim 1, where detecting one or more human heads comprises classifying each detected human head as having a frontal, profile, or back head orientation and discarding any detected human head that is classified as a profile or back head orientation before extracting, from each head bounding box, a body bounding box.

3. The method of claim 1, wherein detecting one or more human heads comprises:
applying image pre-processing to each input frame image;
applying a machine learning human head detector model to each input image frame to generate an output tensor for each detected human head; and
applying image post-processing to convert each output tensor to a head bounding box which surrounds a corresponding detected human head.

4. The method of claim 1, wherein extracting each body bounding box comprises extending the head bounding box by predetermined distances in both vertical and horizontal directions to surround the detected human head and at least the upper body portion of the meeting participant belonging to the detected human head.

5. The method of claim 1, wherein generating each participant identification feature embedding comprises applying a deep convolutional neural network (CNN) model to generate a multi-dimensional feature embedding for each body bounding box.

6. The method of claim 1, wherein the plurality of participant identification feature embeddings are generated at the plurality of cameras, and where a central codec performs person re-identification processing on the plurality of participant identification feature embeddings.

7. A system for identifying meeting participants in a multi-camera video conference room, comprising:
a plurality of camera input devices connected over a communication network to a video codec device, where each of the camera input devices comprises:
a first processor;
a first data bus coupled to the first processor; and
a non-transitory, computer-readable storage medium embodying computer program code, the non-transitory, computer-readable storage medium being coupled to the first data bus, the computer program code interacting with a plurality of computer operations and comprising first instructions executable by the first processor and configured for:
generating an input frame image taken from a different perspective of a video conference room;
detecting, from an input frame image associated with each camera, one or more human heads for any meeting participants captured in the input frame image by applying a machine learning human head detector model to said input image frame;
generating, from each detected human head, a head bounding box which surrounds the detected human head;
extracting, from each head bounding box, a body bounding box which surrounds the detected human head and at least an upper body portion of a meeting participant belonging to the detected human head; and
generating, from each input frame image portion contained within the body bounding box, a participant identification feature embedding which uniquely identifies the meeting participant captured in the body bounding box; and
where the video codec device comprises:
a second processor;
a second data bus coupled to the second processor; and
a non-transitory, computer-readable storage medium embodying computer program code, the non-transitory, computer-readable storage medium being coupled to the second data bus, the computer program code interacting with a plurality of computer operations and comprising second instructions executable by the second processor and configured for:
performing person re-identification processing on participant identification feature embeddings generated by the plurality of input camera devices to determine a count of the meeting participants in the video conference room, wherein the second instructions executable by the processor are configured for performing person re-identification processing by:
dividing the plurality of participant identification feature embeddings into a query set and a gallery set,
comparing the query set to the gallery set to identify k top feature embedding matches so that matching feature embeddings are assigned to the same meeting participant,
wherein the query set contains participant identification feature embeddings extracted from body bounding boxes generated from a first input frame captured at a primary camera input device, and
wherein the gallery set contains participant identification feature embeddings extracted from body bounding boxes generated from one or more additional input frames captured at one or more secondary camera input devices.

8. The system of claim 7, wherein the first instructions executable by the processor are configured for detecting one or more human heads by classifying each detected human head as having a frontal, profile, or back head orientation and discarding any detected human head that is classified as a profile or back head orientation before extracting, from each head bounding box, a body bounding box.

9. The system of claim 7, wherein the first instructions executable by the processor are configured for detecting one or more human heads by:
applying image pre-processing to each input frame image;
applying a machine learning human head detector model to each input image frame to generate an output tensor for each detected human head; and
applying image post-processing to convert each output tensor to a head bounding box which surrounds a corresponding detected human head.

10. The system o
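The query/gallery comparison recited in the independent claims can be sketched as follows. The excerpt does not name a similarity measure, a match threshold, or a rule for embeddings that match no query, so cosine similarity, the `threshold` value, and the policy of giving each unmatched gallery embedding a fresh identity are assumptions made only for illustration. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_participants(query, gallery, k=1, threshold=0.8):
    """Assign identities across cameras and return (count, gallery_ids).

    query:   embeddings from the primary camera; each is treated as a distinct person.
    gallery: embeddings from the secondary cameras.
    k and threshold are illustrative assumptions, not values from the patent.
    """
    gallery_ids = [None] * len(gallery)
    best_sim = [0.0] * len(gallery)
    for q_id, q in enumerate(query):
        sims = [cosine_similarity(q, g) for g in gallery]
        # Rank gallery entries by similarity to this query and keep the top k.
        ranked = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
        for g_idx in ranked[:k]:
            # Accept a match only if it is strong enough and beats any earlier claim.
            if sims[g_idx] >= threshold and sims[g_idx] > best_sim[g_idx]:
                gallery_ids[g_idx] = q_id
                best_sim[g_idx] = sims[g_idx]
    # Simplification: every unmatched gallery embedding becomes a new participant.
    next_id = len(query)
    for g_idx, assigned in enumerate(gallery_ids):
        if assigned is None:
            gallery_ids[g_idx] = next_id
            next_id += 1
    return next_id, gallery_ids


# Hypothetical 3-dimensional embeddings: two people at the primary camera,
# three detections across the secondary cameras.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gallery = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
count, gallery_ids = count_participants(query, gallery, k=1)
```

With more than one secondary camera, a larger `k` lets one query claim the same person's embedding from each additional view. A fuller version would also compare unmatched gallery embeddings against one another, since the same person may be visible from two secondary cameras but not from the primary one.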
Determining position or orientation of objects or cameras (camera calibration G06T7/80) · CPC title
Artificial neural networks [ANN] · CPC title
Counting objects in image · CPC title
Classification, e.g. identification · CPC title
Surveillance or monitoring of activities, e.g. for recognising suspicious objects (recognising microscopic objects G06V20/69) · CPC title