Diffusion-based open-vocabulary segmentation

US12586199B2 · US · B2

Patent metadata
Publication number: US-12586199-B2
Application number: US-202318310414-A
Country: US
Kind code: B2
Filing date: May 1, 2023
Priority date: Nov 1, 2022
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An open-vocabulary diffusion-based panoptic segmentation system is not limited to perform segmentation using only object categories seen during training, and instead can also successfully perform segmentation of object categories not seen during training and only seen during testing and inferencing. In contrast with conventional techniques, a text-conditioned diffusion (generative) model is used to perform the segmentation. The text-conditioned diffusion model is pre-trained to generate images from text captions, including computing internal representations that provide spatially well-differentiated object features. The internal representations computed within the diffusion model comprise object masks and a semantic visual representation of the object. The semantic visual representation may be extracted from the diffusion model and used in conjunction with a text representation of a category label to classify the object. Objects are classified by associating the text representations of category labels with the object masks and their semantic visual representations to produce panoptic segmentation data.
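The last step of the abstract, associating text representations of category labels with per-object semantic representations, can be illustrated with a small sketch. This is not the patented implementation: the cosine-similarity scoring, the softmax temperature, the function names, and the 3-dimensional toy embeddings are all assumptions made for illustration. In a real system the embeddings would come from the diffusion model and a text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores, temperature=0.07):
    """Numerically stable softmax; the temperature value is an assumption."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def classify_masks(mask_embeddings, label_embeddings):
    """Assign each mask embedding the label whose text embedding is closest.

    label_embeddings maps a category label to its (hypothetical) text
    embedding. Labels unseen in training can be added at inference time
    simply by adding entries to this mapping.
    """
    labels = list(label_embeddings)
    results = []
    for emb in mask_embeddings:
        sims = [cosine(emb, label_embeddings[name]) for name in labels]
        probs = softmax(sims)
        best = max(range(len(labels)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results

# Toy data (made up): one text embedding per label, one embedding per mask.
label_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_masks(mask_embeddings, label_embeddings)
print([label for label, _ in predictions])
```

Because the label set is just a dictionary of text embeddings, the vocabulary is open in the sense the abstract describes: the classifier has no fixed output layer tied to the training categories.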

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating segmentation data, comprising: processing an input image and corresponding metadata representing a description of the input image by a diffusion model that has been trained to synthesize an image based on the description; extracting an internal feature representation of the input image defined by features computed by at least one intermediate layer during at least one processing iteration of the diffusion model; and computing the segmentation data for the input image using the internal feature representation.

2. The method of claim 1, wherein the segmentation data comprises object masks for one or more objects depicted in the input image and object category labels corresponding to the description to the object masks that are mapped to the object masks.

3. The method of claim 1, further comprising generating panoptic segmentation data for the input image based on the segmentation data and text embeddings corresponding to a caption associated with the description or object category labels corresponding to the description.

4. The method of claim 3, further comprising: extracting the object category labels from the caption; and processing the object category labels by a text encoder to produce the text embeddings.

5. The method of claim 4, wherein a mask generator applies parameters to the internal feature representation to compute the segmentation data comprising object masks and mask embeddings.

6. The method of claim 5, wherein during training of the parameters, the object category labels comprise a training set of object category labels and during inference when the parameters are unchanged at least one new object category label that is not included in the set is encoded in the text embeddings.

7. The method of claim 1, wherein the metadata comprises an encoded text caption.

8. The method of claim 7, further comprising processing the input image by an implicit captioner to generate the encoded text caption.

9. The method of claim 8, wherein an image encoder processes the input image to generate image features and a multilayer perceptron projects the image features to generate the encoded text caption.

10. The method of claim 9, wherein the segmentation data comprises object masks and mask embeddings and further comprising: processing the image features by a mask pooling unit to produce additional mask embeddings; and combining the text embeddings corresponding to object category labels, the mask embeddings, and the additional mask embeddings to generate panoptic segmentation data for the input image.

11. The method of claim 10, wherein the object category labels include at least one object category label that was not used to train the mask pooling unit and the multilayer perceptron.

12. The method of claim 1, wherein at least one of the steps of processing, extracting, or computing is performed on a server or in a data center and the segmentation data is streamed to a user device.

13. The method of claim 1, wherein at least one of the steps of processing, extracting, or computing is performed within a cloud computing environment.

14. The method of claim 1, wherein at least one of the steps of processing, extracting, or computing is for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle.

15. The method of claim 1, wherein at least one of the steps of processing, extracting, or computing is performed on a virtual machine comprising a portion of a graphics processing unit.

16. A system, comprising: a processor configured to execute a diffusion model to generate segmentation data by: processing an input image and corresponding metadata representing a description of the input image by a diffusion model that has been trained to synthesize an image based on the description; extracting an internal feature representation of the input image defined by features computed by at least one intermediate layer during at least one processing iteration of the diffusion model; and computing the segmentation data for the input image using the internal feature representation.

17. The system of claim 16, wherein the segmentation data comprises object masks for one or more objects depicted in the input image and object category labels corresponding to the description to the object masks that are mapped to the object masks.

18. The system of claim 16, further comprising generating panoptic segmentation data for the input image based on the segmentation data and text embeddings corresponding to a caption associated with the description or object category labels corresponding to the description.

19. A non-transitory computer-readable media storing computer instructions that, when executed by one or more processors, cause the one or more processors to generate segmentation data by performing the steps of: processing an input image and corresponding metadata representing a description of the input image by a diffusion model that has been trained to synthesize an image based on the description; extracting an internal feature representation of the input image defined by features computed by at least one intermediate layer during at least one processing iteration of the diffusion model; and computing the segmentation data for the input image using the internal feature representation.

20. The non-transitory computer-readable media of claim 19, further comprising generating panoptic segmentation data for the input image based on the segmentation data and text embeddings corresponding to a caption associated with the description or object category labels corresponding to the description.
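Claim 10 describes a mask pooling unit that turns image features into additional mask embeddings, which are then combined with the mask generator's embeddings and the label text embeddings. The sketch below shows one way such a step could look. The claim only says "combining"; the average pooling, the cosine-softmax scoring, the geometric-mean fusion with a `weight` parameter, and every function name here are assumptions chosen for illustration, not details taken from the patent.

```python
import math

def mask_pool(features, mask):
    """Average the per-pixel feature vectors that fall inside a binary mask."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for vec, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        return total  # empty mask: fall back to a zero vector
    return [t / count for t in total]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_probs(embedding, text_embeddings, temperature=0.1):
    """Softmax over cosine similarities to each label's text embedding."""
    sims = [cosine(embedding, t) / temperature for t in text_embeddings]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(p_generator, p_pooled, weight=0.5):
    """Geometric-mean fusion of two label distributions (an assumed choice)."""
    fused = [a ** (1.0 - weight) * b ** weight
             for a, b in zip(p_generator, p_pooled)]
    total = sum(fused)
    return [f / total for f in fused]

# Toy data (made up): a 2x2 feature map with 2-d features, two masks.
labels = ["dog", "grass"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[[1.0, 0.0], [1.0, 0.0]],
                  [[0.0, 1.0], [0.0, 1.0]]]
masks = [[[1, 1], [0, 0]],
         [[0, 0], [1, 1]]]
mask_embeddings = [[0.8, 0.2], [0.3, 0.7]]  # stand-in for mask generator output

panoptic = []
for mask, emb in zip(masks, mask_embeddings):
    pooled = mask_pool(image_features, mask)  # additional mask embedding
    probs = fuse(label_probs(emb, text_embeddings),
                 label_probs(pooled, text_embeddings))
    best = max(range(len(labels)), key=probs.__getitem__)
    panoptic.append(labels[best])
print(panoptic)
```

Each mask ends up with two independent label distributions, one from the mask generator's embedding and one from the pooled image features, and the fusion step lets either source veto a label the other scores poorly.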

Assignees

Nvidia Corp

Inventors

Classifications

  • Extraction of image or video features · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Training; Learning · CPC title

  • Recognition assisted with metadata · CPC title

  • Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586199B2 cover?
An open-vocabulary diffusion-based panoptic segmentation system is not limited to perform segmentation using only object categories seen during training, and instead can also successfully perform segmentation of object categories not seen during training and only seen during testing and inferencing. In contrast with conventional techniques, a text-conditioned diffusion (generative) model is use…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06T7/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).