Technique for sensor data based medical examination report generation
US-2024282419-A1 · Aug 22, 2024 · US
US12277758B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12277758-B2 |
| Application number | US-202318400856-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 29, 2023 |
| Priority date | Mar 24, 2023 |
| Publication date | Apr 15, 2025 |
| Grant date | Apr 15, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium. In one aspect, a method includes receiving a text prompt describing a scene; processing the text prompt using a text encoder neural network to generate a contextual embedding of the text prompt; and processing the contextual embedding using a sequence of generative neural networks to generate a final video depicting the scene.
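The flow described in the abstract (text prompt, then contextual embedding, then a sequence of generative networks that progressively refine a video) can be sketched as follows. This is only an illustrative outline under stated assumptions, not the patented implementation: the names `encode_text`, `base_stage`, `spatial_sr`, `temporal_sr`, and `generate`, the toy embedding, and every resolution and upsampling factor are hypothetical. Each stage tracks only the shape of the video it would produce, so the example shows how resolution grows through the sequence rather than performing any diffusion sampling.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class VideoShape:
    """Shape of a video: temporal resolution (frames) and spatial resolution."""
    frames: int
    height: int
    width: int


Embedding = List[float]
Stage = Callable[[VideoShape, Embedding], VideoShape]


def encode_text(prompt: str) -> Embedding:
    """Hypothetical stand-in for a frozen, pre-trained text encoder."""
    vec = [0.0] * 8
    for i, ch in enumerate(prompt):
        vec[i % 8] += ord(ch) / 1000.0
    return vec


def base_stage(embedding: Embedding) -> VideoShape:
    """Stand-in for the initial network: low spatial and temporal resolution.
    The numbers are illustrative assumptions, not values from the source."""
    return VideoShape(frames=16, height=24, width=40)


def spatial_sr(factor: int) -> Stage:
    """Stand-in for a subsequent network that raises spatial resolution."""
    def stage(video: VideoShape, embedding: Embedding) -> VideoShape:
        return VideoShape(video.frames, video.height * factor, video.width * factor)
    return stage


def temporal_sr(factor: int) -> Stage:
    """Stand-in for a subsequent network that raises temporal resolution."""
    def stage(video: VideoShape, embedding: Embedding) -> VideoShape:
        return VideoShape(video.frames * factor, video.height, video.width)
    return stage


def generate(prompt: str, stages: Sequence[Stage]) -> List[VideoShape]:
    """Run the sequence: each stage consumes the preceding stage's output."""
    embedding = encode_text(prompt)
    video = base_stage(embedding)
    trace = [video]
    for stage in stages:
        video = stage(video, embedding)
        trace.append(video)
    return trace


trace = generate("a cat surfing a wave", [temporal_sr(2), spatial_sr(2)])
final_video = trace[-1]  # output of the last stage in the sequence
```

In this sketch the final video is simply the output of the last stage, and every intermediate stage receives the video produced by its predecessor, mirroring the chained structure the abstract describes.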
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: receiving a text prompt describing a scene; processing the text prompt, using a text encoder neural network, to generate a contextual embedding of the text prompt; and processing the contextual embedding, using a sequence of diffusion-based generative neural networks, to generate a final video depicting the scene, wherein the sequence of diffusion-based generative neural networks comprises: an initial diffusion-based generative neural network configured to: receive the contextual embedding; and process the contextual embedding to generate, as output, an initial output video having: (i) an initial spatial resolution, and (ii) an initial temporal resolution; and one or more subsequent diffusion-based generative neural networks each configured to: receive a respective input comprising an input video generated as output by a preceding diffusion-based generative neural network in the sequence; and process the respective input to generate, as output, a respective output video having at least one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video, wherein the diffusion-based generative neural networks have been jointly trained on training data comprising a plurality of training examples that each include: (i) respective input text describing a respective scene, and (ii) a corresponding target video depicting the respective scene, wherein the text encoder neural network is pre-trained and was held frozen during the joint training of the diffusion-based generated neural networks, and wherein training each subsequent diffusion-based generative neural network on the training data comprised: obtaining a resized version of the respective target video of each training example to generate: (i) a respective training input video, and (ii) a corresponding target output video; and training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example.
2. The method of claim 1, wherein training the subsequent diffusion-based generative neural network using the training input and target output videos of each training example comprised: processing the respective training input video of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.
3. The method of claim 1, wherein training the initial diffusion-based generative neural network on the training data comprised: processing the respective input text of each training example, using the text encoder neural network, to generate a respective contextual embedding of the respective input text; resizing the respective target video of each training example to generate a corresponding initial target output video; and training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example.
4. The method of claim 3, wherein training the initial diffusion-based generative neural network using the contextual embedding and initial target output video of each training example comprised: processing the respective contextual embedding of each training example, using the initial diffusion-based generative neural network, to generate a respective initial training output video; calculating a gradient of an objective function that depends on the initial training and target output videos of each training example; and updating a set of network parameters of the initial diffusion-based generative neural network according to the gradient of the objective function.
5. The method of claim 3, wherein: the respective input of each subsequent diffusion-based generative neural network further comprises the contextual embedding of the text prompt, and training each subsequent diffusion-based generative neural network on the training data further comprised: training the subsequent diffusion-based generative neural network using the contextual embedding of each training example.
6. The method of claim 5, wherein training the subsequent generative neural network using the contextual embedding, training input video, and target output video of each training example comprised: generating, for each training example, a respective input comprising: (i) the respective training input video, and (ii) the respective input contextual embedding; processing the respective input of each training example, using the subsequent diffusion-based generative neural network, to generate a respective training output video; calculating a gradient of an objective function that depends on the training and target output videos of each training example; and updating a set of network parameters of the subsequent diffusion-based generative neural network according to the gradient of the objective function.
7. The method of claim 1, wherein: the one or more subsequent diffusion-based generative neural networks are a plurality of subsequent diffusion-based generative neural networks, and the respective output video of each of the plurality of subsequent diffusion-based generative neural networks has one of: (i) a higher spatial resolution, or (ii) a higher temporal resolution, than the input video.
8. The method of claim 7, wherein: the initial diffusion-based generative neural network implements spatial self-attention and temporal self-attention, and each subsequent diffusion-based generative neural network implements spatial convolution and temporal convolution.
9. The method of claim 8, wherein: the initial diffusion-based generative neural network further implements spatial convolution, and each subsequent diffusion-based generative neural network that is not a final diffusion-based generative neural network in the sequence further implements spatial self-attention.
10. The method of claim 1, wherein the diffusion-based generative neural networks were trained using classifier-free guidance.
11. The method of claim 1, wherein each diffusion-based generative neural network uses a v-prediction parametrization to generate the respective output video.
12. The method of claim 11, wherein each diffusion-based generative neural network further uses progressive distillation to generate the respective output video.
13. The method of claim 1, wherein each subsequent diffusion-based generative neural network applies noise conditioning augmentation on the input video.
14. The method of claim 1, wherein the final video is the respective output video of a final diffusion-based generative neural network in the sequence.
15. The method of claim 1, wherein the initial spatial resolution of the initial output video corresponds to an initial per frame pixel resolution, with higher spatial resolutions corresponding to higher per frame pixel resolutions.
16. The method of claim 1, wherein the initial temporal resolution of the initial output video corresponds to an initial framerate, with higher temporal resolutions corresponding to higher framerates.
17. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more comput
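The resizing step recited at the end of claim 1, in which each subsequent network is trained on a lower-resolution training input video and a corresponding target output video derived from the same target video, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names `downsample`, `make_training_pair`, and `shape` are hypothetical, a video is modelled as nested Python lists, and resizing is done by plain stride subsampling, which the claims do not specify.

```python
from typing import List, Tuple

Frame = List[List[float]]  # rows of pixel intensities
Video = List[Frame]        # sequence of frames


def downsample(video: Video, spatial: int = 1, temporal: int = 1) -> Video:
    """Keep every `temporal`-th frame and every `spatial`-th pixel per axis.
    Stride subsampling is an assumed resizing method, chosen for brevity."""
    return [
        [row[::spatial] for row in frame[::spatial]]
        for frame in video[::temporal]
    ]


def make_training_pair(target: Video, spatial: int, temporal: int) -> Tuple[Video, Video]:
    """Build (training input video, target output video) for one later stage.
    The target output is the stage's desired resolution; the training input
    is a reduced version of it that the stage learns to upsample."""
    target_output = target
    training_input = downsample(target, spatial=spatial, temporal=temporal)
    return training_input, target_output


def shape(video: Video) -> Tuple[int, int, int]:
    """Return (frames, height, width) of a non-empty video."""
    return (len(video), len(video[0]), len(video[0][0]))


# Toy target video: 6 frames of 4 x 8 pixels with deterministic values.
target = [[[float(t + y + x) for x in range(8)] for y in range(4)] for t in range(6)]
training_input, target_output = make_training_pair(target, spatial=2, temporal=2)
```

With these toy values the training input has half the frames and half the per-frame pixel dimensions of the target output, which corresponds to a stage that raises both temporal and spatial resolution. A stage that raises only one of the two would leave the other factor at 1.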
based on super-resolution, i.e. the output image resolution being higher than the sensor resolution · CPC title
Combinations of networks · CPC title
using neural networks · CPC title