Music driven human dancing video synthesis
US-2020342646-A1 · Oct 29, 2020 · US
US12283087B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12283087-B2 |
| Application number | US-202017109072-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 1, 2020 |
| Priority date | Nov 19, 2019 |
| Publication date | Apr 22, 2025 |
| Grant date | Apr 22, 2025 |
Official abstract text for this publication.
A model training method includes obtaining an image sample set and brief-prompt information; generating a content mask set according to the image sample set and the brief-prompt information; generating a to-be-trained image set according to the content mask set; obtaining, based on the image sample set and the to-be-trained image set, a predicted image set through a to-be-trained information synthesis model, the predicted image set comprising at least one predicted image, the predicted image being in correspondence to the image sample; and training, based on the predicted image set and the image sample set, the to-be-trained information synthesis model by using a target loss function, to obtain an information synthesis model.
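The content-mask step summarized in the abstract is detailed in claims 4 and 5: a convex hull over face key-points, an expanded and a contracted copy of that mask, and a to-be-trained image in which the expanded region is blanked and the contracted region is filled back. The following is a minimal pure-Python sketch of that flow, not the patented implementation. Several choices are assumptions made for illustration: the hull uses Andrew's monotone chain, the "mask expansion proportion" and "mask contraction proportion" are read as scale factors about the mean of the hull vertices, the blank value is 0, and all function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Grid = List[List[int]]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq: Iterable[Point]) -> List[Point]:
        chain: List[Point] = []
        for p in seq:
            # Drop points that would make a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def scale_about_centroid(hull: List[Point], proportion: float) -> List[Point]:
    """Assumed reading of the expansion/contraction proportion: scale the
    hull about the mean of its vertices (>1 expands, <1 contracts)."""
    cx = sum(x for x, _ in hull) / len(hull)
    cy = sum(y for _, y in hull) / len(hull)
    return [(cx + proportion * (x - cx), cy + proportion * (y - cy)) for x, y in hull]


def rasterize(hull: List[Point], height: int, width: int) -> Grid:
    """Binary mask: 1 where the pixel centre lies inside or on the hull.
    Assumes at least three non-collinear hull vertices."""
    n = len(hull)

    def inside(p: Point) -> bool:
        return all(cross(hull[i], hull[(i + 1) % n], p) >= -1e-9 for i in range(n))

    return [[1 if inside((x, y)) else 0 for x in range(width)] for y in range(height)]


def compose_training_image(image: Grid, expanded: Grid, contracted: Grid, blank: int = 0) -> Grid:
    """Blank the expanded region, then fill the contracted region back in."""
    return [
        [
            image[y][x] if contracted[y][x] or not expanded[y][x] else blank
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]


# Toy example: five hypothetical key-points on a 9x9 single-channel image.
keypoints = [(2, 2), (6, 2), (6, 6), (2, 6), (4, 4)]
hull = convex_hull(keypoints)
expanded = rasterize(scale_about_centroid(hull, 1.5), 9, 9)
contracted = rasterize(scale_about_centroid(hull, 0.5), 9, 9)
image = [[10 * y + x + 1 for x in range(9)] for y in range(9)]
to_be_trained = compose_training_image(image, expanded, contracted)
```

The result keeps the original pixels outside the expanded mask and inside the contracted mask, leaving a blank ring between the two for the model to synthesize.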
Opening claim text (preview).
What is claimed is:
1. A model training method, the method comprising: obtaining an image sample set and brief-prompt information, the image sample set comprising at least one image sample, the brief-prompt information representing key-point information of a to-be-trained object in the at least one image sample, wherein the at least one image sample includes a plurality of consecutive image samples, and the plurality of consecutive image samples are used for forming a video sample; generating a content mask set according to the image sample set and the brief-prompt information, the content mask set comprising at least one content mask, the at least one content mask being obtained by extending outward a region identified according to the brief-prompt information in the at least one image sample; generating a to-be-trained image set according to the content mask set, the to-be-trained image set comprising at least one to-be-trained image, the at least one to-be-trained image being in correspondence to the at least one image sample; obtaining, based on the image sample set and the to-be-trained image set, a predicted image set through a to-be-trained information synthesis model, the predicted image set comprising at least one predicted image, the at least one predicted image being in correspondence to the at least one image sample; and training, based on the predicted image set and the image sample set, the to-be-trained information synthesis model by using a target loss function, to obtain an information synthesis model, comprising: determining a first loss function according to N frames of predicted images in the predicted image set, N frames of to-be-trained images in the to-be-trained image set, and N frames of image samples in the image sample set, N being an integer greater than 1, wherein the first loss function is determined based on an output of a generator of the to-be-trained information synthesis model when inputting a superposition of (N−1) frames of to-be-trained images and an Nth frame of to-be-trained image to the generator; determining a second loss function according to N frames of predicted images in the predicted image set and N frames of image samples in the image sample set; determining the target loss function according to the first loss function and the second loss function; iteratively updating a model parameter of the to-be-trained information synthesis model according to the target loss function; and generating, in a case that an iteration end condition is satisfied, the information synthesis model according to the model parameter of the to-be-trained information synthesis model.
2. The method according to claim 1, wherein: the to-be-trained object is a human body object; the obtaining an image sample set and brief-prompt information comprises: obtaining the image sample set; and obtaining the brief-prompt information corresponding to the at least one image sample in the image sample set by using a human body pose estimator method; and the generating a content mask set according to the image sample set and the brief-prompt information comprises: generating, based on the at least one image sample in the image sample set and according to the brief-prompt information corresponding to the to-be-trained object, a human body key-point image; generating, based on the human body key-point image corresponding to the at least one image sample in the image sample set, a human body skeleton connection image; and generating, based on the human body skeleton connection image corresponding to the at least one image sample in the image sample set, a human body content mask by using a convex hull algorithm, the human body content mask belonging to the at least one content mask.
3. The method according to claim 2, wherein the generating a to-be-trained image set according to the content mask set comprises: covering, based on the human body content mask in the content mask set, the human body content mask on the at least one image sample, and filling the to-be-trained object back to the at least one image sample, to obtain the at least one to-be-trained image in the to-be-trained image set.
4. The method according to claim 1, wherein the generating a content mask set according to the image sample set and the brief-prompt information comprises: generating, based on the at least one image sample in the image sample set and according to the brief-prompt information corresponding to the to-be-trained object, K target human face key-points, each of the K target human face key-points being in correspondence to a human face key-point in the brief-prompt information, K being an integer greater than 1; generating, based on the K target human face key-points of the at least one image sample in the image sample set, an original human face content mask by using a convex hull algorithm; generating, based on the original human face content mask of the at least one image sample in the image sample set, an expanded human face content mask according to a mask expansion proportion, the expanded human face content mask belonging to the at least one content mask; and generating, based on the original human face content mask of the at least one image sample in the image sample set, a contracted human face content mask according to a mask contraction proportion, the contracted human face content mask belonging to the at least one content mask.
5. The method according to claim 4, wherein the generating a to-be-trained image set according to the content mask set comprises: covering the expanded human face content mask on a target image sample of the at least one image sample, to obtain a first mask image, wherein a region corresponding to the expanded human face content mask in the target image sample is set to a blank region; extracting image content of a region corresponding to the contracted human face content mask in the target image sample, to obtain a second mask image; and generating, by filling the second mask image into the blank region in the first mask image, one of the at least one to-be-trained image corresponding to the target image sample.
6. The method according to claim 1, wherein the determining the target loss function according to the first loss function and the second loss function comprises: calculating the target loss function in the following manner:
$$L(G,D) = E_{f,r}\left[L_r(G) + \lambda_s L_s(G,D)\right];$$
$$L_r(G) = \left\| m \otimes \left(f - G(o \oplus r)\right) \right\|_1;$$
$$L_s(G,D) = \log\left(D(r,f)\right) + \log\left(1 - D\left(r, G(o \oplus r)\right)\right);$$
wherein $L(G,D)$ represents the target loss function, $E$ represents an expected value calculation, $L_r(G)$ represents the first loss function, $L_s(G,D)$ represents the second loss function, $G(\cdot)$ represents the generator in the to-be-trained information synthesis model, $D(\cdot)$ represents a discriminator in the to-be-trained information synthesis model, $\lambda_s$ represents a first preset coefficient, $o$ represents the (N−1) frames of the to-be-trained images, $f$ represents an Nth frame of image sample, $r$ represents an Nth frame of to-be-trained image, $m$ represents a content mask of the Nth frame of to-be-trained image, $\otimes$ represents a per-pixel multiplication, and $\oplus$ represents the superposition of image frames.
7. The method according to claim 1, wherein: the training, based on the predicted image set and the image sample set, the to-be-trained information synthesis model by using a target loss function, to obtain an information synthesis model further comprises: determining a third loss function according to M frames of predicted images in the predicted image set and M frames of image samples in the image sample set, M being an integer greater than or equal to 1 and less than or equal to N; and determining the target loss func
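The target loss of claim 6 combines a masked L1 reconstruction term with an adversarial term weighted by a preset coefficient. The sketch below evaluates those two terms on toy values to make the arithmetic concrete. It is an illustration under assumptions, not the patented implementation: the generator and discriminator are not modeled, so the generator output and the discriminator scores are passed in as precomputed numbers; the expectation is approximated by a batch mean; the clamping constant `eps`, the coefficient value 0.1, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[float]]
# (content mask m, image sample f, generator output, D score on real pair, D score on generated pair)
Example = Tuple[Frame, Frame, Frame, float, float]


def masked_l1(mask: Frame, sample: Frame, predicted: Frame) -> float:
    """First loss term: L1 norm of the per-pixel product m * (f - generator output)."""
    return sum(
        abs(m * (f - g))
        for m_row, f_row, g_row in zip(mask, sample, predicted)
        for m, f, g in zip(m_row, f_row, g_row)
    )


def adversarial_term(d_real: float, d_fake: float, eps: float = 1e-12) -> float:
    """Second loss term: log(D(real pair)) + log(1 - D(generated pair)).
    Scores are clamped by eps (an assumption) to avoid log(0)."""
    return math.log(max(d_real, eps)) + math.log(max(1.0 - d_fake, eps))


def target_loss(batch: Sequence[Example], lambda_s: float) -> float:
    """Batch mean of L_r + lambda_s * L_s, standing in for the expectation."""
    total = 0.0
    for mask, sample, predicted, d_real, d_fake in batch:
        total += masked_l1(mask, sample, predicted) + lambda_s * adversarial_term(d_real, d_fake)
    return total / len(batch)


# Toy 2x2 frame: the mask zeroes out the top-right pixel.
mask = [[1.0, 0.0], [1.0, 1.0]]
sample = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[0.5, 0.0], [3.0, 5.0]]
loss = target_loss([(mask, sample, predicted, 0.5, 0.5)], lambda_s=0.1)
```

Because the mask multiplies the difference before the norm is taken, errors outside the content mask (the top-right pixel here) contribute nothing to the reconstruction term.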
Machine learning · CPC title
Organisation of the process, e.g. bagging or boosting · CPC title
characterised by the process organisation or structure, e.g. boosting cascade · CPC title
Validation; Performance evaluation; Active pattern learning techniques · CPC title
Local features and components; Facial parts (eye characteristics G06V40/18); Occluding parts, e.g. glasses; Geometrical relationships · CPC title