We design a new architecture that supports 10+ control types in conditional text-to-image generation and can generate high-resolution images visually comparable with Midjourney. The network is based on the original ControlNet architecture, and we propose two new modules that: 1. Extend the original ControlNet to support different image conditions using the same network parameters. 2. Support multiple condition inputs without increasing the computational load, which is especially important for designers who want to edit images in detail; different conditions share the same condition encoder, so no extra computation or parameters are added. We run thorough experiments on SDXL and achieve superior performance in both control ability and aesthetic score. We release the method and the model to the open-source community so that everyone can enjoy them.
If you find it useful, please give me a star. Thank you very much!
The SDXL ProMax version has been released! Enjoy it!
I am sorry that, because the project's revenue and expenditure are difficult to balance, the GPU resources have been assigned to other projects that are more likely to be profitable. SD3 training is stopped until I find enough GPU support; I will try my best to find GPUs to continue training. If this brings you inconvenience, I sincerely apologize. I want to thank everyone who likes this project; your support is what keeps me going.
Note: we put the ProMax model, with a promax suffix, in the same Hugging Face model repo; detailed instructions will be added later.
The following examples go from 1M resolution to 9M resolution.
Uses bucket training like NovelAI, so it can generate high-resolution images of any aspect ratio.
Uses a large amount of high-quality data (over 10,000,000 images); the dataset covers a diverse range of situations.
Uses re-captioned prompts like DALL-E 3: CogVLM generates detailed descriptions, which gives good prompt-following ability.
Uses many useful tricks during training, including but not limited to data augmentation, multiple losses, and multi-resolution training.
Uses almost the same number of parameters as the original ControlNet, with no obvious increase in network parameters or computation.
Supports 10+ control conditions, with no obvious performance drop on any single condition compared with training independently.
Supports multi-condition generation; condition fusion is learned during training, so there is no need to set hyperparameters or design prompts.
Compatible with other open-source SDXL models, such as BluePencilXL and CounterfeitXL, and with other LoRA models.
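The bucket training mentioned in the list above can be pictured with a small sketch. This is only an illustration of aspect-ratio bucketing in general, not the project's training code: the function names, the 64-pixel step, the side limits, and the roughly 1024x1024 area budget are all assumptions chosen for the example.

```python
import math

def make_buckets(max_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (width, height) buckets whose area stays within max_area.

    All default values are illustrative assumptions.
    """
    buckets = set()
    width = min_side
    while width <= max_side:
        # Largest height (a multiple of step) that keeps width * height <= max_area.
        height = min(max_side, (max_area // width) // step * step)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # mirrored bucket for the other orientation
        width += step
    return sorted(buckets)

def assign_bucket(image_width, image_height, buckets):
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    target = math.log(image_width / image_height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))
```

Images assigned to the same bucket can then be resized to that bucket's size and batched together, so that each batch has a single resolution while the dataset as a whole keeps many aspect ratios.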
https://huggingface.co/xinsir/controlnet-openpose-sdxl-1.0
https://huggingface.co/xinsir/controlnet-scribble-sdxl-1.0
https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0
https://huggingface.co/xinsir/controlnet-canny-sdxl-1.0
[07/06/2024] Release ControlNet++ and pretrained models.
[07/06/2024] Release inference code (single condition & multi condition).
[07/13/2024] Release ProMax ControlNet++ with advanced editing function.
ControlNet++ for Gradio
ControlNet++ for ComfyUI
Release training code and training guidance.
Release arXiv paper.
One of the most important ControlNet models. We use many tricks in training this model; it is as good as https://huggingface.co/xinsir/controlnet-openpose-sdxl-1.0, with SOTA performance in pose control. To make the openpose model reach its best performance, you should replace the draw_bodypose function in the controlnet_aux package (ComfyUI has its own controlnet_aux package); refer to the Inference Scripts for details.
One of the most important ControlNet models. Canny is trained jointly with lineart, anime lineart, and mlsd. It performs robustly on any thin lines and is the key to decreasing the deformity rate; using thin lines to redraw hands/feet is recommended.
One of the most important ControlNet models. The scribble model supports any line width and any line type. It is as good as https://huggingface.co/xinsir/controlnet-scribble-sdxl-1.0, and makes everyone a soul painter.
Note: use a pose skeleton to control the human pose, and use thin lines to draw hand/foot details to avoid deformity.
Note: the depth image contains detailed information; it is recommended to use depth for the background and a pose skeleton for the foreground.
Note: Scribble is a strong line model; use it if you want to draw something without a strict outline. Openpose + Scribble gives you more freedom to generate your initial image, and you can then use thin lines to edit the details.
We collect a large amount of high-quality images. The images are carefully filtered and annotated, and cover a wide range of subjects, including photographs, anime, nature, Midjourney images, and so on.
We propose two new modules in ControlNet++, named Condition Transformer and Control Encoder, respectively. We also slightly modify an existing module to enhance its representation ability. In addition, we propose a unified training strategy to realize single and multi control in one stage.
For each condition, we assign a control type id, for example openpose--(1, 0, 0, 0, 0, 0) and depth--(0, 1, 0, 0, 0, 0); multiple conditions look like (openpose, depth)--(1, 1, 0, 0, 0, 0). In the Control Encoder, the control type id is converted to control type embeddings (using sinusoidal positional embeddings), and a single linear layer then projects the control type embeddings to the same dimension as the time embedding. The control type features are added to the time embedding to indicate the different control types. This simple setting helps the ControlNet distinguish different control types, because the time embedding tends to have a global effect on the entire network. Whether for a single condition or multiple conditions, there is a unique control type id corresponding to it.
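The Control Encoder steps described above (control type id, sinusoidal embedding, linear projection, addition to the time embedding) can be sketched in plain Python. This is a toy illustration, not the released implementation: the dimensions, the function names, and the randomly initialised projection (a stand-in for the learned linear layer) are all assumptions.

```python
import math
import random

def sinusoidal_embedding(values, dim):
    """Embed each scalar in `values` with `dim` sinusoidal features and concatenate."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    out = []
    for v in values:
        out.extend(math.sin(v * f) for f in freqs)
        out.extend(math.cos(v * f) for f in freqs)
    return out

def linear(x, weight, bias):
    """Plain dense layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

NUM_TYPES = 6        # length of the control type id
PER_TYPE_DIM = 8     # hypothetical embedding size per id entry
TIME_EMBED_DIM = 16  # hypothetical time embedding size

rng = random.Random(0)
in_dim = NUM_TYPES * PER_TYPE_DIM
# Randomly initialised stand-in for the learned projection layer.
proj_weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(TIME_EMBED_DIM)]
proj_bias = [0.0] * TIME_EMBED_DIM

def add_control_type(time_emb, control_type):
    """Return the time embedding with the projected control type features added."""
    type_emb = sinusoidal_embedding(control_type, PER_TYPE_DIM)
    type_feat = linear(type_emb, proj_weight, proj_bias)
    return [t + c for t, c in zip(time_emb, type_feat)]
```

For example, `add_control_type(time_emb, [1, 1, 0, 0, 0, 0])` yields a conditioning vector for openpose plus depth that differs from the one for either condition alone, which is what lets the rest of the network tell the control types apart.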
We extend the ControlNet to support multiple control inputs at the same time using the same network. The Condition Transformer is used to combine different image condition features. There are two major innovations in our method. First, different conditions share the same condition encoder, which makes the network simpler and more lightweight; this differs from other mainstream methods like T2I or UniControlNet. Second, we add a transformer layer to exchange information between the original image and the condition images; instead of using the output of the transformer directly, we use it to predict a condition bias for the original condition feature. This is somewhat like ResNet, and we found experimentally that this setting noticeably improves the performance of the network.
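The residual idea above (use the transformer output as a bias on the condition feature instead of as the feature itself) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: feature maps are nested lists of shape channels x positions, each map is mean-pooled to one token, and a single attention step with identity projections stands in for the learned transformer layer and bias projection.

```python
import math

def mean_pool(feature_map):
    """Collapse a (channels x positions) feature map to one value per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]

def self_attention(tokens):
    """Single-head attention with identity projections (stand-in for a learned layer)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    mixed = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w / total * value[c] for w, value in zip(weights, tokens))
            for c in range(len(query))
        ])
    return mixed

def fuse_conditions(source, conditions):
    """Add each condition plus its predicted per-channel bias to the source features."""
    # One token per condition, followed by one token for the source image features.
    tokens = [mean_pool(cond) for cond in conditions] + [mean_pool(source)]
    mixed = self_attention(tokens)
    fused = [list(channel) for channel in source]
    # zip stops after the condition tokens, so the source token yields no bias.
    for cond, bias in zip(conditions, mixed):
        for c, channel in enumerate(cond):
            for p, value in enumerate(channel):
                fused[c][p] += value + bias[c]  # residual: condition feature + bias
    return fused
```

Because the bias is added on top of the unchanged condition feature, the condition still passes through even if the attention output is uninformative, which is the sense in which the design resembles a ResNet shortcut.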
The original condition encoder of ControlNet is a stack of conv layers and SiLU activations. We don't change the encoder architecture; we just increase the conv channels to get a "fat" encoder. This noticeably improves the performance of the network. The reason is that we share the same encoder for all image conditions, so the encoder needs a higher representation ability. The original setting works well for a single condition but not as well for 10+ conditions. Note that using the original setting is also OK, just with some sacrifice in image generation quality.
Training with a single condition may be limited by data diversity. For example, openpose requires training on images with people and mlsd requires training on images with lines, which may affect performance when generating unseen objects. In addition, the training difficulty differs between conditions, so it is tricky to get all conditions to converge at the same time and to reach the best performance on each single condition. Finally, we tend to use two or more conditions at the same time; multi-condition training makes the fusion of different conditions smoother and increases the robustness of the network (as a single condition learns limited knowledge). We propose a unified training stage to achieve single-condition convergence and multi-condition fusion at the same time.
ControlNet++ requires a control type id to be passed to the network. We merge the 10+ controls into 6 control types; the meaning of each type is as follows:
0 -- openpose
1 -- depth
2 -- thick line(scribble/hed/softedge/ted-512)
3 -- thin line(canny/mlsd/lineart/animelineart/ted-1280)
4 -- normal
5 -- segment
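As a small illustration of the table above, the preprocessor names can be mapped onto the six slots to build the control type id. The dictionary and the helper name `control_type_vector` are hypothetical conveniences for this sketch and are not part of the released scripts.

```python
# Slot index for each preprocessor name, following the table above.
CONTROL_TYPE_INDEX = {
    "openpose": 0,
    "depth": 1,
    "scribble": 2, "hed": 2, "softedge": 2, "ted-512": 2,
    "canny": 3, "mlsd": 3, "lineart": 3, "animelineart": 3, "ted-1280": 3,
    "normal": 4,
    "segment": 5,
}

def control_type_vector(*names):
    """Return the 6-element control type id with a 1 for every requested type."""
    vector = [0] * 6
    for name in names:
        vector[CONTROL_TYPE_INDEX[name]] = 1  # KeyError for unknown names
    return vector
```

With this helper, `control_type_vector("openpose", "depth")` gives `[1, 1, 0, 0, 0, 0]`, and two names that share a slot (for example canny and lineart) set the same entry.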
We recommend Python >= 3.8. You can set up the virtual environment using the following commands:
conda create -n controlplus python=3.8
conda activate controlplus
pip install -r requirements.txt
Download the model weights from https://huggingface.co/xinsir/controlnet-union-sdxl-1.0. Any new model release will be put on Hugging Face; you can follow https://huggingface.co/xinsir to get the newest model info.
We provide an inference script for each control condition. Please refer to them for more detail.
There are some preprocessing differences. To get the best openpose-control performance, please do the following: find util.py in the controlnet_aux package and replace the draw_bodypose function with the following code:
def draw_bodypose(canvas: np.ndarray, keypoints: List[Keypoint]) -> np.ndarray:
    """
    Draw keypoints and limbs representing body pose on a given canvas.

    Args:
        canvas (np.ndarray): A 3D numpy array representing the canvas (image) on which to draw the body pose.
        keypoints (List[Keypoint]): A list of Keypoint objects representing the body keypoints to be drawn.

    Returns:
        np.ndarray: A 3D numpy array representing the modified canvas with the drawn body pose.

    Note:
        The function expects the x and y coordinates of the keypoints to be normalized between 0 and 1.
    """
    H, W, C = canvas.shape

    # Scale limb width and joint radius with the longer image side.
    if max(W, H) < 500:
        ratio = 1.0
    elif max(W, H) >= 500 and max(W, H) < 1000:
        ratio = 2.0
    elif max(W, H) >= 1000 and max(W, H) < 2000:
        ratio = 3.0
    elif max(W, H) >= 2000 and max(W, H) < 3000:
        ratio = 4.0
    elif max(W, H) >= 3000 and max(W, H) < 4000:
        ratio = 5.0
    elif max(W, H) >= 4000 and max(W, H) < 5000:
        ratio = 6.0
    else:
        ratio = 7.0

    stickwidth = 4

    # Pairs of 1-based keypoint indices that form each limb.
    limbSeq = [
        [2, 3], [2, 6], [3, 4], [4, 5],
        [6, 7], [7, 8], [2, 9], [9, 10],
        [10, 11], [2, 12], [12, 13], [13, 14],
        [2, 1], [1, 15], [15, 17], [1, 16],
        [16, 18],
    ]

    colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0],
              [0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255],
              [170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]

    # Draw each limb as a filled ellipse between its two keypoints.
    for (k1_index, k2_index), color in zip(limbSeq, colors):
        keypoint1 = keypoints[k1_index - 1]
        keypoint2 = keypoints[k2_index - 1]

        if keypoint1 is None or keypoint2 is None:
            continue

        # Note the naming: Y holds pixel x coordinates, X holds pixel y coordinates.
        Y = np.array([keypoint1.x, keypoint2.x]) * float(W)
        X = np.array([keypoint1.y, keypoint2.y]) * float(H)
        mX = np.mean(X)
        mY = np.mean(Y)
        length = ((X[0] - X[1]) ** 2 + (Y[0] - Y[1]) ** 2) ** 0.5
        angle = math.degrees(math.atan2(X[0] - X[1], Y[0] - Y[1]))
        polygon = cv2.ellipse2Poly((int(mY), int(mX)), (int(length / 2), int(stickwidth * ratio)), int(angle), 0, 360, 1)
        cv2.fillConvexPoly(canvas, polygon, [int(float(c) * 0.6) for c in color])

    # Draw each joint as a filled circle.
    for keypoint, color in zip(keypoints, colors):
        if keypoint is None:
            continue

        x, y = keypoint.x, keypoint.y
        x = int(x * W)
        y = int(y * H)
        cv2.circle(canvas, (int(x), int(y)), int(4 * ratio), color, thickness=-1)

    return canvas
For single-condition inference, you should provide a prompt and a control image, and change the corresponding lines in the Python file.
python controlnet_union_test_openpose.py
For multi-condition inference, you should ensure your input image_list is compatible with your control_type. For example, if you want to use openpose and depth control, image_list --> [controlnet_img_pose, controlnet_img_depth, 0, 0, 0, 0], control_type --> [1, 1, 0, 0, 0, 0]. Refer to controlnet_union_test_multi_control.py for more detail.
In theory, you don't need to set the condition scale for different conditions; the network is designed and trained to fuse different conditions naturally. The default setting is 1.0 for each condition input, the same as in multi-condition training.
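The pairing between image_list and control_type can be sketched as follows. The helper names (`build_inputs`, `is_consistent`) and the slot labels are hypothetical and only illustrate the layout described above; the strings in the usage note stand in for real control images.

```python
# Slot order follows the six control types; the labels themselves are illustrative.
SLOTS = ["openpose", "depth", "thick_line", "thin_line", "normal", "segment"]

def build_inputs(images_by_slot):
    """Build (image_list, control_type) from a {slot_name: image} mapping."""
    image_list = [0] * len(SLOTS)    # 0 marks an unused slot
    control_type = [0] * len(SLOTS)
    for slot, image in images_by_slot.items():
        index = SLOTS.index(slot)    # raises ValueError for unknown slot names
        image_list[index] = image
        control_type[index] = 1
    return image_list, control_type

def is_placeholder(item):
    """True for the integer 0 used to mark an unused slot."""
    return isinstance(item, int) and item == 0

def is_consistent(image_list, control_type):
    """True when every active flag has an image and every inactive slot holds 0."""
    return len(image_list) == len(control_type) and all(
        (flag == 1) != is_placeholder(image)
        for image, flag in zip(image_list, control_type)
    )
```

Calling `build_inputs({"openpose": pose_img, "depth": depth_img})` produces the two lists from the example above, and `is_consistent` can be used as a quick sanity check before running inference.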
However, if you want to increase the effect of a certain input condition, you can adjust the condition scales in the Condition Transformer module. In that module, the input conditions are added to the source image features along with the bias prediction; multiplying a condition by a certain scale will have a large effect (but may cause unpredictable results).
python controlnet_union_test_multi_control.py
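To make the condition scale discussion concrete, here is a toy sketch of one way a per-condition scale could enter the fusion. It assumes, purely for illustration, that each condition and its predicted bias are multiplied by the scale before being added to the source features; the function name and the flat-list feature representation are hypothetical, and where exactly the scale is applied in the released code should be checked in the inference scripts.

```python
def fuse_with_scales(source, conditions, biases, scales=None):
    """Toy fusion: source + sum_i scale_i * (condition_i + bias_i).

    source and each condition are flat lists of floats; biases holds one float
    per condition (a stand-in for the predicted condition bias).
    """
    if scales is None:
        scales = [1.0] * len(conditions)  # default scale of 1.0 per condition
    fused = list(source)
    for cond, bias, scale in zip(conditions, biases, scales):
        for i, value in enumerate(cond):
            fused[i] += scale * (value + bias)
    return fused
```

With the default scales every condition contributes as in training; raising one scale, for example `scales=[2.0, 1.0]`, doubles that condition's contribution, which moves the features away from what the network saw during training and is consistent with the warning above about unpredictable results.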