I've been experimenting with Phaze A for a year now using Nvidia A100 cloud GPUs. I've tried a few common setups and one not-so-common one, and wanted to share some of my notes on how different model architectures affect results.
split fc layer, gblock enabled (not split), shared decoders:
This is probably the most popular setup and is the best choice if your A data has a lot of poses / angles that your B data lacks. The shared decoder is really good at filling in the blanks; however, there tends to be a fair amount of identity bleed.
shared fc layer, gblock enabled (not split), split decoder:
This has produced the worst results, and in my experience it has been the only setup to cause forehead discoloration when the hairlines differ.
split fc layer, gblock enabled (not split), split decoder:
This is the least common setup but my personal favorite when you have a good amount of B data. It produces strikingly accurate detail and is the closest thing to actually swapping the face, with zero identity bleed. The downside shows up when you don't have enough B data to fill in the blanks: the model does some frightening things when it only has the G-block (a GAN) to work with.
I also did a few experiments with a split G-block, but I didn't notice any significant improvement or degradation either way.
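For anyone who wants to see the toggles side by side, here's a minimal sketch of the three setups as Python dicts. The key names are my own shorthand for illustration, not the actual settings keys:

```python
# Hypothetical shorthand for the three setups above.
# Key names are illustrative only, not real config keys.
SHARED_DECODER = dict(split_fc=True, enable_gblock=True,
                      split_gblock=False, split_decoder=False)

SHARED_FC_SPLIT_DECODER = dict(split_fc=False, enable_gblock=True,
                               split_gblock=False, split_decoder=True)

FULLY_SPLIT = dict(split_fc=True, enable_gblock=True,
                   split_gblock=False, split_decoder=True)
```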
A few notes on what I've found to be ideal settings:
Encoder: EfficientNetV2-L has been amazing, and I've noticed a huge improvement over V1. I usually try to match the encoder scale to the output size.
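For context, this is roughly how EfficientNetV2-L drops in as a headless encoder with Keras (the 256x256 input size here is my assumption; match it to your own settings):

```python
import tensorflow as tf

# EfficientNetV2-L as a feature encoder (requires TF >= 2.8).
# The 256x256 input shape is an assumption; match it to your model.
encoder = tf.keras.applications.EfficientNetV2L(
    include_top=False,   # drop the ImageNet classification head
    weights="imagenet",
    input_shape=(256, 256, 3),
)
features = encoder.output  # 4D feature map handed to the bottleneck
```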
Bottleneck: Always go with Dense; I've tried both pooling options and they result in streaks of color and poor detail. A size of 512 has never let me down.
fc layer: Overcranking this can do more harm than good. With autoencoder models, you generally don't want this layer to be more detailed than the encoder feeding it. I've noticed better results with a dimension of 8 and 1280 nodes than with a dimension of 12 or 16. On that note, making this deeper (increasing the depth above 1) is unnecessary and a waste of VRAM; at least in my experience, it did nothing to improve the results and may have made them worse.
fc_dropout: I hardly ever use this, but the one time I did, it surprisingly sped up training massively (which seems counterintuitive).
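To make the shape bookkeeping concrete, here's a minimal sketch (my own, not the model's actual code) of a Dense 512 bottleneck feeding an fc layer at dimension 8 with 1280 nodes, with the optional dropout in between. It assumes the fc layer maps to a dim x dim x nodes tensor:

```python
from tensorflow.keras import layers

# Sketch only: assumes the fc layer reshapes to dim x dim x nodes.
def bottleneck_and_fc(feature_map, fc_dropout=0.0):
    x = layers.Flatten()(feature_map)
    x = layers.Dense(512)(x)                # dense bottleneck, size 512
    if fc_dropout:
        x = layers.Dropout(fc_dropout)(x)   # optional fc_dropout
    x = layers.Dense(8 * 8 * 1280)(x)       # fc: dimension 8, 1280 nodes
    return layers.Reshape((8, 8, 1280))(x)  # back to a spatial map
```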
Upsamplers: Since I was doing most of the training on a powerful GPU, I used subpixel for both upsamplers. 512 is probably a decent number of filters; I had this at 1280 but eventually dropped it to 512 and didn't notice any degradation in results.
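For the curious, subpixel upscaling is just a convolution followed by a pixel shuffle. A minimal sketch with the 512-filter setting (the kernel size of 3 is my assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Subpixel (pixel-shuffle) upscaler: the conv emits filters * scale^2
# channels, which depth_to_space rearranges into a 2x larger map.
def subpixel_upscale(x, filters=512, scale=2):
    x = layers.Conv2D(filters * scale ** 2, 3, padding="same")(x)
    return tf.nn.depth_to_space(x, scale)
```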
Decoders: Allocating more VRAM to these parameters will give you the most bang for your buck in terms of detail in the results. I noticed a huge increase in detail and quality by raising both the first and final filter counts. If you run into VRAM issues, making the slope of the filter curve steeper can save you VRAM. Adding an additional residual block (or two) also made a huge difference. I go with a kernel size of 3, but I've also used the default of 5 a few times; it's hard to say whether it made much of a positive difference because other parameters were also changed.
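For reference, the residual blocks I'm talking about are the standard conv-activation-conv-plus-skip pattern. A sketch with kernel size 3 (the LeakyReLU activation is my assumption):

```python
from tensorflow.keras import layers

# Standard residual block, kernel size 3. LeakyReLU is an assumption.
# Assumes x already has `filters` channels so the skip Add lines up.
def res_block(x, filters, kernel_size=3):
    shortcut = x
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.LeakyReLU(0.1)(x)
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.Add()([shortcut, x])
    return layers.LeakyReLU(0.1)(x)
```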
Loss functions:
As it says in the Training Guide, the choice you make here will have an outsized impact on your entire model. I've tried them all, and a combination of MS_SSIM and MAE (L1), both at 100%, has produced the best results. The weird quirk with MS_SSIM is that whenever I've tried to start a model with it, the model crashes (which I honestly can't explain), so I usually start with SSIM and then swap in MS_SSIM after 1k iterations. I also add a third loss function, ffl, at either 25% or 50%, and I think it has made a positive impact. I've tried the lpips options as tertiary losses and they completely ruined everything with the moiré pattern described in the settings. I get that in theory using one of those as a supplementary loss is supposed to help, but I have no idea how much weight to give it.
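For anyone wanting to reproduce the MS_SSIM + MAE combination outside the GUI, a rough sketch (images assumed in [0, 1]; the ffl term is left out):

```python
import tensorflow as tf

# MS-SSIM + MAE, both at full (100%) weight. Inputs are assumed to be
# in [0, 1] and large enough for MS-SSIM's five scales (~160 px+).
def combined_loss(y_true, y_pred):
    ms_ssim = 1.0 - tf.reduce_mean(
        tf.image.ssim_multiscale(y_true, y_pred, max_val=1.0))
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    return ms_ssim + mae
```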
Mixed Precision:
Last but not least, Mixed Precision. You love it and you hate it: it makes a huge difference in training speed and VRAM usage, but it's a frequent culprit behind NaNs. I did some research on Nvidia's website and found the holy grail of hidden information that cured the downside for me. It all comes down to the epsilon. Nvidia recommends increasing your epsilon by a factor of 1,000 when training with mixed precision, so instead of the default 1e-07 I use 1e-04. This has made a world of difference, with zero downside in terms of the model's ability to learn and, most importantly, no more NaNs.
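In Keras terms the whole recipe boils down to a couple of lines (Adam is my assumption here; use whatever optimizer you actually train with):

```python
import tensorflow as tf

# Mixed precision with the epsilon raised from 1e-07 to 1e-04.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-4)
# Loss scaling guards against float16 gradient underflow; Keras also
# applies this automatically when you compile under this policy.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
```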
These are just a few things I've noticed through trial and error; these findings are by no means scientific and would never pass peer review.
I usually train until loss convergence, which lands around 600k-800k iterations, with a batch size of 8 and a learning rate of 3e-05.