I've been experimenting with Phaze A for a year now using Nvidia A100 cloud GPUs. I've tried a few common setups and one not-so-common one, and wanted to share some of my notes on how different model architectures affect results.
split fc layer, gblock enabled (not split), shared decoders:
This is probably the most popular setup and is the best choice if your A data has a lot of poses / angles that your B data lacks. The shared decoder is really good at filling in the blanks; however, there tends to be a fair amount of identity bleed.
shared fc layer, gblock enabled (not split), split decoder:
This has produced the worst results, and in my experience it has been the only setup to cause forehead discoloration when the hairlines differ.
split fc layer, gblock enabled (not split), split decoder:
This is the least common setup but my personal favorite when you have a good amount of B data. It produces strikingly accurate detail and is the closest thing to actually swapping the face, with zero identity bleed. The downside shows up when you don't have enough B data to fill in the blanks: the model does some frightening things when it only has the G-block (a GAN) to work with.
I also did a few experiments with a split G-block, but I didn't notice any significant improvement or degradation either way.
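For anyone who wants to see the toggles side by side, here's a minimal sketch of the three setups as Python dicts. The key names are my own shorthand for illustration, not the actual settings keys:

```python
# Hypothetical shorthand for the three setups above.
# Key names are illustrative only, not real config keys.
SHARED_DECODER = dict(split_fc=True, enable_gblock=True,
                      split_gblock=False, split_decoder=False)

SHARED_FC_SPLIT_DECODER = dict(split_fc=False, enable_gblock=True,
                               split_gblock=False, split_decoder=True)

FULLY_SPLIT = dict(split_fc=True, enable_gblock=True,
                   split_gblock=False, split_decoder=True)
```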
A few notes on what I've found to be ideal settings:
Encoder: EfficientNetV2-L has been amazing, and I've noticed a huge improvement over V1. I usually try to match the encoder scale to the output size.
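For context, this is roughly how EfficientNetV2-L drops in as a headless encoder with Keras (the 256x256 input size here is my assumption; match it to your own settings):

```python
import tensorflow as tf

# EfficientNetV2-L as a feature encoder (requires TF >= 2.8).
# The 256x256 input shape is an assumption; match it to your model.
encoder = tf.keras.applications.EfficientNetV2L(
    include_top=False,   # drop the ImageNet classification head
    weights="imagenet",
    input_shape=(256, 256, 3),
)
features = encoder.output  # 4D feature map handed to the bottleneck
```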
Bottleneck: Always go with Dense; I've tried both pooling options and they result in streaks of color and poor detail. A size of 512 has never let me down.
fc layer: Overcranking this can do more harm than good. With autoencoder models, you generally don't want this layer to be more detailed than the encoder feeding it. I've noticed better results with a dimension of 8 and 1280 nodes than with a dimension of 12 or 16. On that note, making this deeper (increasing the depth above 1) is unnecessary and a waste of VRAM; at least in my experience, it did nothing to improve the results and may have made them worse.
fc_dropout: I hardly ever use this, but the one time I did, it surprisingly sped up training massively (which seems counterintuitive).
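To make the shape bookkeeping concrete, here's a minimal sketch (my own, not the model's actual code) of a Dense 512 bottleneck feeding an fc layer at dimension 8 with 1280 nodes, with the optional dropout in between. It assumes the fc layer maps to a dim x dim x nodes tensor:

```python
from tensorflow.keras import layers

# Sketch only: assumes the fc layer reshapes to dim x dim x nodes.
def bottleneck_and_fc(feature_map, fc_dropout=0.0):
    x = layers.Flatten()(feature_map)
    x = layers.Dense(512)(x)                # dense bottleneck, size 512
    if fc_dropout:
        x = layers.Dropout(fc_dropout)(x)   # optional fc_dropout
    x = layers.Dense(8 * 8 * 1280)(x)       # fc: dimension 8, 1280 nodes
    return layers.Reshape((8, 8, 1280))(x)  # back to a spatial map
```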
Upsamplers: Since I was doing most of the training on a powerful GPU, I used subpixel for both upsamplers. 512 is probably a decent number of filters; I had this at 1280 but eventually dropped it to 512 and didn't notice any degradation in results.
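For the curious, subpixel upscaling is just a convolution followed by a pixel shuffle. A minimal sketch with the 512-filter setting (the kernel size of 3 is my assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Subpixel (pixel-shuffle) upscaler: the conv emits filters * scale^2
# channels, which depth_to_space rearranges into a 2x larger map.
def subpixel_upscale(x, filters=512, scale=2):
    x = layers.Conv2D(filters * scale ** 2, 3, padding="same")(x)
    return tf.nn.depth_to_space(x, scale)
```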
Decoders: Allocating more VRAM to these parameters will give you the most bang for your buck in terms of detail in the results. I noticed a huge increase in detail and quality by raising both the first and final filter counts. If you run into VRAM issues, making the slope of the filter curve steeper can save you VRAM. Adding an additional residual block (or two) also made a huge difference. I go with a kernel size of 3, but I've also used the default of 5 a few times; it's hard to say whether it made much of a positive difference because other parameters were also changed.
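For reference, the residual blocks I'm talking about are the standard conv-activation-conv-plus-skip pattern. A sketch with kernel size 3 (the LeakyReLU activation is my assumption):

```python
from tensorflow.keras import layers

# Standard residual block, kernel size 3. LeakyReLU is an assumption.
# Assumes x already has `filters` channels so the skip Add lines up.
def res_block(x, filters, kernel_size=3):
    shortcut = x
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.LeakyReLU(0.1)(x)
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.Add()([shortcut, x])
    return layers.LeakyReLU(0.1)(x)
```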
Loss functions:
As it says in the Training Guide, the choice you make here will have an outsized impact on your entire model. I've tried them all, and a combination of MS_SSIM and MAE (L1), both at 100%, has produced the best results. The weird quirk with MS_SSIM is that whenever I've tried to start a model with it, the model crashes (which I honestly can't explain), so I usually start with SSIM and then swap in MS_SSIM after 1k iterations. I also add a third loss function, ffl, at either 25% or 50%, and I think it has made a positive impact. I've tried the lpips options as tertiary losses and they completely ruined everything with the moiré pattern described in the settings. I get that in theory using one of those as a supplementary loss is supposed to help, but I have no idea how much weight to give it.
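For anyone wanting to reproduce the MS_SSIM + MAE combination outside the GUI, a rough sketch (images assumed in [0, 1]; the ffl term is left out):

```python
import tensorflow as tf

# MS-SSIM + MAE, both at full (100%) weight. Inputs are assumed to be
# in [0, 1] and large enough for MS-SSIM's five scales (~160 px+).
def combined_loss(y_true, y_pred):
    ms_ssim = 1.0 - tf.reduce_mean(
        tf.image.ssim_multiscale(y_true, y_pred, max_val=1.0))
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    return ms_ssim + mae
```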
Mixed Precision:
Last but not least, Mixed Precision. You love it and you hate it: it makes a huge difference in training speed and VRAM usage, but it's a frequent culprit behind NaNs. I did some research on Nvidia's website and found the holy grail of hidden information that cured the downside for me. It all comes down to the epsilon. Nvidia recommends increasing your epsilon by a factor of 1,000 when training with mixed precision, so instead of the default 1e-07 I use 1e-04. This has made a world of difference, with zero downside in terms of the model's ability to learn and, most importantly, no more NaNs.
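In Keras terms the whole recipe boils down to a couple of lines (Adam is my assumption here; use whatever optimizer you actually train with):

```python
import tensorflow as tf

# Mixed precision with the epsilon raised from 1e-07 to 1e-04.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-4)
# Loss scaling guards against float16 gradient underflow; Keras also
# applies this automatically when you compile under this policy.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
```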
These are just a few things I've noticed through trial and error; these findings are by no means scientific and would never pass peer review.
I usually train until loss convergence, which lands around 600k-800k iterations, with a batch size of 8 and a learning rate of 3e-05.