So I just read the Exploring Deepfakes book, very cool, so I left a good review.
But in there, of course, was the resolution doubling -> VRAM & compute x 4 rule, the colour space conversion RGB -> BGR, and the fact that you do the whole convolutional shebang for each of the three colours.
It also said that the mask itself is kept at a lower resolution.
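A quick sanity check of that scaling rule, just counting input values (a sketch; real VRAM use depends on the model too, but every conv layer's activations scale the same way with resolution):

```python
# Doubling H and W quadruples the pixel count every conv layer has to touch.
for res in (128, 256, 512):
    values = res * res * 3  # BGR: three full-resolution planes
    print(f"{res}x{res}: {values:,} values ({values // (128 * 128 * 3)}x the 128px cost)")
```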
Since, at least for me, almost all faces come from video and go back to video, why isn't YUV 4:2:0 used more?
That would keep only one plane, the luma (Y), at full resolution, with each of the two chroma planes (U and V) at 25% of the pixel count.
Think of all the VRAM and compute saved, or the possibility of increasing the output resolution without the heavy penalty (rough numbers below).
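A rough count of the input savings, assuming square crops with even dimensions (this only counts values per image, not the layers behind them):

```python
# Values per face image: BGR vs YUV 4:2:0.
def values_bgr(res):
    return 3 * res * res  # three full-resolution planes

def values_yuv420(res):
    return res * res + 2 * (res // 2) * (res // 2)  # full-res Y + two quarter-size chroma planes

for res in (128, 256, 512):
    b, y = values_bgr(res), values_yuv420(res)
    print(f"{res}px: BGR={b:,}  YUV 4:2:0={y:,}  ({y / b:.0%} of BGR)")
```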
I don't really see any reason why the same convolutions and dense layers wouldn't work on that instead of on BGR.
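As a sketch of what I mean (my own toy layout, not anything from the book; the names and sizes are made up): the full-res Y plane gets one strided conv, the half-res U/V planes get their own conv, and once both branches are at the same spatial size the usual stack just continues.

```python
import tensorflow as tf
from tensorflow.keras import layers

RES = 256  # assumed face crop size

y_in = layers.Input(shape=(RES, RES, 1), name="luma")               # Y at full resolution
uv_in = layers.Input(shape=(RES // 2, RES // 2, 2), name="chroma")  # U and V stacked, quarter pixel count

y = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(y_in)  # 256 -> 128
uv = layers.Conv2D(32, 5, padding="same", activation="relu")(uv_in)           # stays at 128

x = layers.Concatenate()([y, uv])  # same spatial size, so the ordinary conv stack continues
x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)

model = tf.keras.Model([y_in, uv_in], x)
model.summary()
```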
But people have probably already tried this, so why isn't it an option?
From what I have read, YUV isn't supported as such, but it seems fairly easy to map U and V to a grayscale-style plane and transform it back at the end.
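OpenCV can already do that round trip with its planar I420 (4:2:0) format, which is roughly the mapping I mean (the random image here is just a stand-in for a face crop; height and width must be even):

```python
import cv2
import numpy as np

bgr = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in for a face crop
h, w = bgr.shape[:2]

i420 = cv2.cvtColor(bgr, cv2.COLOR_BGR2YUV_I420)  # one grayscale-like buffer, shape (h * 3 // 2, w)

y = i420[:h]                                    # full-res luma plane, h x w
u = i420[h:h + h // 4].reshape(h // 2, w // 2)  # chroma planes at 25% of the pixel count
v = i420[h + h // 4:].reshape(h // 2, w // 2)

restored = cv2.cvtColor(i420, cv2.COLOR_YUV2BGR_I420)  # and back to BGR at the end
print(y.shape, u.shape, v.shape, restored.shape)
```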