Faceswap Forum

Posted: **Thu Aug 01, 2019 6:23 am**

Hi Team,

Here I am again, this time reporting a crash. The crash report is below.
Please help me!

Thanks again,
Vicky

08/01/2019 11:43:26 MainProcess     training_0      _base           __init__                  DEBUG    Initialized Trainer
08/01/2019 11:43:26 MainProcess     training_0      train           load_trainer              DEBUG    Loaded Trainer
08/01/2019 11:43:26 MainProcess     training_0      train           run_training_cycle        DEBUG    Running Training Cycle
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch                 DEBUG    Launching minibatch generator for queue (side: 'a', is_display: False)
08/01/2019 11:43:26 MainProcess     training_0      _base           generate_preview          DEBUG    Generating preview
08/01/2019 11:43:26 MainProcess     training_0      _base           set_preview_feed          DEBUG    Setting preview feed: (side: 'a')
08/01/2019 11:43:26 MainProcess     training_0      _base           load_generator            DEBUG    Loading generator: a
08/01/2019 11:43:26 MainProcess     training_0      _base           load_generator            DEBUG    input_size: 64, output_shapes: [(64, 64, 3)]
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing TrainingDataGenerator: (model_input_size: 64, model_output_shapes: [(64, 64, 3)], training_opts: {'alignments': {'a': '/home/vicky/Vicky/Projects/facial/faceswap/dataset/cageO/alignments.json', 'b': '/home/vicky/Vicky/Projects/facial/faceswap/dataset/trumpO/alignments.json'}, 'preview_scaling': 0.5, 'warp_to_landmarks': False, 'augment_color': True, 'no_flip': False, 'pingpong': False, 'snapshot_interval': 25000, 'training_size': 256, 'no_logs': False, 'mask_type': None, 'coverage_ratio': 0.625}, landmarks: False, config: {'mask_type': None, 'icnr_init': False, 'conv_aware_init': False, 'subpixel_upscaling': False, 'reflect_padding': False, 'dssim_loss': True, 'penalized_mask_loss': True, 'preview_images': 14, 'zoom_amount': 5, 'rotation_range': 10, 'shift_range': 5, 'flip_chance': 50, 'color_lightness': 30, 'color_ab': 8, 'color_clahe_chance': 50, 'color_clahe_max_size': 4})
08/01/2019 11:43:26 MainProcess     training_0      training_data   set_mask_class            DEBUG    Mask class: None
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing ImageManipulation: (input_size: 64, output_shapes: [(64, 64, 3)], coverage_ratio: 0.625, config: {'mask_type': None, 'icnr_init': False, 'conv_aware_init': False, 'subpixel_upscaling': False, 'reflect_padding': False, 'dssim_loss': True, 'penalized_mask_loss': True, 'preview_images': 14, 'zoom_amount': 5, 'rotation_range': 10, 'shift_range': 5, 'flip_chance': 50, 'color_lightness': 30, 'color_ab': 8, 'color_clahe_chance': 50, 'color_clahe_max_size': 4})
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Output sizes: [64]
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized ImageManipulation
08/01/2019 11:43:26 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized TrainingDataGenerator
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Queue batches: (image_count: 319, batchsize: 14, side: 'a', do_shuffle: True, is_preview, True, is_timelapse: False)
08/01/2019 11:43:26 MainProcess     training_0      training_data   make_queues               DEBUG    ['preview_a_in', 'preview_a_out']
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager getting: 'preview_a_in'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager adding: (name: 'preview_a_in', maxsize: 0)
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager added: (name: 'preview_a_in')
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager got: 'preview_a_in'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager getting: 'preview_a_out'
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager adding: (name: 'preview_a_out', maxsize: 0)
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager added: (name: 'preview_a_out')
08/01/2019 11:43:26 MainProcess     training_0      queue_manager   get_queue                 DEBUG    QueueManager got: 'preview_a_out'
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Batch shapes: [(14, 256, 256, 3), (14, 64, 64, 3), (14, 64, 64, 3)]
08/01/2019 11:43:26 MainProcess     training_0      multithreading  __init__                  DEBUG    Initializing FixedProducerDispatcher: (method: '<bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f13347c59b0>>', shapes: [(14, 256, 256, 3), (14, 64, 64, 3), (14, 64, 64, 3)], ctype: <class 'ctypes.c_float'>, workers: 1, buffers: None)
08/01/2019 11:43:26 MainProcess     training_0      multithreading  __init__                  DEBUG    Initialized FixedProducerDispatcher
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Batching to queue: (side: 'a', is_display: True)
08/01/2019 11:43:26 MainProcess     training_0      _base           set_preview_feed          DEBUG    Set preview feed. Batchsize: 14
08/01/2019 11:43:26 MainProcess     training_0      training_data   minibatch                 DEBUG    Launching minibatch generator for queue (side: 'a', is_display: True)
08/01/2019 11:43:26 SpawnProcess-4  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f6ebee575c0>> started
08/01/2019 11:43:26 SpawnProcess-4  MainThread      training_data   load_batches              DEBUG    Loading batch: (image_count: 319, side: 'a', is_display: True, do_shuffle: True)
08/01/2019 11:43:26 MainProcess     training_0      _base           largest_face_index        DEBUG    0
08/01/2019 11:43:26 MainProcess     training_0      deprecation     new_func                  WARNING  From /home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\nInstructions for updating:\nUse tf.cast instead.
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
08/01/2019 11:43:29 SpawnProcess-2  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 128, side: 'a', is_display: False)
08/01/2019 11:43:29 SpawnProcess-2  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f60dff6a550>> shutdown
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joining FixedProducerDispatcher
08/01/2019 11:43:29 SpawnProcess-3  MainThread      training_data   load_batches              DEBUG    Finished batching: (epoch: 128, side: 'b', is_display: False)
08/01/2019 11:43:29 SpawnProcess-3  MainThread      multithreading  _runner                   DEBUG    FixedProducerDispatcher worker for <bound method TrainingDataGenerator.load_batches of <lib.training_data.TrainingDataGenerator object at 0x7f919a390550>> shutdown
08/01/2019 11:43:29 MainProcess     training_0      training_data   join_subprocess           DEBUG    Joined FixedProducerDispatcher
08/01/2019 11:43:29 MainProcess     training_0      multithreading  run                       DEBUG    Error in thread (training_0): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.\n	 [[{{node encoder/conv_0_conv2d/convolution}}]]\n	 [[{{node decoder_a/face_out/Sigmoid-2-0-TransposeNCHWToNHWC-LayoutOptimizer}}]]
08/01/2019 11:43:29 MainProcess     MainThread      train           monitor                   DEBUG    Thread error detected
08/01/2019 11:43:29 MainProcess     MainThread      train           monitor                   DEBUG    Closed Monitor
08/01/2019 11:43:29 MainProcess     MainThread      train           end_thread                DEBUG    Ending Training thread
08/01/2019 11:43:29 MainProcess     MainThread      train           end_thread                CRITICAL Error caught! Exiting...
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Threads: 'training'
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      DEBUG    Joining Thread: 'training_0'
08/01/2019 11:43:29 MainProcess     MainThread      multithreading  join                      ERROR    Caught exception in thread: 'training_0'
Traceback (most recent call last):
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/cli.py", line 122, in execute_script
    process.process()
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 98, in process
    self.end_thread(thread, err)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 124, in end_thread
    thread.join()
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/multithreading.py", line 460, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "/home/vicky/Vicky/Projects/facial/faceswap/lib/multithreading.py", line 391, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 150, in training
    raise err
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 140, in training
    self.run_training_cycle(model, trainer)
  File "/home/vicky/Vicky/Projects/facial/faceswap/scripts/train.py", line 222, in run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 211, in train_one_step
    raise err
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 176, in train_one_step
    loss[side] = batcher.train_one_batch(do_preview)
  File "/home/vicky/Vicky/Projects/facial/faceswap/plugins/train/trainer/_base.py", line 276, in train_one_batch
    loss = self.model.predictors[self.side].train_on_batch(*batch)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/vicky/miniconda3/envs/env_faceswap/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node encoder/conv_0_conv2d/convolution}}]]
	 [[{{node decoder_a/face_out/Sigmoid-2-0-TransposeNCHWToNHWC-LayoutOptimizer}}]]

============ System Information ============
encoding:            UTF-8
git_branch:          master
git_commits:         c3adc93 Update GUI Graph + Stats when model has finished saving. 2610eff Bugfix: GUI: Progress bar on times over 1 hour (extract/convert). c1c60a9 bugfix: Clip output from scaling in convert. 8b2f166 Update helptext for CA Initialization. b6c830c Bugfix: Alignments tool: Correctly set items attribute on Check job
gpu_cuda:            10.1
gpu_cudnn:           7.6.0
gpu_devices:         GPU_0: GeForce RTX 2080 Ti
gpu_devices_active:  GPU_0
gpu_driver:          418.56
gpu_vram:            GPU_0: 10986MB
os_machine:          x86_64
os_platform:         Linux-4.18.0-25-generic-x86_64-with-debian-buster-sid
os_release:          4.18.0-25-generic
py_command:          /home/vicky/Vicky/Projects/facial/faceswap/faceswap.py train -A /home/vicky/Vicky/Projects/facial/faceswap/dataset/cageO -B /home/vicky/Vicky/Projects/facial/faceswap/dataset/trumpO -m /home/vicky/Vicky/Projects/facial/faceswap/dataset/trump-cage-model -t original -s 100 -ss 25000 -bs 64 -it 1000000 -g 1 -ps 50 -L INFO -gui
py_conda_version:    conda 4.7.10
py_implementation:   CPython
py_version:          3.6.6
py_virtual_env:      True
sys_cores:           8
sys_processor:       x86_64
sys_ram:             Total: 32102MB, Available: 20682MB, Used: 10020MB, Free: 2429MB

=============== Pip Packages ===============
absl-py==0.7.1
astor==0.7.1
astroid==2.2.5
certifi==2019.6.16
cloudpickle==1.2.1
cycler==0.10.0
cytoolz==0.10.0
dask==2.1.0
decorator==4.4.0
fastcluster==1.1.25
ffmpy==0.2.2
gast==0.2.2
google-pasta==0.1.7
grpcio==1.14.1
h5py==2.9.0
imageio==2.5.0
imageio-ffmpeg==0.3.0
isort==4.3.21
joblib==0.13.2
Keras==2.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
lazy-object-proxy==1.4.1
Markdown==3.1.1
matplotlib==2.2.2
mccabe==0.6.1
mock==3.0.5
networkx==2.3
numpy==1.16.2
nvidia-ml-py3==7.352.1
olefile==0.46
opencv-python==4.1.0.25
pathlib==1.0.1
Pillow==5.1.0
protobuf==3.8.0
psutil==5.6.3
pylint==2.3.1
pyparsing==2.4.0
python-dateutil==2.8.0
pytz==2019.1
PyWavelets==1.0.3
PyYAML==5.1.1
scikit-image==0.15.0
scikit-learn==0.21.2
scipy==1.3.0
six==1.12.0
tensorboard==1.13.1
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
toolz==0.10.0
toposort==1.5
tornado==6.0.3
tqdm==4.32.1
typed-ast==1.4.0
Werkzeug==0.15.4
wrapt==1.11.2

============== Conda Packages ==============
# packages in environment at /home/vicky/miniconda3/envs/env_faceswap:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_tflow_select             2.1.0                       gpu
absl-py                   0.7.1                    py36_0
astor                     0.7.1                    py36_0
blas                      1.0                    openblas
bzip2                     1.0.8                h516909a_0    conda-forge
c-ares                    1.15.0            h7b6447c_1001
ca-certificates           2019.5.15                     0
cairo                     1.14.12              h77bcde2_0
certifi                   2019.6.16                py36_1
cloudpickle               1.2.1                      py_0
cudatoolkit               10.0.130                      0
cudnn                     7.6.0                cuda10.0_0
cupti                     10.0.130                      0
cycler                    0.10.0                   py36_0
cytoolz                   0.10.0           py36h7b6447c_0
dask-core                 2.1.0                      py_0
dbus                      1.13.2               hc3f9b76_0
decorator                 4.4.0                    py36_1
expat                     2.2.5             he1b5a44_1003    conda-forge
ffmpeg                    4.0                  h04d0a96_0
fontconfig                2.12.6               h49f89f6_0
freetype                  2.8                  hab7d2ae_1
gast                      0.2.2                    py36_0
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
giflib                    5.1.9                h516909a_0    conda-forge
glib                      2.53.6               h5d9569c_2
gmp                       6.1.2             hf484d3e_1000    conda-forge
gnutls                    3.6.5             hd3a4fd2_1002    conda-forge
google-pasta              0.1.7                      py_0
graphite2                 1.3.13            hf484d3e_1000    conda-forge
grpcio                    1.14.1           py36h9ba97e2_0
gst-plugins-base          1.12.4               h33fb286_0
gstreamer                 1.12.4               hb53b477_0
h5py                      2.9.0                    pypi_0    pypi
harfbuzz                  1.7.6                hc5b324e_0
hdf5                      1.10.2               hba1933b_1
icu                       58.2                 h9c2bf20_1
imageio                   2.5.0                    py36_0
jasper                    1.900.1           h07fcdf6_1006    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
keras                     2.2.4                         0
keras-applications        1.0.8                      py_0
keras-base                2.2.4                    py36_0
keras-preprocessing       1.1.0                      py_1
kiwisolver                1.1.0            py36he6710b0_0
lame                      3.100             h14c3975_1001    conda-forge
libblas                   3.8.0               10_openblas    conda-forge
libcblas                  3.8.0               10_openblas    conda-forge
libedit                   3.1.20181209         hc058e9b_0
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libiconv                  1.15              h516909a_1005    conda-forge
liblapack                 3.8.0               10_openblas    conda-forge
liblapacke                3.8.0               10_openblas    conda-forge
libopenblas               0.3.6                h6e990d7_6    conda-forge
libopus                   1.3                  h7b6447c_0
libpng                    1.6.37               hed695b0_0    conda-forge
libprotobuf               3.8.0                hd408876_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.0.10            h57b8799_1003    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libvpx                    1.7.0                h439df22_0
libwebp                   1.0.2                h576950b_1    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.9                hea5a465_1
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
markdown                  3.1.1                    py36_0
matplotlib                2.2.2            py36h0e671d2_1
mock                      3.0.5                    py36_0
ncurses                   6.1                  he6710b0_1
nettle                    3.4.1             h1bed415_1002    conda-forge
networkx                  2.3                        py_0
numpy                     1.16.4           py36h95a1406_0    conda-forge
olefile                   0.46                     py36_0
openblas                  0.3.6                h6e990d7_6    conda-forge
opencv                    3.4.1            py36h6fd60c2_1
openh264                  1.8.0             hdbcaa40_1000    conda-forge
openssl                   1.0.2s               h7b6447c_0
pathlib                   1.0.1                    py36_1
pcre                      8.41              hf484d3e_1003    conda-forge
pillow                    5.1.0            py36h3deb7b8_0
pip                       19.1.1                   py36_0
pixman                    0.38.0            h516909a_1003    conda-forge
protobuf                  3.8.0            py36he6710b0_0
pthread-stubs             0.4               h14c3975_1001    conda-forge
pyparsing                 2.4.0                      py_0
pyqt                      5.9.2            py36h751905a_0
python                    3.6.6                h6e4f718_2
python-dateutil           2.8.0                    py36_0
pytz                      2019.1                     py_0
pywavelets                1.0.3            py36hdd07704_1
pyyaml                    5.1.1            py36h7b6447c_0
qt                        5.9.4                h4e5bff0_0
readline                  7.0                  h7b6447c_5
scikit-image              0.15.0           py36he6710b0_0
scipy                     1.3.0            py36he2b7bc3_0
setuptools                41.0.1                   py36_0
sip                       4.19.8           py36hf484d3e_0
six                       1.12.0                   py36_0
sqlite                    3.29.0               h7b6447c_0
tensorboard               1.13.1           py36hf484d3e_0
tensorflow                1.13.1          gpu_py36h3991807_0
tensorflow-base           1.13.1          gpu_py36h8d69cac_0
tensorflow-estimator      1.13.0                     py_0
tensorflow-gpu            1.13.1               h0d30ee6_0
termcolor                 1.1.0                    py36_1
tk                        8.6.8                hbc83047_0
toolz                     0.10.0                     py_0
tornado                   6.0.3            py36h7b6447c_0
tqdm                      4.32.1                     py_0
werkzeug                  0.15.4                     py_0
wheel                     0.33.4                   py36_0
wrapt                     1.11.2           py36h7b6447c_0
x264                      1!152.20180806       h14c3975_0    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.8                h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.4                h14c3975_4
yaml                      0.1.7                had09818_2
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.0                h3b9ef0a_0    conda-forge

Posted: **Thu Aug 01, 2019 10:30 am**

First, try a reboot. This error normally means something has gone wrong in CUDA/cuDNN that needs a reboot to fix.

Posted: **Sat Jun 20, 2020 5:17 pm**

Hello, I've just taken delivery of a new 2060 Super.

2060 Super is the current recommended baseline hardware for Faceswap. I assumed I'd be able to move up to slightly more hefty models than DFaker or Original, compared to the RX 580 I was using. I have the latest Studio drivers installed.

I'm trying to train with DLight on absolutely stock default settings, except for 80% face coverage and the extended mask.

Unless I use a batch size of 2 (!!), I get out of memory errors.
With a BS of 2, I'm getting EGs around 3-4.
This is under Windows 10, not running anything else in the background.
Ryzen 7 3700X, 32GB RAM.

I've had CUDA sync errors, and just now got "CUDA_ERROR_LAUNCH_FAILED".

Any suggestions would be appreciated.

Posted: **Sat Jun 20, 2020 5:20 pm**

Windows 10 does reserve a hefty amount of memory for itself even if nothing else is running. That may be cutting into what you have significantly, but I think you should be able to get batch sizes higher than 2. If you run nvidia-smi (try inside your faceswap env; if it isn't found there, it's also somewhere inside the c:/nvidia folder), it should tell you how much VRAM is available before training starts.
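
If you'd rather script that check than read the nvidia-smi table by eye, here is a minimal sketch. It assumes nvidia-smi is on your PATH; the helper names `parse_memory_line` and `query_gpu_memory` are made up for this example and are not part of faceswap.

```python
import subprocess

# nvidia-smi's CSV query mode; values are reported in MiB when "nounits" is set.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (MiB) into a dict with the free amount."""
    used, total = (int(field.strip()) for field in line.split(","))
    return {"used_mib": used, "total_mib": total, "free_mib": total - used}


def query_gpu_memory():
    """Run nvidia-smi and return one dict per GPU (hypothetical helper)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_memory_line(line) for line in result.stdout.splitlines() if line.strip()]
```

Call `query_gpu_memory()` right before you start training and look at `free_mib` for the card you train on.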

You may also benefit from "allow-growth", which sometimes works around memory issues, but keep an eye on it: training starting successfully is no guarantee that it will stay stable.
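
For reference, this is roughly what an allow-growth setting does at the TensorFlow level. It is only a sketch: in faceswap you should use its own allow-growth option, and I'm assuming here that your TensorFlow build honours the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable (recent 1.x and 2.x releases do). The function name is made up for the example.

```python
import os


def enable_gpu_memory_growth():
    """Ask TensorFlow to grow its GPU allocation on demand.

    By default TensorFlow reserves nearly all VRAM up front. The variable
    must be set before TensorFlow is imported, or it has no effect.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


enable_gpu_memory_growth()
# import tensorflow as tf  # import only after the variable is set
```

This only changes how memory is claimed, not how much the model needs, so a batch that doesn't fit will still run out of memory.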

Posted: **Sat Jun 20, 2020 5:27 pm**

Thanks Bryan.

nvidia-smi reports only 939MiB of 8192MiB is used. Well, that is with Firefox open.
I will try with "allow growth".

I don't mind slower training, but this is disappointing so far: I never had ONE single memory related error on my RX 580.

I just had this crash:

2020-06-20 18:22:29.652652: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-06-20 18:22:29.652806: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

Posted: **Sat Jun 20, 2020 5:29 pm**

Those errors could simply mean that the data couldn't be loaded onto the card, or that the drivers are misbehaving. My advice is to do the following, in this order: install the latest drivers, reboot, close all other applications, and then try again.

Posted: **Sat Jun 20, 2020 5:33 pm**

Trying with allow growth, no other settings changed. Getting a lot of this:

2020-06-20 18:29:20.692407: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

There's more:

2020-06-20 18:29:10.471097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-06-20 18:29:10.650982: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-06-20 18:29:11.485247: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-06-20 18:29:11.611848: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-06-20 18:29:18.005698: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.11GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
(... a few of these)
2020-06-20 18:29:18.460946: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2020-06-20 18:29:19.920119: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-06-20 18:29:19.946687: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(... many of these)
2020-06-20 18:29:21.211562: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
(... a few of these)
2020-06-20 18:29:37.043250: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 2.49G (2673629440 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-06-20 18:29:37.043354: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.21MiB (rounded to 1267200).  Current allocation summary follows.
2020-06-20 18:29:37.043476: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): 	Total Chunks: 261, Chunks in use: 261. 65.3KiB allocated for chunks. 65.3KiB in use in bin. 9.5KiB client-requested in use in bin.
2020-06-20 18:29:37.043595: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): 	Total Chunks: 51, Chunks in use: 51. 25.5KiB allocated for chunks. 25.5KiB in use in bin. 25.5KiB client-requested in use in bin.
2020-06-20 18:29:37.043701: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): 	Total Chunks: 41, Chunks in use: 41. 41.8KiB allocated for chunks. 41.8KiB in use in bin. 41.0KiB client-requested in use in bin.
2020-06-20 18:29:37.043810: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): 	Total Chunks: 63, Chunks in use: 63. 126.0KiB allocated for chunks. 126.0KiB in use in bin. 126.0KiB client-requested in use in bin.
2020-06-20 18:29:37.043918: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): 	Total Chunks: 17, Chunks in use: 17. 93.0KiB allocated for chunks. 93.0KiB in use in bin. 93.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044028: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): 	Total Chunks: 13, Chunks in use: 13. 123.5KiB allocated for chunks. 123.5KiB in use in bin. 121.9KiB client-requested in use in bin.
2020-06-20 18:29:37.044411: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): 	Total Chunks: 13, Chunks in use: 13. 224.5KiB allocated for chunks. 224.5KiB in use in bin. 224.5KiB client-requested in use in bin.
2020-06-20 18:29:37.044538: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): 	Total Chunks: 4, Chunks in use: 4. 144.0KiB allocated for chunks. 144.0KiB in use in bin. 144.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044657: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): 	Total Chunks: 11, Chunks in use: 11. 736.0KiB allocated for chunks. 736.0KiB in use in bin. 736.0KiB client-requested in use in bin.
2020-06-20 18:29:37.044784: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): 	Total Chunks: 23, Chunks in use: 23. 3.94MiB allocated for chunks. 3.94MiB in use in bin. 3.89MiB client-requested in use in bin.
2020-06-20 18:29:37.044911: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): 	Total Chunks: 1, Chunks in use: 1. 384.0KiB allocated for chunks. 384.0KiB in use in bin. 384.0KiB client-requested in use in bin.
2020-06-20 18:29:37.045036: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): 	Total Chunks: 34, Chunks in use: 34. 19.35MiB allocated for chunks. 19.35MiB in use in bin. 18.88MiB client-requested in use in bin.
2020-06-20 18:29:37.045162: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): 	Total Chunks: 8, Chunks in use: 8. 9.67MiB allocated for chunks. 9.67MiB in use in bin. 9.67MiB client-requested in use in bin.
2020-06-20 18:29:37.045287: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): 	Total Chunks: 25, Chunks in use: 25. 57.40MiB allocated for chunks. 57.40MiB in use in bin. 56.25MiB client-requested in use in bin.
2020-06-20 18:29:37.045394: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): 	Total Chunks: 19, Chunks in use: 19. 94.38MiB allocated for chunks. 94.38MiB in use in bin. 92.63MiB client-requested in use in bin.
2020-06-20 18:29:37.045504: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): 	Total Chunks: 63, Chunks in use: 63. 566.00MiB allocated for chunks. 566.00MiB in use in bin. 562.50MiB client-requested in use in bin.
2020-06-20 18:29:37.045613: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): 	Total Chunks: 18, Chunks in use: 18. 380.05MiB allocated for chunks. 380.05MiB in use in bin. 374.26MiB client-requested in use in bin.
2020-06-20 18:29:37.045723: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): 	Total Chunks: 1, Chunks in use: 1. 32.00MiB allocated for chunks. 32.00MiB in use in bin. 18.00MiB client-requested in use in bin.
2020-06-20 18:29:37.045834: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): 	Total Chunks: 15, Chunks in use: 15. 1.50GiB allocated for chunks. 1.50GiB in use in bin. 1.50GiB client-requested in use in bin.
2020-06-20 18:29:37.045948: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): 	Total Chunks: 9, Chunks in use: 9. 1.44GiB allocated for chunks. 1.44GiB in use in bin. 1.15GiB client-requested in use in bin.
2020-06-20 18:29:37.046057: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-06-20 18:29:37.046161: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 1.21MiB was 1.00MiB, Chunk State:
2020-06-20 18:29:37.046225: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 1048576
2020-06-20 18:29:37.046281: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800000 next 1 of size 1280
2020-06-20 18:29:37.046342: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800500 next 165 of size 256
(... many of these)
2020-06-20 18:29:37.054866: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B800600 next 166 of size 256
2020-06-20 18:29:37.056429: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855B00 next 116 of size 256
2020-06-20 18:29:37.056503: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855C00 next 112 of size 256
2020-06-20 18:29:37.056578: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0B855D00 next 18446744073709551615 of size 697088
2020-06-20 18:29:37.056663: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 4194304
2020-06-20 18:29:37.056724: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BA00000 next 18446744073709551615 of size 4194304
2020-06-20 18:29:37.056810: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 4194304
2020-06-20 18:29:37.056871: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00000 next 113 of size 2048
2020-06-20 18:29:37.056943: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00800 next 130 of size 1024
2020-06-20 18:29:37.057016: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000B0BE00C00 next 131 of size 256
2020-06-20 18:29:37.414634: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE78BFC00 next 614 of size 5811200
2020-06-20 18:29:37.414706: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE7E4A800 next 633 of size 24729600
2020-06-20 18:29:37.414779: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE95E0000 next 628 of size 16384
2020-06-20 18:29:37.414850: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE95E4000 next 583 of size 9437184
2020-06-20 18:29:37.414923: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BE9EE4000 next 698 of size 109051904
2020-06-20 18:29:37.414996: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF06E4000 next 563 of size 105963520
2020-06-20 18:29:37.415070: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6BF2000 next 697 of size 65536
2020-06-20 18:29:37.437637: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6C02000 next 640 of size 9728
2020-06-20 18:29:37.437711: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6C04600 next 704 of size 1267200
2020-06-20 18:29:37.437785: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6D39C00 next 726 of size 6656
2020-06-20 18:29:37.437857: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BF6D3B600 next 569 of size 150994944
2020-06-20 18:29:37.437934: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000BFFD3B600 next 729 of size 9437184
2020-06-20 18:29:37.438008: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0063B600 next 653 of size 9437184
2020-06-20 18:29:37.438084: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F3B600 next 728 of size 224000
2020-06-20 18:29:37.438157: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F72100 next 665 of size 2048
2020-06-20 18:29:37.438229: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C00F72900 next 566 of size 9437184
2020-06-20 18:29:37.438302: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C01872900 next 585 of size 2048
2020-06-20 18:29:37.438375: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C01873100 next 725 of size 9437184
2020-06-20 18:29:37.438449: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C02173100 next 598 of size 2359296
2020-06-20 18:29:37.438529: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C023B3100 next 642 of size 2359296
2020-06-20 18:29:37.438603: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C025F3100 next 675 of size 9437184
2020-06-20 18:29:37.438675: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C02EF3100 next 732 of size 9437184
2020-06-20 18:29:37.438745: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C037F3100 next 646 of size 9437184
2020-06-20 18:29:37.438816: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3100 next 674 of size 2048
2020-06-20 18:29:37.438889: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3900 next 648 of size 1024
2020-06-20 18:29:37.438960: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C040F3D00 next 700 of size 589824
2020-06-20 18:29:37.439033: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04183D00 next 627 of size 512
2020-06-20 18:29:37.439103: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04183F00 next 584 of size 9437184
2020-06-20 18:29:37.439179: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C04A83F00 next 641 of size 9437184
2020-06-20 18:29:37.439255: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05383F00 next 739 of size 9437184
2020-06-20 18:29:37.439331: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05C83F00 next 425 of size 2359296
2020-06-20 18:29:37.439403: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C05EC3F00 next 596 of size 2359296
2020-06-20 18:29:37.439478: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06103F00 next 733 of size 589824
2020-06-20 18:29:37.439550: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06193F00 next 676 of size 589824
2020-06-20 18:29:37.439622: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06223F00 next 734 of size 589824
2020-06-20 18:29:37.439693: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C062B3F00 next 570 of size 256
2020-06-20 18:29:37.439763: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C062B4000 next 632 of size 4718592
2020-06-20 18:29:37.464023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C06734000 next 592 of size 18874368
2020-06-20 18:29:37.464101: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934000 next 611 of size 2048
2020-06-20 18:29:37.464173: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934800 next 735 of size 512
2020-06-20 18:29:37.464244: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934A00 next 679 of size 512
2020-06-20 18:29:37.464317: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07934C00 next 609 of size 4096
2020-06-20 18:29:37.464390: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07935C00 next 604 of size 147456
2020-06-20 18:29:37.464467: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C07959C00 next 736 of size 9437184
2020-06-20 18:29:37.464541: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C08259C00 next 737 of size 256
2020-06-20 18:29:37.464613: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C08259D00 next 742 of size 19200
2020-06-20 18:29:37.464686: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825E800 next 681 of size 256
2020-06-20 18:29:37.464757: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825E900 next 682 of size 256
2020-06-20 18:29:37.464829: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0000000C0825EA00 next 18446744073709551615 of size 196744704
2020-06-20 18:29:37.464911: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:
2020-06-20 18:29:37.464982: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 261 Chunks of size 256 totalling 65.3KiB
2020-06-20 18:29:37.465051: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 51 Chunks of size 512 totalling 25.5KiB
2020-06-20 18:29:37.465119: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 39 Chunks of size 1024 totalling 39.0KiB
2020-06-20 18:29:37.465188: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1280 totalling 1.3KiB
2020-06-20 18:29:37.465256: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1536 totalling 1.5KiB
2020-06-20 18:29:37.465323: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 63 Chunks of size 2048 totalling 126.0KiB
2020-06-20 18:29:37.465392: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 4096 totalling 28.0KiB
2020-06-20 18:29:37.465459: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 10 Chunks of size 6656 totalling 65.0KiB
2020-06-20 18:29:37.465535: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 13 Chunks of size 9728 totalling 123.5KiB
2020-06-20 18:29:37.465603: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 16384 totalling 112.0KiB
2020-06-20 18:29:37.465670: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 19200 totalling 112.5KiB
2020-06-20 18:29:37.465740: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 36864 totalling 144.0KiB
2020-06-20 18:29:37.465810: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 10 Chunks of size 65536 totalling 640.0KiB
2020-06-20 18:29:37.465883: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 98304 totalling 96.0KiB
2020-06-20 18:29:37.465951: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 13 Chunks of size 147456 totalling 1.83MiB
2020-06-20 18:29:37.466020: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 196608 totalling 192.0KiB
2020-06-20 18:29:37.466088: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 224000 totalling 1.92MiB
2020-06-20 18:29:37.466157: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 393216 totalling 384.0KiB
2020-06-20 18:29:37.466225: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 524288 totalling 2.00MiB
2020-06-20 18:29:37.466293: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 28 Chunks of size 589824 totalling 15.75MiB
2020-06-20 18:29:37.489685: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 697088 totalling 680.8KiB
2020-06-20 18:29:37.489755: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 983040 totalling 960.0KiB
2020-06-20 18:29:37.489828: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 1267200 totalling 9.67MiB
2020-06-20 18:29:37.489901: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 24 Chunks of size 2359296 totalling 54.00MiB
2020-06-20 18:29:37.489972: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 3562496 totalling 3.40MiB
2020-06-20 18:29:37.490042: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 4194304 totalling 4.00MiB
2020-06-20 18:29:37.490111: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 4718592 totalling 40.50MiB
2020-06-20 18:29:37.490180: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 5811200 totalling 49.88MiB
2020-06-20 18:29:37.490249: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 8388608 totalling 8.00MiB
2020-06-20 18:29:37.490318: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 62 Chunks of size 9437184 totalling 558.00MiB
2020-06-20 18:29:37.490390: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 18874368 totalling 144.00MiB
2020-06-20 18:29:37.490462: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 19964160 totalling 19.04MiB
2020-06-20 18:29:37.490533: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 24729600 totalling 188.67MiB
2020-06-20 18:29:37.490603: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29715712 totalling 28.34MiB
2020-06-20 18:29:37.490673: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 33554432 totalling 32.00MiB
2020-06-20 18:29:37.490743: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 9 Chunks of size 105963520 totalling 909.49MiB
2020-06-20 18:29:37.490814: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 109051904 totalling 624.00MiB
2020-06-20 18:29:37.490881: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 150994944 totalling 720.00MiB
2020-06-20 18:29:37.490950: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 155083520 totalling 147.90MiB
2020-06-20 18:29:37.491022: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 176162048 totalling 168.00MiB
2020-06-20 18:29:37.491093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 196744704 totalling 187.63MiB
2020-06-20 18:29:37.491166: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 267675648 totalling 255.28MiB
2020-06-20 18:29:37.491237: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 4.08GiB
2020-06-20 18:29:37.491309: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 4379901952 memory_limit_: 7053531546 available bytes: 2673629594 curr_region_allocation_bytes_: 8589934592
2020-06-20 18:29:37.491430: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                  7053531546
InUse:                  4379901952
MaxInUse:               6784897280
NumAllocs:                    4587
MaxAllocSize:           2406266368

2020-06-20 18:29:37.491597: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ******xx******************************************************************************************xx
2020-06-20 18:29:37.491705: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[5,5,99,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
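
The figures in the allocator dump above can be cross-checked with a little arithmetic. This is a minimal sketch, not part of Faceswap or TensorFlow: it only reuses values copied from the log (the tensor shape, `Limit` and `InUse`) and assumes 4 bytes per `float` element. The variable names are made up for illustration.

```python
# Values copied from the log above; names are hypothetical.
shape = (5, 5, 99, 128)      # tensor that failed to allocate
bytes_per_float = 4          # assumption: "type float" means 32-bit

# Size of the failed request in bytes.
elements = 1
for dim in shape:
    elements *= dim
tensor_bytes = elements * bytes_per_float

limit = 7053531546           # "Limit" from the Stats block
in_use = 4379901952          # "InUse" from the Stats block
headroom = limit - in_use    # should match "available bytes" in the log

print(f"failed request: {tensor_bytes} bytes ({tensor_bytes / 2**20:.2f} MiB)")
print(f"in use:         {in_use / 2**30:.2f} GiB")
print(f"headroom:       {headroom} bytes ({headroom / 2**30:.2f} GiB)")
```

The failed request works out to about 1.21 MiB, which lines up with the "Bin for 1.21MiB" line, and the headroom equals the `available bytes` value in the log. So the allocator's own limit was not reached when the OOM was raised. One possible reading (an assumption, the log does not confirm it) is that something else was holding VRAM at the time.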

Posted: **Sat Jun 20, 2020 5:34 pm**

bryanlyon wrote: ↑Sat Jun 20, 2020 5:29 pm
Those errors could just mean that they weren't able to load the data that they have on the card, or could be that the drivers are messing up. My advice is to (in this order) install the latest drivers, reboot, and close all applications to try again.

This is with a fresh install of the drivers, today.

I did choose "Studio" drivers, rather than the "normal" Geforce drivers.

Posted: **Sat Jun 20, 2020 6:03 pm**

Now I'm getting the following:

2020-06-20 19:00:06.112961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-06-20 19:00:06.300959: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.301184: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.305818: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306084: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306303: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.306439: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-06-20 19:00:06.311920: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-06-20 19:00:07.197532: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-06-20 19:00:07.198153: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
06/20/2020 19:00:07 ERROR    Caught exception in thread: '_training_0'
06/20/2020 19:00:08 ERROR    Got Exception on main handler:
Traceback (most recent call last):
File "C:\programs\faceswap\lib\cli\launcher.py", line 155, in execute_script
process.process()
File "C:\programs\faceswap\scripts\train.py", line 161, in process
self._end_thread(thread, err)
File "C:\programs\faceswap\scripts\train.py", line 201, in _end_thread
thread.join()
File "C:\programs\faceswap\lib\multithreading.py", line 121, in join
raise thread.err[1].with_traceback(thread.err[2])
File "C:\programs\faceswap\lib\multithreading.py", line 37, in run
self._target(*self._args, **self._kwargs)
File "C:\programs\faceswap\scripts\train.py", line 226, in _training
raise err
File "C:\programs\faceswap\scripts\train.py", line 216, in _training
self._run_training_cycle(model, trainer)
File "C:\programs\faceswap\scripts\train.py", line 305, in _run_training_cycle
trainer.train_one_step(viewer, timelapse)
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 316, in train_one_step
raise err
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 283, in train_one_step
loss[side] = batcher.train_one_batch()
File "C:\programs\faceswap\plugins\train\trainer\_base.py", line 424, in train_one_batch
loss = self._model.predictors[self._side].train_on_batch(model_inputs, model_targets)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "C:\Users\calipheron\MiniConda3\envs\264882\lib\site-packages\tensorflow_core\python\client\session.py", line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node encoder/conv_64_0_conv2d/convolution}}]]
[[loss/mul/_493]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node encoder/conv_64_0_conv2d/convolution}}]]

Posted: **Sat Jun 20, 2020 6:31 pm**

The CUDA errors in that last batch of messages tipped me off to the possible problem: Faceswap might be configured for Nvidia, but I was still using the old conda environment from the "AMD install".

I completely removed Anaconda / Miniconda, Python and Faceswap. Made sure no conda files were left in my AppData folders. Rebooted. Fresh install of Faceswap.

First try with the same settings as before, boom. No errors.
Training, Dlight model, BS of 4, using 7.7GB of 8GB VRAM.
EGs/sec of 9.8 - does this seem good?
No memory saving options at all. Should I bother with Allow Growth any more?

Anyway, for anyone moving from AMD to NVIDIA GPUs as I have - I strongly recommend a completely fresh install of Faceswap, conda, Python, all of it!

Posted: **Sat Jun 20, 2020 6:44 pm**

Scratch that. Another crash.
Will try the latest gaming drivers instead.

Edit: Fresh install of the Game Ready drivers (v446.14), and after 40 minutes of training, no crashes.
So for Faceswap on a 2060 Super at least, I would say that the Studio drivers are not recommended.

Edit: 8 hours of training with no problems, then:

2020-06-21 04:55:31.940777: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-06-21 04:55:31.940897: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

Posted: **Sun Jun 21, 2020 10:03 am**

This looks like an upstream issue, to be honest ;(

https://github.com/tensorflow/tensorflow/issues/33536

Posted: **Sun Jun 21, 2020 10:20 am**

torzdf wrote: ↑Sun Jun 21, 2020 10:03 am
This looks like an upstream issue, to be honest ;(

https://github.com/tensorflow/tensorflow/issues/33536

Thank you for replying, Torzdf! I take it this means we have to wait for TensorFlow itself to receive a fix?

Well, so far I have found that, unlike with my RX 580, I basically have to run Faceswap with nothing else open, otherwise it crashes.

Much faster iterations now, but I'm already missing the apparently much more graceful memory handling of PlaidML (??) with AMD hardware. With that I could have Firefox and a game open as well (with reduced performance, of course) and it never crashed while training.

For clarity, my setup:
Ryzen 3700X, 32GB RAM, Asus 2060 Super 8GB, Win 10 Pro
Nvidia Game Ready Driver 446.14, conda 4.8.3, Python 3.7.7, Faceswap reports it is up to date as of now
Training with DLight, extended mask, no memory mitigation options / growth
BS of 4
Absolutely maxed out VRAM, but it trains without crashing as long as I run nothing else.
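
Since the crashes above seem tied to other applications taking VRAM, it may help to check how much is free before starting a training run. Below is a rough sketch that parses the CSV output of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`. The helper names and the sample string are hypothetical, and the 7700 MiB figure is only an approximation of the 7.7GB mentioned earlier.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def query_gpu_memory():
    """Run nvidia-smi and return its raw CSV text (needs an Nvidia driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


def free_mib(csv_text):
    """Return free VRAM in MiB for each GPU listed in the CSV text."""
    free = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        free.append(total - used)
    return free


# Hypothetical sample: one 8192 MiB card with 7700 MiB already in use.
sample = "7700, 8192\n"
print(free_mib(sample))
```

On a machine with the driver installed, `free_mib(query_gpu_memory())` would give the live numbers instead of the sample.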

Edit:
I have found I HAVE to have "allow growth" active in Extract options, otherwise Faceswap fails any extract operation.
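
For reference, TensorFlow itself reads a `TF_FORCE_GPU_ALLOW_GROWTH` environment variable that asks it to grow GPU memory on demand instead of reserving it all up front. The sketch below only sets that variable from Python; the function name is made up, and whether Faceswap's "allow growth" option does the same thing internally is an assumption, not something confirmed in this thread.

```python
import os


def request_gpu_memory_growth(env=os.environ):
    """Ask TensorFlow to allocate GPU memory on demand.

    Must run before TensorFlow initialises the GPU, so call it
    before importing tensorflow in the same process.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


request_gpu_memory_growth()
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

Setting the variable in the shell before launching has the same effect and avoids any ordering issues with imports.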

Edit:
For anyone interested, I am now training with DFL-SAE, batch size 4, and that has worked reliably. EG/s of about 7-8.
