I'd love to see a 4k GAN
please
dreams
Totally. MidJourney + Topaz yielded some pretty cool results in the meantime.
Thank you for this Jeff, amazing as always
Great video, thank you! In case someone sees "UserWarning: semaphore_tracker": This issue appeared since the PyTorch 1.9 release on 15th June and can be fixed by downgrading to the previous version:
!pip install torch==1.8.1 torchvision==0.9.1
Downgrading the PyTorch version worked for me as well! I would also like to add that I used a completely new colab notebook. Following from a brand new notebook, I first !pip installed torch==1.8.1 torchvision==0.9.1, then git cloned the repo, and finally !pip installed ninja. From there I was able to successfully run the train.py file.
Sir, I love you so much.
Thank you so much for tracking this down! I've incorporated the above line into the repo.
Thanks so much! This is going to be exceptionally helpful in implementing save states for the first time in my cyclegan!
Thanks for these Gans videos
You are most welcome, GANs are fun.
Thank you for updating us...
You are most welcome!
This is amazing to hear and see to be honest
This feels like a really useful resource.
Much appreciate it Dr Heaton
Very nice video; thanks for the detailed explanation.
Great video! Thanks a lot!
another great video. thanks.
Thanks!
I want to thank you for getting me into actually attempting this. It's been quite the learning experience. But it seems every time I have a hurrah moment it's immediately followed by a slow disappointing sigh. I've tried using anaconda environments on Windows 10 to get this going but that ended with: "RuntimeError: MALFORMED INPUT: lanes dont match".
Colab gives me: "UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown". As far as my very limited understanding goes I'm assuming this means that the workers aren't working right :P Might be something to do with only having one CPU core per runtime?
I have experienced similar issues. I have a decent GPU; the code runs in Colab, but I can't get it to run on my PC. I even tried installing Linux.
Great video
I got these errors on the training step on both colab and PC, anybody know how to fix it?
UserWarning: conv2d_gradfix not supported on PyTorch 1.10.0+cu111. Falling back to torch.nn.functional.conv2d().
warnings.warn(f'conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d().')
RuntimeError: derivative for aten::grid_sampler_2d_backward is not implemented
im getting this same error. Does anyone know how to fix?
Hi Jeff, this video was great thank you for creating it! Have you tried Projected GANS? It's much, much faster. I'd be interested to hear what you think. Would love it if you made a video showing how you can use that but still save the data to your Gdrive like you have here. Thanks for taking the time to share your knowledge
Now, this is very useful, thank you! Do you share your own GANs that you’ve trained on fish, Minecraft and Xmas trees?
Yes I do! They are all here, the ones starting with "pretrained" github.com/jeffheaton?tab=repositories
@@HeatonResearch Now, I wonder if it would be possible to use your trained models with this project: colab.research.google.com/github/orpatashnik/StyleCLIP/blob/main/notebooks/StyleCLIP_global.ipynb (not sure what to replace there besides 'ffhq', though; maybe you can help?). It would be cool to see if it can generate, for example, decent-looking Xmas tree pictures from your model, where you usually get only messy-looking pictures. Or if it can generate a fish based on a description of its color and shape.
This was a very informative video and I thank you for sharing your knowledge of this. I had this running on the colab notebook very well last month, but once I renewed my subscription and tried again today I got an error about how many workers colab is using, as follows :
Loading training set...
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
Would you happen to know a fix for this, because the training does seem to freeze and not go anywhere on initial startup. I've waited ~30 minutes.
Edit: nvm it just took very long lol, thanks again!
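A note for others who hit this: the message above is a warning, not an error; the first metrics evaluation (FID) is simply very slow on Colab's two-core machines. If I read train.py in the stylegan2-ada-pytorch repo correctly, it also exposes a --workers override and accepts --metrics=none to skip the slow FID pass entirely; the paths below are illustrative only:
!python train.py --outdir=/content/experiments --data=/content/dataset --workers=2 --metrics=none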
I've been training for a few days. What should I do if the FID value increases little by little without decreasing?
Thanks so much for this tutorial. There are really few out there. But I still have some doubts I hope you can reply to:
- How many images did you feed it with? Is there an optimal number? Like, let's say, 200, 500, 1000?
- Where can I VISUALLY see how training is progressing? Do the fakes_init or fakes000000 update at some point? Do I get a visual collage only if I make a resume point?
- Most importantly, HOW can I make a resume point? How do you come up with the network file; is it from the last training? Where can I find it? What about the resume folder? Does the SNAP number mean that after tick 10 it will make a resume point? Will this create the fakes000100.png file?
- WHEN is training considered terminated? Is it based on how many pics you feed the dataset? More pics = longer training, or simple/similar pics = quick training?
- After training is finished, does this mean I will get a trained .pkl file and can generate GANs from it locally on my computer?
- In the "Perform initial training" section there is a second script to run, "...-- snap 25 -- resume...". What is that? You don't mention it in the video.
It would be great if you could manage to make a tutorial from start to end, including resuming, pauses, and issues on the network (cutting out waiting times, obviously), in order to make the process clearer for newbies like me. Thanks so much. You're being incredibly helpful.
I like at least a couple thousand images. Many of the NVIDIA labs projects use 50K+. The individual files do not update, but further fakesxxxxxx files will be generated; that is how I track it. You can resume (critical!); I do that in the video: just restart from your last checkpoint. Just make sure you are checkpointing often enough that one happens BEFORE Colab shuts you down. Training is "done" when you reach the specified number of kimg; however, you can stop as soon as the images look good, which is what I generally do. I will think about a start-to-end video! Good idea.
@@HeatonResearch thanks Jeff, so very helpful! Eternally grateful for this.
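For readers who want the exact shape of that resume command: --outdir, --data, --snap, and --resume are the actual train.py options in the stylegan2-ada-pytorch repo, but the paths and the run-directory name below are only illustrative.
!python train.py --outdir=/content/drive/MyDrive/data/gan/experiments \
  --data=/content/drive/MyDrive/data/gan/dataset \
  --snap=10 \
  --resume=/content/drive/MyDrive/data/gan/experiments/00000-dataset-auto1/network-snapshot-000100.pkl
The snapshot filename encodes the kimg count, so picking the highest-numbered network-snapshot-*.pkl resumes from the latest checkpoint.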
Thanks Jeff, great video. Quick question: training stops after the first tick (training duration of 4 mins on a Google Colab T4). Any ideas on what might be the problem? No metrics are displayed and no errors either...
Hi Jeff, love your videos and love the way you teach. A question: the IMAGE_PATH variable is set to the "fish" directory - is that correct? Shouldn't it point to the "circuit" directory, since that is the one we are working on? Or am I missing something?
It would point to whatever directory has your source images. Fish and circuits are both image sets that I've used in the past.
Amazing! Thanks so much! First time exploring training GANs, and you made it so clear and easy to understand :) One question though: I trained it on a dataset of black glyphs on a white background (images are 256x256 RGB). I was expecting the generated images to look somewhat like the glyphs, but they were shapeless RGB blobs. Is there anything I can do to generate images more similar to the training dataset? Thanks!
Thanks for this video Jeff! Can i ask you a question? When you run this training - I'm not really clear on what pkl file is being used initially as a base? Or are you training this model from absolute scratch? If you're training from absolute scratch is there a way to start with an existing pkl i'd like to use and where would I input this? (say ffhq model or wikiart if I were to do something more organic). Thanks so much!
If you do not specify a pkl file with --resume, a neural network with a random weight initialization is used. So yes, in this case I was going from absolute scratch.
@@HeatonResearch Thanks so much - makes sense! Can I ask you one more question? Once I started the training, it says it's training for 25000 kimg. As I understand it, that's 25000 images. What does that mean exactly? As you mentioned, your dataset was around 3000 images or so, and mine is around 2000. What does 25000 mean - where is that coming from, or will the model turn my dataset's number of images into 25000 internally?
So, it randomly samples from your images. Plus, due to ADA, it is also creating new images from your training set. Your training data could be only a couple thousand images, yet with ADA there may be millions of images in total. The kimg is how many real images the discriminator has seen (including augmented images). Also, it's K, so 100 kimg is 100 * 1000 images. Normally you think in epochs (the number of complete cycles over the training set), but with augmentation an epoch is either meaningless or arbitrary.
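To make that arithmetic concrete, a tiny sketch (the 2,000-image figure is taken from the question above):
total_kimg = 25000                    # default training duration
dataset_size = 2000                   # real training images
images_shown = total_kimg * 1000      # 25,000,000 images seen by the discriminator
print(images_shown / dataset_size)    # ~12,500 passes over the data, which is why 'epochs' stop being a useful unit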
@@HeatonResearch Jeff - Thank you! When generating images after the training is done, what is the max value I can input for the seed variable? I assume the minimum is 0? Or can it be a negative number as well?
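For what it's worth, generate.py in the repo feeds each seed into np.random.RandomState(seed), which only accepts integers from 0 up to 2**32 - 1, so negative seeds will raise an error. A typical call looks like this (the network path is hypothetical):
!python generate.py --outdir=/content/out --trunc=0.7 --seeds=0-31 --network=/content/drive/MyDrive/data/gan/experiments/00000-dataset-auto1/network-snapshot-000100.pkl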
I'm new to the whole StyleGAN topic and was super happy to find such a well-prepared notebook and tutorial. Unfortunately, after I perform the initial training and the first tick starts to run, it crashes with the following warnings. Any idea how to fix this?
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
len(cache))
I'm also getting this message, no idea how to fix it :/
Incredibly useful video, thanks for real. I just have a question: when using the 'resume' snippet, we just call 'train.py' again. This got me thinking about whether the resume works exactly from where the last network snapshot pickle file stopped, or whether it kind of makes a new training instance on top of it. Sort of like building another layer, instead of continuing the construction?
@Jeff Heaton Would you mind showing how to run a jupyter notebook like this on AWS/Google Compute Engine if you want to move away from Colab?
Working on a spot instance tutorial. I've been using spot for a while now; it is surprisingly stable and 1/3 the cost.
Can you do the same tutorial for StyleGAN3?
Here you have my like, good man!.. How many images did you use for your chips & boards dataset?
Thanks for the like! It was small, only around 2.5k.
Hi Jeff, love this video, but I'm having an issue when I try to pair the notebook with my dataset. I'm getting a message that reads 'No such file or directory' when I try to run the ls command, and the link itself doesn't work. Any idea why this is?
Hi there Jeff. So I've got this running, but it only seems to run for about five minutes and do a single tick. The log looks like this:
tick 0 kimg 0.0 time 50s sec/tick 13.6 sec/kimg 3410.17 maintenance 36.6 cpumem 5.38 gpumem 11.32 augment 0.000
Evaluating metrics...
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
Admittedly I'm only trying this on 25 pictures - could that be the problem?
Hi! Thanks a lot for your videos, so helpful. I have only one problem that's driving me crazy. I've imported my personal folders from Drive; when I modify the path for resuming the training, another folder is created in "experiments" with a linearly growing number, but the .pkl file keeps the same name every time, so I believe it leads to a sort of loop in the training process... What would be the most probable error? Hope I have explained the problem clearly; thank you a lot again.
We want 4k Now😍
Hey Jeff! Thank you so much for this video. I'm trying to train NVIDIA StyleGAN3 on my own dataset, but I would like to add pictures as input. Can that be done with StyleGAN? I'm trying to generate a picture of a flower using two other flowers (what the result of this fusion would be). Of course, I have a dataset with the two parent flowers & the child. Do you think there is any way to do this with this tool?
Hey, is it possible to add a description of your specific image requirements? First I got an error that they should all be square, so I cropped all of them; now I get Error: Image width/height after scale and crop are required to be power-of-two
0% 0/250 [00:00
Hey I'm getting "derivative for aten::grid_sampler_2d_backward is not implemented" error, do you know how to fix it?
Same problem, because Colab does not allow using torch version 0.8 anymore.
Hi, why do I get this error: AttributeError: 'Logger' object has no attribute 'isatty', when I perform initial training. And I don't see any pkl files on my drive
Hey Jeff, would love to try this but I can't find the link to your notebook. Am I missing something?
Thank you so much for this.
Please, how do you know when to stop the training process?
Were you able to figure this out?
@@petern1575 no, i just stopped it after some days
Hi! This is an interesting video! I tried to run your notebook, but I get this error while converting images: "Image width/height after scale and crop are required to be power-of-two". So I looked at the .py file, and the error comes up because: width != 2 ** int(np.floor(np.log2(width))). This equation is so weird... Can you give us the image shape of your dataset so I can resize my images? Thanks a lot!
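For anyone else stuck here: the check just means every image must be square with a side that is exactly a power of two (256, 512, 1024, and so on). Below is a minimal PIL sketch that center-crops and resizes a folder of images; the raw/ and prepared/ folders and the 256 target are assumptions, not part of the notebook.
from PIL import Image
import glob, os

os.makedirs('prepared', exist_ok=True)
for path in glob.glob('raw/*.jpg'):
    img = Image.open(path).convert('RGB')
    side = min(img.size)                                   # largest centered square
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((256, 256), Image.LANCZOS)            # one power-of-two size for the whole dataset
    img.save(os.path.join('prepared', os.path.basename(path)))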
I have kept it training for over a month now; sometimes I get a P100, but mostly I get a V100. How much longer will I have to keep training it? Can @Jeff Heaton please tell me a minimum of how many days and a maximum of how many days?
Can the same be applied to StyleGAN3? Because the version of PyTorch used by StyleGAN2 is no longer supported by Colab…
Hello, when I launch the initial training it gives me a problem when it tries to construct a network: Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
/content/stylegan2-ada-pytorch/torch_utils/ops/conv2d_gradfix.py:55: UserWarning: conv2d_gradfix not supported on PyTorch 1.10.0+cu111. Falling back to torch.nn.functional.conv2d().
warnings.warn(f'conv2d_gradfix not supported on PyTorch {torch.__version__}. Falling back to torch.nn.functional.conv2d().')
Why is this incompatible? I am using the same stuff as you.
Please Jeff, can you tell me how to lower the learning rate from default 0.002 to 0.001 in order to get a more stable lossless training?
Hi Mr. Jeff, I want to ask: I followed your video step by step. However, the command line stops at the initial fake images, and it has not updated so far (I didn't get fakes_00100 or anything like that). What should I do? Thank you!
I'm stuck. I hit copy for the auth code... it copies from the Google sign-in page,
and then it won't paste into the text field... just will not paste...
How?
I've looked online but found no solution. Is it a keyboard shortcut? It's a keyboard shortcut, panic over...
It will only paste in properly if you put it in Notepad first when it was copied from a docx...
How many days of training are needed if a Tesla V100 GPU is used?
I found that using a third neural network, with the objective of keeping the training balanced between the generator and the discriminator by augmenting both generated and real images, works very well. Setups that would otherwise destabilize very quickly are very stable now. The "augmenter", the neural network that keeps the game stable, only has 1 layer with a kernel size of 3 or 5. Its objective is to keep the loss of the generator at around 0.65.
Can you post a DL link to the circuit images for those of us who can't pull from the Flickr API?
Hi! Finally it works with 256x256! Thanks to this video I can do my work! Does anyone know how to save and load the model's weights after training, so I can reuse the model afterwards? Thanks!
Is there some code for running StyleGAN-ADA with resolutions other than powers of two? It doesn't matter if it's on TensorFlow or PyTorch. I saw there are such options for StyleGAN2, but I haven't found anything working for StyleGAN-ADA. I have a dataset at 32x48, and it feels like a huge waste to pad those images to 64x64. Also, have you used transfer learning for StyleGAN-ADA, and if you did, do you have some tips on it? Like which model is best to transfer-learn from? In the paper they said that similarity between datasets doesn't matter that much with transfer learning, but the diversity of the datasets does.
That is a much deeper dive, but something I am investigating. I would love to do a GAN at a normal video aspect ratio, such as HD or 4K. It would require rearchitecting the core StyleGAN neural network, and likely at least one of their CUDA functions.
I've done the initial training, but I can't find the .pkl files to use next (they aren't in the experiments folder). Am I doing something wrong?
Sir, after training I got blurred images, even though my image dataset is at 512 resolution. Please suggest why the generated images are blurred.
Hi, for the image size, can I have a width and height that are not the same? For example, 256 x 512.
So I followed this (I'm using the free version of Colab) and my session crashed because it ran out of GPU memory. Any tips on how to prevent this?
I have run the notebook for days, but there is no generation of results; moreover, it's now showing some warnings and I am not able to continue training... Please answer, anyone.
Thanks a lot for the video; it has clarified a lot of doubts for me, especially being very new to this GAN world. I have been trying to perform my initial training with your notebook. The first steps run perfectly, and I also have another notebook to clean, verify, crop, and slice my images, but I'm having problems with the initial training. It starts the first tick (tick 0 kimg 0.0 time 1m 32s sec/tick 17.2 sec/kimg 4308.67 maintenance 74.5 cpumem 4.74 gpumem 11.32 augment 0.000), but then, when it starts to evaluate the metrics, I get this message that I'm not really sure how to fix, or where I can change the definition of the workers:
Evaluating metrics...
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
len(cache))
I know it is very hard to review every doubt that users have, but maybe if you can give me a line where I can start, or what I can do in these cases, I'd thank you a lot. :)
Were you able to fix this? I found people online suggesting that changing the batch size would help, but that didn't solve it for me.
Please, how are the predictions made after the training?
Dude I need to create transformer neural network based chatbot for project submission in 15 days. What should I do? Learn or use to make my project?🙏🙏🙏
Probably learn transformers.
Is it possible to train a GAN based on another, already trained one? Like, if I take the one for faces (there are a few publicly available, as far as I know) and then add more face photos, or let's say some fantasy creatures, would that work?
Yes, transfer learning often helps. This is also how you resume training. Use --resume
@@HeatonResearch cool! But what if the new image set is very very different from the first one? Like, you trained it on fish, and then re-trained it on people faces. Will it make any sense, make it better or worse in any way?
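For reference on the --resume flag discussed above: besides a path to one of your own .pkl snapshots, train.py in the stylegan2-ada-pytorch repo also accepts NVIDIA's pretrained presets (as I recall them from the repo: ffhq256, ffhq512, ffhq1024, celebahq256, lsundog256), which is the usual way to start this kind of transfer learning. The dataset path below is hypothetical:
!python train.py --outdir=/content/experiments --data=/content/dataset --resume=ffhq256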
It is 2,500 USD/month for a full 1 x V100 on Google or AWS :) I have run for 4,000 ticks now and FID full 50K is still somewhere around 122... :( Custom dataset, 1M furniture images.
Jeff I've occasionally seen P100 in the free colab instances.
Interesting, thanks! I am sure they are all slowly marching up. Can't wait to spot an a100 in the wild, on pro, wonder how long that will take?
Generating gan link not working...
How do you generate the GAN videos? Do you have a notebook?
What about the pytorch version?
Is it already installed in collab?
No, Colab does not have this installed by default. It does take some setup, and PyTorch does sometimes introduce breaking changes for Stylegan.
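The setup that worked at the time of these comments, per the pinned fix near the top of this thread (the PyTorch pins may need updating for newer Colab images):
!pip install torch==1.8.1 torchvision==0.9.1
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
!pip install ninja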
Thank you for making a great tutorial series on StyleGAN2 ADA.
But I have a question about resuming training: I want to know how to make a resumed run start from the latest kimg. Can I set it with --kimg?
For example:
1. I have done my 10 snapshots and get network-xxxxx-0040.pkl
2. I resume on network-xxxxx-0040.pkl
3. But it starts to train at 0 kimg
So I want to know how to set it to start at 40 kimg.
Thanks to anyone who answers this. (Sorry for my bad English :( )
I keep resuming but it seems like it starts over but at the same time it looks slightly better. Is this normal?
Please help, I am using Colab but the model won't load
Training options:
{
"num_gpus": 1,
"image_snapshot_ticks": 10,
"network_snapshot_ticks": 10,
"metrics": [
"fid50k_full"
],
"random_seed": 0,
"training_set_kwargs": {
"class_name": "training.dataset.ImageFolderDataset",
"path": "/content/drive/MyDrive/data/gan/images/",
"use_labels": false,
"max_size": 1920,
"xflip": true,
"resolution": 256
},
"data_loader_kwargs": {
"pin_memory": true,
"num_workers": 1,
"prefetch_factor": 2
},
"G_kwargs": {
"class_name": "training.networks.Generator",
"z_dim": 512,
"w_dim": 512,
"mapping_kwargs": {
"num_layers": 2
},
"synthesis_kwargs": {
"channel_base": 16384,
"channel_max": 512,
"num_fp16_res": 4,
"conv_clamp": 256
}
},
"D_kwargs": {
"class_name": "training.networks.Discriminator",
"block_kwargs": {},
"mapping_kwargs": {},
"epilogue_kwargs": {
"mbstd_group_size": 4
},
"channel_base": 16384,
"channel_max": 512,
"num_fp16_res": 4,
"conv_clamp": 256
},
"G_opt_kwargs": {
"class_name": "torch.optim.Adam",
"lr": 0.0025,
"betas": [
0,
0.99
],
"eps": 1e-08
},
"D_opt_kwargs": {
"class_name": "torch.optim.Adam",
"lr": 0.0025,
"betas": [
0,
0.99
],
"eps": 1e-08
},
"loss_kwargs": {
"class_name": "training.loss.StyleGAN2Loss",
"r1_gamma": 0.8192
},
"total_kimg": 25000,
"batch_size": 16,
"batch_gpu": 16,
"ema_kimg": 5.0,
"ema_rampup": 0.05,
"ada_target": 0.6,
"augment_kwargs": {
"class_name": "training.augment.AugmentPipe",
"xflip": 1,
"rotate90": 1,
"xint": 1,
"scale": 1,
"rotate": 1,
"aniso": 1,
"xfrac": 1,
"brightness": 1,
"contrast": 1,
"lumaflip": 1,
"hue": 1,
"saturation": 1
},
"run_dir": "/content/drive/MyDrive/data/gan/experiments/00004--mirror-auto1-ada"
}
Output directory: /content/drive/MyDrive/data/gan/experiments/00004--mirror-auto1-ada
Training data: /content/drive/MyDrive/data/gan/images/
Training duration: 25000 kimg
Number of GPUs: 1
Number of images: 1920
Image resolution: 256
Conditional model: False
Dataset x-flips: True
Creating output directory...
Launching processes...
Loading training set...
Num images: 3840
Image shape: [3, 256, 256]
Label shape: [0]
Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Generator Parameters Buffers Output shape Datatype
--- --- --- --- ---
mapping.fc0 262656 - [16, 512] float32
mapping.fc1 262656 - [16, 512] float32
mapping - 512 [16, 14, 512] float32
synthesis.b4.conv1 2622465 32 [16, 512, 4, 4] float32
synthesis.b4.torgb 264195 - [16, 3, 4, 4] float32
synthesis.b4:0 8192 16 [16, 512, 4, 4] float32
synthesis.b4:1 - - [16, 512, 4, 4] float32
synthesis.b8.conv0 2622465 80 [16, 512, 8, 8] float32
synthesis.b8.conv1 2622465 80 [16, 512, 8, 8] float32
synthesis.b8.torgb 264195 - [16, 3, 8, 8] float32
synthesis.b8:0 - 16 [16, 512, 8, 8] float32
synthesis.b8:1 - - [16, 512, 8, 8] float32
synthesis.b16.conv0 2622465 272 [16, 512, 16, 16] float32
synthesis.b16.conv1 2622465 272 [16, 512, 16, 16] float32
synthesis.b16.torgb 264195 - [16, 3, 16, 16] float32
synthesis.b16:0 - 16 [16, 512, 16, 16] float32
synthesis.b16:1 - - [16, 512, 16, 16] float32
synthesis.b32.conv0 2622465 1040 [16, 512, 32, 32] float16
synthesis.b32.conv1 2622465 1040 [16, 512, 32, 32] float16
synthesis.b32.torgb 264195 - [16, 3, 32, 32] float16
synthesis.b32:0 - 16 [16, 512, 32, 32] float16
synthesis.b32:1 - - [16, 512, 32, 32] float32
synthesis.b64.conv0 1442561 4112 [16, 256, 64, 64] float16
synthesis.b64.conv1 721409 4112 [16, 256, 64, 64] float16
synthesis.b64.torgb 132099 - [16, 3, 64, 64] float16
synthesis.b64:0 - 16 [16, 256, 64, 64] float16
synthesis.b64:1 - - [16, 256, 64, 64] float32
synthesis.b128.conv0 426369 16400 [16, 128, 128, 128] float16
synthesis.b128.conv1 213249 16400 [16, 128, 128, 128] float16
synthesis.b128.torgb 66051 - [16, 3, 128, 128] float16
synthesis.b128:0 - 16 [16, 128, 128, 128] float16
synthesis.b128:1 - - [16, 128, 128, 128] float32
synthesis.b256.conv0 139457 65552 [16, 64, 256, 256] float16
synthesis.b256.conv1 69761 65552 [16, 64, 256, 256] float16
synthesis.b256.torgb 33027 - [16, 3, 256, 256] float16
synthesis.b256:0 - 16 [16, 64, 256, 256] float16
synthesis.b256:1 - - [16, 64, 256, 256] float32
--- --- --- --- ---
Total 23191522 175568 - -
Discriminator Parameters Buffers Output shape Datatype
--- --- --- --- ---
b256.fromrgb 256 16 [16, 64, 256, 256] float16
b256.skip 8192 16 [16, 128, 128, 128] float16
b256.conv0 36928 16 [16, 64, 256, 256] float16
b256.conv1 73856 16 [16, 128, 128, 128] float16
b256 - 16 [16, 128, 128, 128] float16
b128.skip 32768 16 [16, 256, 64, 64] float16
b128.conv0 147584 16 [16, 128, 128, 128] float16
b128.conv1 295168 16 [16, 256, 64, 64] float16
b128 - 16 [16, 256, 64, 64] float16
b64.skip 131072 16 [16, 512, 32, 32] float16
b64.conv0 590080 16 [16, 256, 64, 64] float16
b64.conv1 1180160 16 [16, 512, 32, 32] float16
b64 - 16 [16, 512, 32, 32] float16
b32.skip 262144 16 [16, 512, 16, 16] float16
b32.conv0 2359808 16 [16, 512, 32, 32] float16
b32.conv1 2359808 16 [16, 512, 16, 16] float16
b32 - 16 [16, 512, 16, 16] float16
b16.skip 262144 16 [16, 512, 8, 8] float32
b16.conv0 2359808 16 [16, 512, 16, 16] float32
b16.conv1 2359808 16 [16, 512, 8, 8] float32
b16 - 16 [16, 512, 8, 8] float32
b8.skip 262144 16 [16, 512, 4, 4] float32
b8.conv0 2359808 16 [16, 512, 8, 8] float32
b8.conv1 2359808 16 [16, 512, 4, 4] float32
b8 - 16 [16, 512, 4, 4] float32
b4.mbstd - - [16, 513, 4, 4] float32
b4.conv 2364416 16 [16, 512, 4, 4] float32
b4.fc 4194816 - [16, 512] float32
b4.out 513 - [16, 1] float32
--- --- --- --- ---
Total 24001089 416 - -
Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
2021-07-07 07:40:41.581569: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Training for 25000 kimg...
tick 0 kimg 0.0 time 38s sec/tick 7.0 sec/kimg 435.39 maintenance 30.7 cpumem 4.21 gpumem 12.63 augment 0.000
Evaluating metrics...
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
len(cache))
I had the same problem before; then I realized it is a PyTorch version difference. You have to use PyTorch 1.7.1.
Hi, I think it would be possible to get the server time and save a checkpoint one hour before shutdown.
That would require a change to StyleGAN, since it checkpoints based on number of steps, rather than anything related to time. But agree that would be optimal.
Sir, great tutorial, but I'm having an issue with an image; the error is:
Inconsistant color format: /content/drive/MyDrive/data/gan/images/test/1.png
Probably it is a grayscale image.
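One way to find and fix such images before converting: a minimal PIL sketch (the folder path is only an example):
from PIL import Image
import glob

for path in glob.glob('/content/drive/MyDrive/data/gan/images/*.png'):
    img = Image.open(path)
    if img.mode != 'RGB':                # catches grayscale ('L'), palette ('P'), RGBA, etc.
        img.convert('RGB').save(path)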
Sir, I am trying to train a StyleGAN2 but it stops after 1 tick only
i have the same problem! Let me know if you could find a solution please:)
@@abcdefg-zl7ew yes I solved the problem
@@abcdefg-zl7ew what is the size of your dataset
@@deewakarchakraborty4027 around 4000 pictures
@@abcdefg-zl7ew Try modifying the batch size; whatever batch you have, it's unable to load it into memory, and thus the system crashes.
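For concreteness, the batch size is overridden with train.py's --batch option; the value and paths here are only an example:
!python train.py --outdir=/content/experiments --data=/content/dataset --batch=8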
How can I add fine-tuning to fake images?
I have not tried that.
4k 4k 4k 4k, pleeeaaaase !
Must the images be JPEGs, or could they be PNGs?
Either works fine for input, but the stylegan image converter will convert them to PNG.
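The converter mentioned here is dataset_tool.py from the repo; a typical invocation that also resizes on the way in looks like the line below (paths are hypothetical, and --source/--dest/--width/--height are the flags as I recall them from the repo):
!python dataset_tool.py --source=/content/drive/MyDrive/data/gan/images --dest=/content/drive/MyDrive/data/gan/dataset --width=256 --height=256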
When do I stop training?
I have a question. does it have to be a square? Thank you.
StyleGAN2 ADA is built to process square images. GANs in general do not have such a requirement. It would take considerable modification to StyleGAN2 to process non-square images, and considerable inefficiencies would be introduced if the images were non-square. Additionally, the way StyleGAN2 is built, the dimensions MUST be a power of two. So you jump right from 1024 to 2048, then 4096, then 8192, etc.
@@HeatonResearch great thanks!!!
Looking for a solution to get a 4K GAN
Not sure what to do
Does not work anymore on Colab...
What is the Colab Pro disk space size, please?
Nearly the same as non-pro.
There is a torch incompatibility.
I wonder if they allow to train it on nsfw images 😈
lol, yes that has been done. Just google GAN p*rn.
Seems there's a new issue. I was able to recreate it with a fresh Google account, notebook, and images, and the issue persists.
File "/usr/local/lib/python3.7/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
from tensorboard.compat import notf # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.7/dist-packages/tensorboard/compat/__init__.py)
During handling of the above exception, another exception occurred:
File "/usr/local/lib/python3.7/dist-packages/jax/_src/pretty_printer.py", line 54, in _can_use_color
return sys.stdout.isatty()
AttributeError: 'Logger' object has no attribute 'isatty'
I'm too noob to figure out what's wrong but uninstalling tensorboard seems to allow training to continue. (Skipping tfevents export: No module named 'tensorboard')
Funny story is I paid for colab pro+ while I was training with the free GPU and only got the error when I tried training with my pro+ plan.
I am also facing the same issue. Did u resolve it?
@@shashankpalle5409 In the install StyleGAN2 tab add a # before it installs tensorboard
#!pip install tensorboard==1.14.0
I'm not an expert so I don't know how to fix it properly.
@@shashankpalle5409 I did some more testing and it appears my previous suggestion does not work and you have to manually uninstall tensorboard.
!pip uninstall tensorboard.
Or, you can change the version to something that stylegan will ignore like tensorboard==1.14.1
@@veazix Thank you so much!!! I have spent way too long trying all kinds of things to get my copy of Jeff's notebook to work - and your solution to uninstall tensorboard solved the problem :) [It's weird though, because up until about 5 days ago, I had run the notebook at least 20 times with no problems. Then something changed, which your solution fixed] Thanks again!
Thanks a lot @Devdabomber it solved the issue.
I got this huge error list when trying to train my model for the first time.
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.7/dist-packages/tensorboard/compat/__init__.py)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/stylegan2-ada-pytorch/train.py", line 538, in
main() # pylint: disable=no-value-for-parameter
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/content/stylegan2-ada-pytorch/train.py", line 531, in main
subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
File "/content/stylegan2-ada-pytorch/train.py", line 383, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/content/stylegan2-ada-pytorch/training/training_loop.py", line 240, in training_loop
stats_tfevents = tensorboard.SummaryWriter(run_dir)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/writer.py", line 220, in __init__
self._get_file_writer()
File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/writer.py", line 251, in _get_file_writer
self.flush_secs, self.filename_suffix)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/writer.py", line 61, in __init__
log_dir, max_queue, flush_secs, filename_suffix)
File "/usr/local/lib/python3.7/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in __init__
tf.io.gfile.makedirs(logdir)
File "/usr/local/lib/python3.7/dist-packages/tensorboard/lazy.py", line 65, in __getattr__
return getattr(load_once(self), attr_name)
File "/usr/local/lib/python3.7/dist-packages/tensorboard/lazy.py", line 97, in wrapper
cache[arg] = f(arg)
File "/usr/local/lib/python3.7/dist-packages/tensorboard/lazy.py", line 50, in load_once
module = load_fn()
File "/usr/local/lib/python3.7/dist-packages/tensorboard/compat/__init__.py", line 45, in tf
import tensorflow
File "/usr/local/lib/python3.7/dist-packages/tensorflow/__init__.py", line 51, in
from ._api.v2 import compat
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/__init__.py", line 37, in
from . import v1
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/__init__.py", line 30, in
from . import compat
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/compat/__init__.py", line 37, in
from . import v1
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/compat/v1/__init__.py", line 47, in
from tensorflow._api.v2.compat.v1 import lite
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/lite/__init__.py", line 9, in
from . import experimental
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/lite/experimental/__init__.py", line 8, in
from . import authoring
File "/usr/local/lib/python3.7/dist-packages/tensorflow/_api/v2/compat/v1/lite/experimental/authoring/__init__.py", line 8, in
from tensorflow.lite.python.authoring.authoring import compatible
File "/usr/local/lib/python3.7/dist-packages/tensorflow/lite/python/authoring/authoring.py", line 43, in
from tensorflow.lite.python import convert
File "/usr/local/lib/python3.7/dist-packages/tensorflow/lite/python/convert.py", line 29, in
from tensorflow.lite.python import util
File "/usr/local/lib/python3.7/dist-packages/tensorflow/lite/python/util.py", line 51, in
from jax import xla_computation as _xla_computation
File "/usr/local/lib/python3.7/dist-packages/jax/__init__.py", line 59, in
from .core import eval_context as ensure_compile_time_eval
File "/usr/local/lib/python3.7/dist-packages/jax/core.py", line 47, in
import jax._src.pretty_printer as pp
File "/usr/local/lib/python3.7/dist-packages/jax/_src/pretty_printer.py", line 56, in
CAN_USE_COLOR = _can_use_color()
File "/usr/local/lib/python3.7/dist-packages/jax/_src/pretty_printer.py", line 54, in _can_use_color
return sys.stdout.isatty()
AttributeError: 'Logger' object has no attribute 'isatty'
Also, I can't seem to find any .pkl files; I don't think the program created them for some reason.
same error here
I fixed it by running the following:
!pip uninstall jax jaxlib
!pip install jax[cpu]==0.3.10
@@cheaddaca3532 I see, when and where should I add the cell to run this?
@@thegoldenboy4681 before you run the initial training
@@cheaddaca3532 Thank you