I am working with Keras, and I often leave the Jupyter notebook running overnight and close the browser. When I reopen it, the notebook gets stuck in the (Starting) phase, and although the kernel is still running, I cannot see any updates in the output cells. Leaving the browser open doesn't help, because eventually the output freezes anyway.
I don't know the exact memory taken by the notebook, but it's a convnet with several layers, so I'd bet it's quite heavy. I am working on Ubuntu with Firefox.
Any help is appreciated, thanks!
I had the same issue, and here is the workaround that I found:
Use the callback functionality to log the intermediate results of the model into a CSV file:
from keras.callbacks import CSVLogger
csv_logger = CSVLogger('training.log')
model.fit(X_train, Y_train, callbacks=[csv_logger])
Also, add %%capture at the top of the cell where you are calling the fit function, so that no output is displayed in Jupyter at all.
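With the cell output suppressed, progress can still be checked by reading training.log from another cell or a terminal. A minimal sketch, where the file written below only simulates CSVLogger's output (the real columns depend on the metrics you compiled the model with):

```python
import csv

# Simulate the CSV that keras.callbacks.CSVLogger would write during fit():
# one header row, then one row per epoch (columns here are a stand-in).
with open("training.log", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["epoch", "loss"])
    w.writerow([0, 0.92])
    w.writerow([1, 0.41])

# Check training progress by reading the last logged epoch.
with open("training.log") as f:
    rows = list(csv.DictReader(f))
last = rows[-1]
print(f"last epoch: {last['epoch']}, loss: {last['loss']}")
```

This way the browser tab freezing no longer matters: the log file on disk is the source of truth for how far training has progressed.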
I've just got an M1 MacBook Pro and have been working on getting my development environment set up. I have followed instructions to install PyTorch with mini-forge so that it will run natively.
To test this, I've been running the MNIST example code from here. The code runs fine, and Activity Monitor shows me that it is running natively. However, something still seems to be wrong: when the exact same code is executed on my other machine, it usually achieves 99% accuracy after one epoch, and the test loss is very low. On the M1 MacBook, although the training loss goes down, the test loss never drops below 2 and the accuracy never exceeds 10%, even after 14 epochs.
Similar issues occurred when running other code, e.g. this produces a loss of nan at every step.
I haven't been able to find this issue mentioned anywhere else: PyTorch is running natively on the M1, but it just doesn't seem to be working correctly.
I'd appreciate any suggestions as to what might be causing this or how I might fix it.
Thanks!
In using tensorboard I have cleared my data directory and trained a new model but I am seeing images from an old model. Why is tensorboard loading old data, where is it being stored, and how do I remove it?
TensorBoard was built with caching: in case a long training run fails, there are "bak"-like files that your board will generate visualizations from. Unfortunately, there is no good way to manually remove these hidden temp files, as they are not shown when listing files in bash (even including ones with a . (dot) prefix); this storage is self-managed. As best practices:
(1) Make your TensorBoard run name dynamic, so each run's results are kept separate: this can be done using the datetime library in combination with an f-string in Python, so that the name of each run carries a time stamp. (This can be done right from Python, say a Jupyter notebook, if you import the subprocess package and run your bash command straight from the script.)
(2) Additionally, you are strongly advised to save your logdir (log directory) separately from where you are running the code.
These two practices together should solve all the problems related to tmp files erroneously populating new results.
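The two practices above can be sketched as follows; the "logs/" prefix and the timestamp format are arbitrary choices, not requirements:

```python
from datetime import datetime

# Practice (1): a timestamped run name via datetime + an f-string, so each
# run writes to its own directory and never mixes with stale cached files.
run_name = f"run_{datetime.now():%Y%m%d-%H%M%S}"
logdir = f"logs/{run_name}"
print(logdir)

# Practice (2) pairs with pointing TensorBoard at that separate logdir,
# e.g. via subprocess from the same script (command built, not run, here):
cmd = ["tensorboard", "--logdir", logdir]
```

Because every run gets a unique directory name, deleting old results is now just deleting old directories, with no risk of TensorBoard resurrecting cached data into a new run.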
I ran a reinforcement learning training script which used Pytorch and logged data to tensorboardX and saved checkpoints. Now I want to continue training. How do I tell tensorboardX to continue from where I left off? Thank you!
I figured out how to continue the training plot. When creating the SummaryWriter, we need to provide the same log_dir that we used when training the first time.
from tensorboardX import SummaryWriter
writer = SummaryWriter('log_dir')
Then, inside the training loop, the step needs to start from where it left off (not from 0):
writer.add_scalar('average reward',rewards.mean(),step)
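A hypothetical sketch of how the step offset can be carried across restarts: it assumes your checkpoint dict stores the last global step under a 'step' key, which is a convention you would have to adopt yourself when saving (e.g. torch.save({'step': step, ...}, 'ckpt.pt')):

```python
# Stand-in for torch.load('ckpt.pt'); in real code this dict would also
# hold model and optimizer state.
checkpoint = {"step": 5000}
start_step = checkpoint["step"]

# Resumed training loop: offset the counter so add_scalar continues the
# same curve in TensorBoard instead of drawing a second one from step 0.
for i in range(3):
    step = start_step + i
    # writer.add_scalar('average reward', rewards.mean(), step)
    print(step)
```

If the step restarted from 0 instead, TensorBoard would overlay the new points on top of the old ones for the same log_dir, which is why both the log_dir and the step must match the previous run.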
Since I am kind of new in this field, I tried following the official TensorFlow tutorial for predicting time series: https://www.tensorflow.org/tutorials/structured_data/time_series
The following problem occurs:
- When training a multivariate model, the kernel dies and restarts after 2 or 3 epochs.
However, this doesn't happen with a simpler univariate model, which has only one LSTM layer (I'm not really sure if this makes a difference).
Also, this problem only started today; yesterday, training the multivariate model was possible and error-free.
As can be seen in the tutorial linked above, the model looks like this:
multi_step_model = tf.keras.models.Sequential()
multi_step_model.add(tf.keras.layers.LSTM(32, return_sequences=True, input_shape=x_train_multi.shape[-2:]))
multi_step_model.add(tf.keras.layers.LSTM(16, activation='relu'))
multi_step_model.add(tf.keras.layers.Dense(72))
multi_step_model.compile(optimizer=tf.keras.optimizers.RMSprop(clipvalue=1.0), loss='mae')
And the kernel dies after executing the following cell (usually after 2 or 3 epochs).
multi_step_history = multi_step_model.fit(train_data_multi, epochs=10,
                                          steps_per_epoch=300,
                                          validation_data=val_data_multi,
                                          validation_steps=50)
I have uninstalled and reinstalled tf, restarted my laptop, but nothing seems to work.
Any ideas?
OS: Windows 10
Surface Book 1
The problem was too large a batch size. Reducing it from 1024 to 256 solved the crashing problem.
The solution was taken from the comment by rbwendt on this thread on GitHub.
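For intuition on why the batch size matters here: the memory needed for each training step grows linearly with the batch dimension of the input tensor (batch_size, timesteps, features). A rough back-of-the-envelope sketch, where the 120 timesteps and 3 features are made-up stand-ins rather than values from the tutorial:

```python
# Rough per-batch memory for a float32 input tensor of shape
# (batch_size, timesteps, features); 4 bytes per float32 value.
# Activations and gradients scale with the same batch dimension.
def batch_input_bytes(batch_size, timesteps, features, bytes_per_value=4):
    return batch_size * timesteps * features * bytes_per_value

big = batch_input_bytes(1024, 120, 3)
small = batch_input_bytes(256, 120, 3)
print(big // small)  # 4: quartering the batch size quarters this footprint
```

So cutting the batch size from 1024 to 256 cuts the per-step footprint by 4x, which is plausibly enough to keep the process under the memory limit that was killing the kernel.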
I am using PyTorch on Windows 10 and am having trouble understanding the correct use of PyTorch TensorboardX.
After instantiating a writer (writer = SummaryWriter()) and adding the value of the loss function (in every iteration) to it (writer.add_scalar('data/loss_func', loss.data[0].item(), iteration)), I have a folder which contains the saved run.
My questions are:
1) As far as I understand, after the training is complete, I need to write in the terminal (which corresponds to the command line prompt in Windows):
tensorboard --logdir=C:\Users\DrJohn\Documents\runs
where this is the folder which contains the file created by tensorboardX. What is the valid syntax in the Windows command prompt? I couldn't figure this out from the online tutorials.
2) Is it possible to watch the learning during training using tensorboardX (i.e. to plot the learning curve during the iterations)? Or is the only option to see everything once the training ends?
Thanks in advance