where could I find training.pt / test.pt - pytorch

In the Pytorch docs for MNIST I read:
root (string): Root directory of dataset where MNIST/processed/training.pt
and MNIST/processed/test.pt exist.
Where can I find these two files, training.pt and test.pt? And what is their format?

Assuming PyTorch 1.x+, the constructor of torchvision.datasets.MNIST follows this signature:
torchvision.datasets.MNIST(root, train=True, transform=None, target_transform=None, download=False)
The easiest way to get the dataset is to set download=True; that way it will automatically download the raw data and create training.pt and test.pt for you. They are serialized PyTorch tensors, and they end up under the root directory you pass, i.e. root/MNIST/processed/training.pt and root/MNIST/processed/test.pt, although you normally don't have to touch them directly.
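For reference, a minimal sketch (the root path ./data is just an arbitrary example, not something required by the API) that downloads the data and creates those two files:

# Sketch: download MNIST and let torchvision create the processed files.
import torchvision

train_set = torchvision.datasets.MNIST(
    root="./data",            # training.pt / test.pt will live under ./data/MNIST/processed/
    train=True,
    download=True,            # fetch the raw idx files and build the .pt files if missing
    transform=torchvision.transforms.ToTensor(),
)
test_set = torchvision.datasets.MNIST(root="./data", train=False, download=True)

print(len(train_set), len(test_set))  # 60000 10000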

Related

Splitting an image dataset

I have an image dataset (a folder of jpg images). I would like to split it randomly: 70% for train and 30% for test.
So I wrote this simple script:
from sklearn.model_selection import train_test_split
path = ".\dataset"
output_split=train_test_split(path,path,test_size=0.2)
But I don't find anything in the folder "output_split".
So where is the output of the split (train and test) stored?
I recommend using tf.data in your project. You can reference this link for documentation:
https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory
And this link for an example of the function's use:
https://keras.io/api/preprocessing/image/
Please remember to use the same seed for both the training and the testing/validation subsets.
Please note that no changes will happen to the files on your local disk; the split only exists in memory.
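As a rough sketch of what that looks like (the "dataset" folder name, image size and batch size are placeholders, and the function expects one subfolder per class inside the directory):

# Sketch only: split a folder of images 70/30 in memory with Keras utilities.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    validation_split=0.3,    # 30% held out for test
    subset="training",
    seed=42,                 # same seed in both calls so the splits don't overlap
    image_size=(224, 224),
    batch_size=32,
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    validation_split=0.3,
    subset="validation",
    seed=42,
    image_size=(224, 224),
    batch_size=32,
)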

How To Import The MNIST Dataset From Local Directory Using PyTorch

I am writing code for the well-known MNIST database of handwritten digits in PyTorch. I downloaded the training and test datasets (from the main website), including the label files. The files come as e.g. t10k-images-idx3-ubyte.gz and, after extraction, t10k-images-idx3-ubyte. My dataset folder looks like:
MINST
  Data
    train-images-idx3-ubyte.gz
    train-labels-idx1-ubyte.gz
    t10k-images-idx3-ubyte.gz
    t10k-labels-idx1-ubyte.gz
Now, I wrote code to load the data like below:
def load_dataset():
    data_path = "/home/MNIST/Data/"
    xy_trainPT = torchvision.datasets.ImageFolder(
        root=data_path, transform=torchvision.transforms.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        xy_trainPT, batch_size=64, num_workers=0, shuffle=True
    )
    return train_loader
My code raises the error: Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp
How can I solve this problem? I also want to check that my images are loaded correctly (e.g. a figure containing the first 5 images from the dataset).
Read this: Extract images from .idx3-ubyte file or GZIP via Python
Update
You can import the data using this format:
xy_trainPT = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.Compose([torchvision.transforms.ToTensor()]),
)
Now, what happens with download=True: first, your code checks whether the root directory (the path you gave) already contains the dataset.
If not, the dataset is downloaded from the web.
If the path already contains the dataset, your code uses the existing copy and does not download it from the internet.
You can verify this: first give a path without any dataset (the data will be downloaded from the internet), and then give another path which already contains the dataset (nothing will be downloaded).
Welcome to Stack Overflow!
The MNIST dataset is not stored as images, but in a binary format (as indicated by the ubyte extension). Therefore, ImageFolder is not the type of dataset you want. Instead, you will need to use the MNIST dataset class. It can even download the data if you have not done it already :)
This is a dataset class, so just instantiate it with the proper root path, then pass it as the parameter of your dataloader and everything should work just fine.
If you want to check the images, just index into the dataset (its get method), and save the result as a png file (you may need to convert the tensor to a numpy array first).
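As a rough sketch of that check (reusing the root path from the answer above; matplotlib is just one way to render the figure):

# Sketch: load MNIST with the dataset class and save the first 5 images to a png.
import matplotlib.pyplot as plt
import torchvision

dataset = torchvision.datasets.MNIST(
    root="~/Handwritten_Deep_L/",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)

fig, axes = plt.subplots(1, 5)
for i, ax in enumerate(axes):
    image, label = dataset[i]                      # image is a 1x28x28 tensor
    ax.imshow(image.squeeze().numpy(), cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
fig.savefig("first_five_digits.png")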

using keras's .flow_from_directory() on mounted s3 bucket in databricks

I am trying to build a convolutional neural network in databricks using Spark 2.4.4 and a Scala 2.11 backend in python. I have build CNN's before but this is my first time with using Spark(databricks) and AWS s3.
The files in AWS are ordered like this:
train_test_small/(train or test)/(0,1,2 or 3)/
And then a list of images in every directory corresponding to their category (0, 1, 2, 3).
In order to access my files stored in the s3 bucket I mounted the bucket to databricks like this:
# load in the image files
AWS_BUCKET_NAME = "sensored_bucket_name/video_topic_modelling/data/train_test_small"
MOUNT_NAME = "train_test_small"
dbutils.fs.mount("s3a://%s" % AWS_BUCKET_NAME, "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
Upon using: display(dbutils.fs.mounts()) I can see the bucket mounted to:
MountInfo(mountPoint='/mnt/train_test_small', source='sensored_bucket_name/video_topic_modelling/data/train_test_small', encryptionType='')
I then try to access this mounted directory through keras's flow_from_directory() module using the following piece of code:
# create extra partition of the training data as a validation set
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                   validation_split=0)  # included in our dependencies

# set scaling to most common shapes
train_generator = train_datagen.flow_from_directory('/mnt/train_test_small',
                                                    target_size=(320, 240),
                                                    color_mode='rgb',
                                                    batch_size=96,
                                                    class_mode='categorical',
                                                    subset='training')
                                                    # shuffle=True)
validation_generator = train_datagen.flow_from_directory('/mnt/train_test_small',
                                                         target_size=(320, 240),
                                                         color_mode='rgb',
                                                         batch_size=96,
                                                         class_mode='categorical',
                                                         subset='validation')
However this gives me the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/train_test_small/train/'
I tried to figure this out using the Keras and Databricks documentation but got no further. My best guess right now is that Keras's flow_from_directory() is unable to detect mounted directories, but I am not sure.
Does anyone out there know how to apply .flow_from_directory() to an s3-mounted directory in Databricks, or know a good alternative? Help would be much appreciated!
I think you may be missing one more directory level in the path you pass to flow_from_directory. From the Keras documentation:
directory: string, path to the target directory. It should contain one subdirectory per class. Any PNG, JPG, BMP, PPM or TIF images inside each of the subdirectories directory tree will be included in the generator.
# set scaling to most common shapes
train_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small/train',  # <== add "train" folder
    target_size=(320, 240),
    ...
validation_generator = train_datagen.flow_from_directory(
    '/mnt/train_test_small/test',   # <== add "test" folder
    target_size=(320, 240),
    ....
Answer found. To access the mounted folder through the local file API, prefix the path with /dbfs, i.e. /dbfs/mnt/train_test_small/train/
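Combining both answers, a sketch of what the working calls could look like (same generator settings as in the question; subset= is dropped here because train and test are now separate folders):

# Sketch: point Keras at the mounted bucket via the /dbfs local-file prefix,
# one call per split directory (train / test).
train_generator = train_datagen.flow_from_directory(
    '/dbfs/mnt/train_test_small/train',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical')

validation_generator = train_datagen.flow_from_directory(
    '/dbfs/mnt/train_test_small/test',
    target_size=(320, 240),
    color_mode='rgb',
    batch_size=96,
    class_mode='categorical')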

Importing Tensorboard for maskRcnn(Matterport - Mask RCNN)

I am currently trying to implement Mask RCNN by following the Matterport repo. I have a question regarding the TensorBoard setup.
The dataset is similar to the COCO dataset. Inside model.py, under def train, TensorBoard is mentioned as
callbacks = [keras.callbacks.TensorBoard(log_dir=self.log_dir, histogram_freq=0, write_graph=True, write_images=False)]
But what else should I specify to use TensorBoard? When I try to run TensorBoard, it says the log file is not found. I know there is something I am missing somewhere!! Please help me out!
In your model.train() call, make sure you pass the custom_callbacks=callbacks parameter.
If you have specified these parameters exactly like this, then your issue is that you are not pointing TensorBoard at the right logs directory.
Open a terminal (inside Anaconda/PyCharm, or a separate one) and pass the absolute path (to make sure it works):
tensorboard --logdir=my_absolute_path/logs/
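As a rough sketch, assuming a version of the Matterport repo whose model.train() accepts a custom_callbacks argument (dataset_train, dataset_val and config are the objects you already build when following the repo):

# Sketch: pass an extra TensorBoard callback into Matterport's training loop.
import keras

tb_callback = keras.callbacks.TensorBoard(log_dir=model.log_dir,
                                          histogram_freq=0,
                                          write_graph=True,
                                          write_images=False)

model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30,
            layers="heads",
            custom_callbacks=[tb_callback])

Then point TensorBoard at that same log directory when you launch it.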

Tensorflow object detection API tfrecord

I'm new to the TensorFlow TFRecord format, so I'm studying the TensorFlow Object Detection API code:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md
but I can't find the code that loads the tfrecord.
I think they use the .config file to load the tfrecord, because I found this in the config file:
tf_record_input_reader {
  input_path: "/path/to/train_dataset.record-?????-of-00010"
}
Can anyone help?
Have you converted your dataset to TFRecord format yet?
If so, you should have a path which contains your training dataset sharded to a few record files, with the format
<path_to_training_data>/<train_dataset>.record-?????-of-xxxxx
Where <path_to_training_data> is the above-mentioned path to your training dataset, <train_dataset> is the file name you gave to each file, xxxxx is the number of record files created (e.g. 00010), and ????? should be left as is; it acts as a wildcard matching all record files.
Once you've replaced <path_to_training_data>, <train_dataset> and xxxxx with the correct values for your dataset, the TF OD API should handle everything else (finding all records, reading them, etc.).
Note there's usually tf_record_input_reader for both training dataset and eval dataset (validation/test), and each should have the corresponding above-mentioned values (path, dataset name, number of files).
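For intuition, a small sketch of how such a sharded record file can be read directly with tf.data (the OD API does the equivalent internally from the config; the path below is the placeholder pattern from the question, not a real file):

# Sketch only: read sharded .record files and inspect one serialized example.
import tensorflow as tf

files = tf.data.Dataset.list_files(
    "/path/to/train_dataset.record-?????-of-00010")     # glob over all shards
dataset = files.interleave(tf.data.TFRecordDataset,
                           num_parallel_calls=tf.data.AUTOTUNE)

for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())          # decode one tf.train.Example
    print(example.features.feature.keys())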
