Google Colab is so slow while reading images from Google Drive - keras

I have my own dataset for a deep learning project. I uploaded it to Google Drive and linked it to a Colab notebook. But Colab can read only 2-3 images per second, whereas my computer can read dozens. (I used imread to read the images.)
There is no speed problem with the Keras model compilation step, only with reading images from Google Drive. Does anybody know a solution? Someone ran into this problem too, but it's still unsolved: Google Colab very slow reading data (images) from Google Drive (I know this is kind of a duplicate of the question in the link, but I reposted it because it is still unsolved. I hope this is not a violation of Stack Overflow rules.)
Edit: the code I use for reading images:
import os
import numpy as np
from PIL import Image
from skimage import color

def getDataset(path, classes, pixel=32, rate=0.8):
    X = []
    Y = []
    # getting images:
    for root, _, files in os.walk(path):
        for file in files:
            imagePath = os.path.join(root, file)
            className = os.path.basename(root)
            try:
                image = Image.open(imagePath)
                image = np.asarray(image)
                image = np.array(Image.fromarray(image.astype('uint8')).resize((pixel, pixel)))
                image = image if len(image.shape) == 3 else color.gray2rgb(image)
                X.append(image)
                Y.append(classes[className])
            except Exception:
                print(file, "could not be opened")
    X = np.asarray(X, dtype=np.float32)
    Y = np.asarray(Y, dtype=np.int16).reshape(1, -1)
    return shuffleDataset(X, Y, rate)  # shuffleDataset is defined elsewhere in the notebook

I'd like to provide a more detailed answer about what unzipping the files actually looks like. This is the best way to speed up reading data because unzipping the file into the VM disk is SO much faster than reading each file individually from Drive.
Let's say you have the desired images or data in your local machine in a folder Data. Compress Data to get Data.zip and upload it to Drive.
Now, mount your drive and run the following command:
!unzip "/content/drive/My Drive/path/to/Data.Zip" -d "/content"
Simply amend all your image paths to go through /content/Data, and reading your images will be much much faster.
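Put together, a typical Colab cell for this looks something like the sketch below; the Drive path is just the placeholder from above, so adjust it to wherever Data.zip actually lives.

# Minimal sketch of the mount + unzip workflow described above (Colab cell).
from google.colab import drive
drive.mount('/content/drive')

# One-time unzip onto the VM's fast local disk; the Drive path is a placeholder.
!unzip -q "/content/drive/My Drive/path/to/Data.zip" -d "/content"

# From here on, read images from /content/Data instead of the Drive path.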

I recommend uploading your files to GitHub and then cloning the repository into Colab. That reduced my training time from 1 hour to 3 minutes.
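In Colab that is a single cell; a minimal sketch, with a placeholder repository URL:

# Replace <user>/<repo> with wherever the dataset is actually hosted.
!git clone --depth 1 https://github.com/<user>/<repo>.git /content/dataset
# The images are then on the VM's local disk under /content/dataset.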

Upload zip files to Drive. After transferring them to Colab, unzip them there. The per-file copy overhead is what hurts, so don't copy masses of individual files; copy a single zip and unzip it.

Related

Is there a way to use sklearn.datasets.load_files for image files

Trying to use custom folders with images instead of X, y = sklearn.datasets.load_digits(return_X_y=True) for sklearn image classification tasks.
load_files does what I need, but it seems to be designed for text files. Any tips for working with image files would be appreciated.
I have the image files stored in following structure
DataSet/label1/image1.png
DataSet/label1/image2.png
DataSet/label1/image3.png
DataSet/label2/image1.png
DataSet/label2/image2.png
I had the same task and found this thread: Using sklearn load_files() to load images from png as data
Hopefully, this helps you too.
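For reference, a minimal sketch of the approach from that thread: let load_files collect the filenames and integer labels (with load_content=False, since the content isn't text), then read the pixels yourself. The 32x32 resize is an arbitrary choice for the example.

import numpy as np
from PIL import Image
from sklearn.datasets import load_files

# load_files walks DataSet/<label>/... and returns filenames plus integer targets.
dataset = load_files("DataSet", load_content=False)

X = np.stack([np.asarray(Image.open(f).convert("RGB").resize((32, 32)))
              for f in dataset["filenames"]])
y = np.asarray(dataset["target"])
print(X.shape, y.shape, dataset["target_names"])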

Dataset labeled as not found or corrupt, but the dataset is not corrupt

I have been trying to use this GitHub repo (https://github.com/AntixK/PyTorch-VAE) and load the CelebA dataset using the config file listed. Specifically, in vae.yaml I have placed the path of the unzipped folder where I downloaded the CelebA dataset (https://www.kaggle.com/jessicali9530/celeba-dataset) on my computer. Every time I run the program, I keep getting these errors:
File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py", line 67, in init
' You can use download=True to download it')
RuntimeError: Dataset not found or corrupted. You can use download=True to download it
AttributeError: 'VAEXperiment' object has no attribute '_lazy_train_dataloader'
I have tried to download the dataset, but nothing changes. So I have no idea why the program is not running.
run.py calls experiment.py, which uses this dataloader to retrieve the data:
def train_dataloader(self):
    transform = self.data_transforms()
    if self.params['dataset'] == 'celeba':
        dataset = CelebA(root=self.params['data_path'],
                         split="train",
                         transform=transform,
                         download=False)
    else:
        raise ValueError('Undefined dataset type')
    self.num_train_imgs = len(dataset)
    return DataLoader(dataset,
                      batch_size=self.params['batch_size'],
                      shuffle=True,
                      drop_last=True)
The config file supplies the value passed as root. What I did was upload a few .jpg files to Google Colab, and when I run the command stated in the GitHub repo, python run.py -c config/vae.yaml, it states that the dataset is not found or is corrupt. I have tried this on my Linux machine and the same error occurs, even with the downloaded and unzipped data. I have gone further and changed self.params['data_path'] to the actual path, and that still does not work. Any ideas what I can do?
My pytorch version is 1.6.0.
There are two issues I have faced. Below is my workaround; it is not official, but it works for me. Hopefully a future PyTorch release fixes this.
Issue 1: Dataset not found or corrupted.
When I checked the file celeba.py in the torchvision library, I found this line:

if ext not in [".zip", ".7z"] and not check_integrity(fpath, md5):
    return False

This check makes self._check_integrity() return False, which produces exactly the error message we got.
Fix: you can skip this check by adding an "if False:" immediately in front of it:

if False:
    if ext not in [".zip", ".7z"] and not check_integrity(fpath, md5):
        return False

Issue 2: celeba.py downloads the dataset when you choose download=True, but two of the downloaded files are broken: "list_landmarks_align_celeba.txt" and "list_attr_celeba.txt".
You need to find working copies somewhere, download them, and replace the broken ones.
Hope these solutions help!
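As far as I know, torchvision's CelebA class also expects a celeba folder inside the root you pass as data_path, containing the image folder and the annotation files. A quick sanity-check sketch (the data_path value below is hypothetical):

import os

data_path = '/content/data'  # hypothetical value of params['data_path']
expected = [
    'img_align_celeba',                 # folder with the .jpg images
    'list_attr_celeba.txt',
    'identity_CelebA.txt',
    'list_bbox_celeba.txt',
    'list_landmarks_align_celeba.txt',
    'list_eval_partition.txt',
]
for name in expected:
    full = os.path.join(data_path, 'celeba', name)
    print(('found   ' if os.path.exists(full) else 'missing ') + full)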

How can I load local images to train a model in TensorFlow

I am trying to build a CNN to differentiate between a car and a bicycle. I saw the same kind of example with horses and humans in Laurence's example here. But instead of loading the data from some library, I downloaded close to 5000 images of cars and bicycles and organized them into folders as suggested in the video. How do I load the local files to train my model? I am trying to use the code below, but it gives me a file-not-found exception. Here is the link to the Colab notebook I am working in.
import os
# Directory with our training cycle pictures
train_cycle_dir = os.path.join('C:/Users/User/Desktop/Tensorflow/PrivateProject/Images/training/cycle')
# Directory with our training car pictures
train_car_dir = os.path.join('C:/Users/User/Desktop/Tensorflow/PrivateProject/Images/training/cars')
# Directory with our validation cycle pictures
validation_cycle_dir = os.path.join('C:/Users/User/Desktop/Tensorflow/PrivateProject/Images/validation/cycle')
# Directory with our validation car pictures
validation_car_dir = os.path.join('C:/Users/User/Desktop/Tensorflow/PrivateProject/Images/validation/cars')
You can't access files that are on your computer directly from Colab. If you have enough space on your Google Drive, you can upload them there and mount the Drive as shown here, or with the "Mount Drive" button in the Files sidebar. There is also a button there to upload files from your computer, but anything you upload straight to Colab has to be uploaded again after roughly 12 hours, when the runtime resets. (You can read about it here.)
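A minimal sketch of that workflow, assuming the folder layout from the code above and a placeholder Drive path:

# Colab cell: mount Drive, then point Keras' ImageDataGenerator at the folders.
from google.colab import drive
from tensorflow.keras.preprocessing.image import ImageDataGenerator

drive.mount('/content/drive')

base_dir = '/content/drive/My Drive/PrivateProject/Images'  # placeholder path

train_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    base_dir + '/training', target_size=(150, 150),
    batch_size=32, class_mode='binary')
validation_gen = ImageDataGenerator(rescale=1./255).flow_from_directory(
    base_dir + '/validation', target_size=(150, 150),
    batch_size=32, class_mode='binary')
# model.fit(train_gen, validation_data=validation_gen, epochs=...) as in the tutorial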

Downloading S3 files in Google Colab

I am working on a project where some data is provided through an S3 file system. I can read the data using S3FileSystem.open(path), but there are more than 360 files and it takes at least 3 minutes to read a single one. Is there any way to download these files to my system and read them from there, instead of reading them directly from S3? There is another reason: even though I can read all those files, once my Colab session reconnects I have to re-read them all, which takes a lot of time. I am using the following code to read the files:
fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine= 'h5netcdf')
Is there any way of downloading those files?
You can use the other s3fs (the FUSE command-line tool, not the Python library) to mount the bucket and then copy the files to Colab; see: how to mount.
After mounting, you can run:
!cp /s3/yourfile.zip /content/
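Alternatively, you can stay with the Python s3fs library you are already using and just download the files to the VM's local disk once, then read them locally. A rough sketch, where 'bucket/prefix' is a placeholder for the real S3 location:

import os
import s3fs
import xarray as xr

fs_s3 = s3fs.S3FileSystem(anon=True)
os.makedirs('/content/data', exist_ok=True)

# One-time download of every file under the (placeholder) prefix.
for s3path in fs_s3.ls('bucket/prefix'):
    fs_s3.get(s3path, os.path.join('/content/data', os.path.basename(s3path)))

# Afterwards, open the local copies instead of the remote objects.
ds = xr.open_dataset('/content/data/file_name', engine='h5netcdf')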

Cloning images with Python 3+Pillow in order to strip hidden metadata

I have a client that created several large PDFs, each containing hundreds of images within them. The images were created with a program that adds unique info to each file; random binary data was placed in some file headers, some files have data disguised as image artifacts, and general metadata in each image. While I'm unfamiliar with the program, I understand that it's a marketing software suite of some sort so I assume the data is used for tracking online distribution and analytics.
I have the source files used to create the PDFs and while I could open each image, clone its visual data, strip metadata and re-compress the images to remove the identification data, I would much rather automate the process using Pillow. The problem is, I'm worried I could miss something. The client is hoping to release the files from behind an online username, and he doesn't want the username tied to this program or its analytical tracking mechanisms.
So my question is this: how would I clone an image with Pillow in a way that would strip all identifying metadata? The image files are massive, ranging from 128MB to 2GB. All of the images are either PNG uncompressed or JPEG files with very mild compression. I'm not married to Pillow, so if there's a better software library (or standalone software) that better suits this, I'll use it instead.
Just use ImageMagick, which is installed on most Linux distros and available for macOS and Windows. In the terminal, strip the metadata from an input file and resave it:
magick input.jpg -strip result.jpg
If you want to do all JPEGs in current directory:
magick mogrify -strip *.jpg
Or maybe you want to change the quality a little too:
magick mogrify -quality 82 -strip *.jpg
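If you would rather drive that from Python (for example to mix it into a larger script), here is a small sketch that shells out to the magick binary, assuming it is on PATH:

import pathlib
import subprocess

# Strip metadata from every JPEG in the current directory via ImageMagick.
for img in pathlib.Path('.').glob('*.jpg'):
    out = img.with_name(img.stem + '_stripped.jpg')
    subprocess.run(['magick', str(img), '-strip', str(out)], check=True)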
Copying the pixel data to a new image should remove all metadata, and recompressing the image slightly as a JPEG should disrupt steganographic tracking data.
You may have to modify the load/copy/save steps to deal with large files, and pay attention to PIL's image size limits. Opacity in PNG files isn't handled here.
import os
from PIL import Image

picture_dir = ''

for subdir, dirs, files in os.walk(picture_dir):
    for f in files:
        ext = os.path.splitext(f)[1]
        if ext in ['.jpg', '.jpeg', '.png']:
            full_path = os.path.join(subdir, f)
            im = Image.open(full_path)
            data = list(im.getdata())
            no_exif = Image.new('RGB', im.size)  # not handling opacity
            no_exif.putdata(data)  # should strip exif
            out_path = full_path.split(ext)[0] + 'clean.jpg'
            no_exif.save(out_path, 'JPEG', quality=95)  # compressing should remove steganography
