Loading dataset in UINT8 format

I was looking through the load_data() function in Python that returns X_train, X_test, Y_train, and Y_test, as in this link. As you can see, it is for the CIFAR10 and CIFAR100 datasets, and it returns the above-mentioned values as uint8 arrays.
I wanted to know: is there some other function like this for loading datasets stored locally on our system?
If so, please help me with its usage; if not, please suggest an alternative.
Thanks in advance.

load_data() is not part of Python itself; it is defined in the keras.datasets.cifar10 module. To load the CIFAR dataset (or any other dataset), there are many possible methods, depending on how the dataset is packaged/formatted. Usually, the pandas module can be used for loading/saving/manipulating table-like data.
For CIFAR data, here is another example: loading an image from the CIFAR-10 dataset.
Here the author uses the pickle module to unpack the dataset, and then the PIL and numpy modules to load and manipulate individual images.
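As a rough sketch of that approach, here is what unpacking one batch of the CIFAR-10 "python version" archive can look like (the file name data_batch_1 is an assumption; the b'data'/b'labels' keys follow the documented CIFAR-10 format):
import pickle
import numpy as np

def load_cifar_batch(path):
    # each CIFAR-10 python-version batch file is a pickled dict
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    # b'data' is an (N, 3072) uint8 array: 1024 R, 1024 G, 1024 B values per image
    images = batch[b'data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    labels = np.array(batch[b'labels'])
    return images, labels

images, labels = load_cifar_batch('data_batch_1')  # images.dtype is uint8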

Related

How to build a customized dataset from MNIST in PyTorch

I am importing the MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True) and I am trying to make a smaller dataset from MNIST, let's say the first 10,000 images and corresponding labels. I know this can be handled with torch.utils.data.Subset. But what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST listed above, the result is an object of the torch.utils.data.Subset class).
Is there any way to use a fraction of the original MNIST dataset to create a new dataset (not a subset)?
Thanks in advance.
What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True)
# keep only the first 10,000 images and their corresponding labels
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]
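After the slicing, the object is still a torchvision.datasets.MNIST instance (MNIST derives its length from the data tensor), so it can be wrapped in a DataLoader as usual; a quick check (the batch size here is arbitrary):
import torch

print(type(dataset), len(dataset))  # still torchvision.datasets.MNIST, length 10000
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)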

Python XGBoost prediction discrepancies with DMatrix

I found there are two problems with xgboost predictions. I trained the model with XGBClassifier and tried to load the model using Booster for prediction, and I found:
Predictions are slightly different using xgb.Booster and xgb.XGBClassifier, see below.
Predictions are different between a list and a numpy array when using DMatrix, see below.
Some of the differences are quite big. I am not sure why this is happening, and which prediction should be the source of truth?
For the second question, your data types could change when you convert a list to a numpy array (depending on the numpy version you're using). For example, on numpy 1.19.5, try converting the list ["1", 1] to a numpy array and see the result.
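A minimal illustration of that coercion (the exact dtype may vary across numpy versions):
import numpy as np

arr = np.array(["1", 1])
print(arr, arr.dtype)  # ['1' '1'] <U1 -- the integer is silently coerced to a string
Once the values have become strings, DMatrix has to reinterpret them as numbers, which can plausibly account for predictions that differ between the two input types.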

TF-GAN tutorial by Google

I am studying the tutorial on GANs by Google. In this notebook they have defined input_fn, in which the MNIST dataset is loaded using tfds. I have generated my own dataset and have stored it in a numpy array (shape: 4500, 512, 512).
I can't understand how input_fn works and how I can modify it so that I can feed training data from my gdrive rather than downloading from TF datasets. I have noticed that input_fn is also used during training, when gan_estimator.train is called. Can anyone explain how this function works?
The function input_fn uses TensorFlow Datasets in the following lines to load MNIST:
(tfds.load('mnist', split=split)
 .map(_preprocess)
 .cache()
 .repeat())
You need to understand how TensorFlow datasets work in order to build your own with the required structure so it can be processed the same way.
You can get more information here.
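As a sketch of one way to adapt it, assuming your (4500, 512, 512) array is saved as a .npy file on a mounted gdrive (the path, the added channel axis, and the params['batch_size'] lookup are all assumptions modeled on the tutorial's Estimator-style input_fn):
import numpy as np
import tensorflow as tf

def input_fn(mode, params):
    # hypothetical path: load your own array instead of tfds.load('mnist', ...)
    images = np.load('/content/gdrive/MyDrive/my_dataset.npy')  # (4500, 512, 512)
    images = images.astype('float32')[..., np.newaxis]          # add a channel axis
    return (tf.data.Dataset.from_tensor_slices(images)
            .cache()
            .repeat()
            .shuffle(buffer_size=1000)
            .batch(params['batch_size']))
Note that from_tensor_slices keeps the whole array in memory; for larger datasets you would stream from files instead.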

Loading Training Images using Keras

To train a model using Keras, should I load all the images I have into an array to create something like
x_train, y_train
or is there a better way to read the images on the fly while training? I am not looking for the ImageDataGenerator class, since my output is an array of points, not classes based on directory names.
I managed to get my data CSV file to contain the array of points and the image file name in 9 columns, as follows:
x1 x2 ..... x8 Image_file_name
You can use this data with ImageDataGenerator. You incorrectly assume that it needs folders for classes, but that only applies to flow_from_directory. The method flow_from_dataframe allows you to load data from a Pandas dataframe, for example:
idg = ImageDataGenerator(...)
df = pd.read_csv('your_data.csv')
generator = idg.flow_from_dataframe(df, directory='image folder',
                                    x_col='filename_column',
                                    y_col=['col1', 'col2', ..., 'coln'],
                                    class_mode='other')
This generator will read data from the dataframe, load the image in directory whose filename is given by the value of x_col, and use the corresponding row to build the targets, which in this case will be a numpy array of the values of the columns in y_col. More information about this method can be found in the Keras documentation.
Loading the entire data set into memory as an array is not a great idea, because the memory consumption can go out of control, so you should use a generator. ImageDataGenerator and flow_from_dataframe are a great way of loading images in Keras. Since you don't want to use ImageDataGenerator (can you mention why?), you can create your own generator function that loads chunks of images into memory, as sketched below. If you load your data with a generator, make sure you use the fit_generator and predict_generator functions.
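A minimal sketch of such a custom generator, assuming the 9-column CSV layout from the question (the image_dir argument and the division by 255 are assumptions):
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def image_batch_generator(df, image_dir, batch_size=32):
    while True:  # Keras generators are expected to loop indefinitely
        batch = df.sample(batch_size)
        images = np.stack([img_to_array(load_img(image_dir + '/' + name))
                           for name in batch['Image_file_name']])
        targets = batch[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']].to_numpy()
        yield images / 255.0, targets  # (inputs, targets) with pixels scaled to [0, 1]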
To load unlabeled data you can do the following hack:
datagen = ImageDataGenerator()
test_data = datagen.flow_from_directory('.', classes=['directory_where_images_are_stored'])
For more information check out link [1].
[1] https://kylewbanks.com/blog/loading-unlabeled-images-with-imagedatagenerator-flowfromdirectory-keras

Pickle for data preprocessing

I was going through various tutorials and articles on using pickle on the ML model so that it can be used later.
But I am not able to find something like pickle for data preprocessing. I am doing the following preprocessing:
Changing the datatype of a few columns/features.
Feature engineering.
One-hot encoding/dummy variables.
Scaling the data using the code below:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training data, then transform it
X_test = sc.transform(X_test)        # reuse the training statistics on the test data
Now, I want to do this for every dataset which I pass in for predictions.
Is there any way to do something like pickle, to load the data preprocessing steps before I pass the data to the ML model loaded from pickle?
Please guide.
I created a function and saved it in an independent file. Then I call that function whenever required.
Below is the code showing how I call the data preprocessing function:
from DataPreparationv3 import Data_Preprocess
Base_Data = pd.read_csv('Validate.csv')
DataReady = Data_Preprocess(Base_Data)
This solved my problem.
Regards
Sudhir
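As a complement to the function-in-a-file approach, the fitted preprocessing objects themselves (such as the StandardScaler above) can be pickled just like the model, so the exact training-time transformation is reapplied at prediction time; a minimal sketch:
import pickle
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # X_train is your training data

# save the fitted scaler alongside the model
with open('scaler.pkl', 'wb') as f:
    pickle.dump(sc, f)

# later, at prediction time, reload and apply the same scaling
with open('scaler.pkl', 'rb') as f:
    sc = pickle.load(f)
X_new = sc.transform(X_new)  # X_new is the new data to predict on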
