Separate tensorflow dataset to different outputs in tensorflow2 - python-3.x

I have a dataset with 3 tensor outputs of data, label and path:
import tensorflow as tf  # TensorFlow version 2.1
data=tf.constant([[0,1],[1,2],[2,3],[3,4],[4,5],[5,6],[6,7],[7,8],[8,9],[9,0]],name='data')
labels=tf.constant([0,1,0,1,0,1,0,1,0,1],name='label')
path=tf.constant(['p0','p1','p2','p3','p4','p5','p6','p7','p8','p9'],name='path')
my_dataset=tf.data.Dataset.from_tensor_slices((data,labels,path))
I want to split my_dataset back into 3 datasets of data, labels and paths (or 3 tensors) without iterating over it and without converting it to numpy.
In tensorflow 1.X this is done simply using
d,l,p=my_dataset.make_one_shot_iterator().get_next()
and then converting the tensors to datasets. How can I do this in TensorFlow 2?
Thanks!

The solution I found does not look very "pythonic" but it works.
I used the map() method:
data = my_dataset.map(lambda x, y, z: x)
labels = my_dataset.map(lambda x, y, z: y)
paths = my_dataset.map(lambda x, y, z: z)
After this separation, the order of the labels stays the same.
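As a rough check of the claim about ordering, here is a minimal sketch that rebuilds a shortened version of the dataset from the question, splits it with map(), and compares the results with the inputs. The three-row tensors and the names ending in _ds are made up for this illustration, and the iteration at the end is only there to inspect the result; the split itself does not iterate.

```python
import tensorflow as tf

# Shortened stand-in for the dataset in the question (3 rows instead of 10).
data = tf.constant([[0, 1], [1, 2], [2, 3]])
labels = tf.constant([0, 1, 0])
paths = tf.constant(["p0", "p1", "p2"])
my_dataset = tf.data.Dataset.from_tensor_slices((data, labels, paths))

# Each map() keeps one component of every (data, label, path) element.
data_ds = my_dataset.map(lambda x, y, z: x)
labels_ds = my_dataset.map(lambda x, y, z: y)
paths_ds = my_dataset.map(lambda x, y, z: z)

# Iterate only to verify that the order matches the original tensors.
label_values = [int(v) for v in labels_ds.as_numpy_iterator()]
path_values = [p.decode() for p in paths_ds.as_numpy_iterator()]
print(label_values, path_values)
```

Note that each mapped dataset re-reads my_dataset independently. If my_dataset contained a shuffle step, the three results may no longer line up with each other, so it seems safer to split before shuffling.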

Related

Same sklearn pipeline different results

I have created a pipeline based on:
Custom tfidfvectorizer to transform tf IDF vector as dataframe (600 features)
Custom Features generator to create new features (5)
Feature Union to join the two dataframes. I checked the output is an array, so no feature names. (605)
Xgboost classifier model seed and random state included (8 classes as labels names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the input for testing is always the same, and so is the Python environment!
Could you help me with some suggestions?
Thanks!
Tried to save the pipeline with different libraries.
DenseTransformer in some points
Column transform instead of feature Union
I cannot use pmml library due to some restrictions
Etc
The problem is the same

How to build a customized dataset from MNIST in pytorch

I am importing the MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True) and I am trying to make a smaller dataset from MNIST, let's say the first 10,000 images and corresponding labels. I know this can be handled with torch.utils.data.Subset. But what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST that I list above, the result is an object of the torch.utils.data.Subset class).
Is there any possible way such that I can use a fraction of the original MNIST dataset to create a new dataset (not subset)?
Thanks in advance.
What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True,transform=transforms, download=True)
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]

Python XGBoost prediction discrepancies with DMatrix

I found 2 problems with xgboost predictions. I trained the model with XGBClassifier and tried to load the model using Booster for prediction. I found that:
Predictions are slightly different using xgb.Booster and xgb.XGBClassifier, see below.
Predictions are different between a list and a numpy array when using DMatrix, see below.
Some of the differences are quite big. I am not sure why this is happening, and which prediction should be the source of truth?
For the second question, your data types could change when you convert a list to a numpy array (depending on the numpy version you're using). For example on numpy 1.19.5, try converting list ["1",1] to a numpy array and see the result.
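The type change mentioned above can be seen with a small numpy-only sketch. The variable names are made up for this illustration, and the exact string width of the resulting dtype depends on the numpy version, so only the kind of dtype is checked here.

```python
import numpy as np

# A list mixing a string and an integer, as suggested in the answer.
mixed = ["1", 1]
as_array = np.array(mixed)

# numpy picks one common dtype for the whole array; here it is a
# unicode string type, so the integer 1 silently becomes the string "1".
print(as_array.dtype.kind)  # "U" means unicode string
print(as_array.tolist())

# Mixing ints and floats is promoted as well, this time to float64.
promoted = np.array([1, 2.5])
print(promoted.dtype)

# Stating the dtype explicitly avoids relying on numpy's inference.
explicit = np.asarray([1, 2], dtype=np.float32)
```

One option, under the assumption that the features are meant to be numeric, is to convert the input to an array with an explicit dtype before building the DMatrix, so that the list and array code paths see the same values.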

Loading dataset in UINT8 format- python

I was actually looking through the "load_data()" function in python that returns X_train, X_test, Y_train and Y_test as in this link. As you see it is for CIFAR10 and CIFAR100 dataset, that returns the above mentioned values as uint8 array.
I wanted to know whether there is some other function like this for loading datasets stored locally on our system.
If so please help me with its usage and if not please suggest me some other alternative.
Thanks in advance.
load_data() is not a part of Python itself; it is defined in the keras.datasets.cifar10 module. To load the CIFAR dataset (or any other dataset), there may be many methods depending on how the dataset is packaged/formatted. Usually, the pandas module can be used for loading/saving/manipulating table-like data.
For cifar data, here is another example: loading an image from cifar-10 dataset
Here the author is using the pickle module to unpack the dataset and then the PIL and numpy modules to load and manipulate individual images.
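A minimal sketch of the pickle approach, assuming the "python version" layout of a CIFAR-10 batch file: a pickled dict whose b"data" entry holds one flattened 3072-byte image per row and whose b"labels" entry holds the class indices. The helper name load_cifar_batch and the tiny synthetic batch file are made up for this illustration, so that the example runs without downloading anything.

```python
import os
import pickle
import tempfile

import numpy as np


def load_cifar_batch(path):
    """Load one CIFAR-10-style batch file into uint8 arrays (hypothetical helper)."""
    with open(path, "rb") as f:
        # encoding="bytes" keeps the dict keys as bytes, e.g. b"data".
        batch = pickle.load(f, encoding="bytes")
    images = np.asarray(batch[b"data"], dtype=np.uint8)
    # Each row stores the red, green and blue planes one after another;
    # reshape to (N, 3, 32, 32), then move channels last: (N, 32, 32, 3).
    images = images.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    labels = np.asarray(batch[b"labels"], dtype=np.uint8)
    return images, labels


# Synthetic stand-in for a real batch file: 4 all-black images.
fake_batch = {b"data": np.zeros((4, 3072), dtype=np.uint8), b"labels": [0, 1, 2, 3]}
batch_path = os.path.join(tempfile.mkdtemp(), "fake_batch")
with open(batch_path, "wb") as f:
    pickle.dump(fake_batch, f)

x, y = load_cifar_batch(batch_path)
print(x.shape, x.dtype, y)
```

With a real download, batch_path would point at one of the extracted batch files instead of the synthetic one.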

Training Keras model with Dask Array is very slow

I want to use Dask to read a large dataset and feed it to a Keras model. The data consists of audio files and I am using a custom function to read them. I have tried to apply delayed to this function and I collect all of the files in a dask array, as:
x = da.stack([da.from_delayed(delayed(get_item_data)(fp, sr, mono, post_processing, data_shape), shape=data_shape, dtype=np.float32) for fp in df['path']])
(See the source)
To train the Keras model, I compute X and Y as above and I input them to the function fit.
However, the training is very slow. I have tried changing the chunksize and it is still very slow.
Could you tell me if I am doing something wrong when creating the array? Or any good practices for it?
Thanks
As far as I know, Keras doesn't have any built-in support for Dask arrays, so I'm not sure what will happen when you provide a dask.array directly to Keras functions. My guess is that it will automatically convert the dask.array into a (possibly very large) numpy array.
