Using Spark DataFrame directly in Keras (databricks) - apache-spark

I have some text that I am looking to classify with Keras. I have created a pipeline that takes the text, applies some transformations, and eventually one-hot encodes it.
Now I want to pass that one-hot encoded column, along with the label column, directly into Keras on Databricks, but I cannot seem to do it. All of the examples I see seem to start with a pandas dataframe and then convert it to a numpy array. It seems counterproductive to take my pyspark dataframe and convert it.
model.fit(trainingData.select('featuresFirst'), trainingData.select('label'))
gives me:
AttributeError: 'DataFrame' object has no attribute 'values'
model.fit(trainingData.select('featuresFirst').collect(), trainingData.select('label').collect())
gives me:
AttributeError: ndim
What am I missing here?
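A hedged sketch of the usual workaround, not a confirmed answer: Keras' model.fit expects numpy arrays, so one option is to collect the columns to the driver and convert the pyspark.ml vectors with toArray(), assuming featuresFirst is a vector column and the data fits in driver memory.
import numpy as np
# column names taken from the question; trainingData and model as defined there
rows = trainingData.select('featuresFirst', 'label').collect()
X = np.array([row['featuresFirst'].toArray() for row in rows])  # ML Vector -> dense numpy array
y = np.array([row['label'] for row in rows])
model.fit(X, y)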

Related

Same sklearn pipeline different results

I have created a pipeline based on:
A custom TfidfVectorizer that transforms the text into a TF-IDF dataframe (600 features)
A custom feature generator that creates new features (5)
A FeatureUnion to join the two dataframes; I checked that the output is an array, so there are no feature names (605 features)
An XGBoost classifier model with seed and random state set (8 classes as label names)
If I fit and use the pipeline in a Jupyter notebook, I obtain good F1 scores.
However, when I save it (using pickle, joblib or dill) and later load it in another notebook or script, I cannot always reproduce the results! I cannot understand it, because the input for testing is always the same, and so is the Python environment!
Could you help me with some suggestions?
Thanks!
Things I have tried:
Saving the pipeline with different libraries.
Adding a DenseTransformer at some points.
Using ColumnTransformer instead of FeatureUnion.
I cannot use the pmml library due to some restrictions.
Etc.
The problem remains the same.
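A minimal sketch of the save/load round trip being described, on toy data rather than the asker's actual pipeline; with every random_state pinned and the test input unchanged, predictions before and after serialization should match.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
# toy stand-ins for the real text and labels
X = ['red apple', 'green pear', 'blue car', 'fast train'] * 5
y = [0, 1, 0, 1] * 5
# simplified pipeline (no custom feature generator or FeatureUnion here)
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', XGBClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)
joblib.dump(pipe, 'pipeline.joblib')
loaded = joblib.load('pipeline.joblib')
# these should be identical; if they differ, something in the pipeline is still non-deterministic
assert (pipe.predict(X) == loaded.predict(X)).all()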

How to slice Kinetics400 training dataset? (pytorch)

I am trying to run the official script for video classification.
I want to tweak some functions, and running through all the examples would cost me too much time.
I wonder how I can slice the Kinetics training dataset based on that script.
This is the code I added before
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)
in the script (let's say I just want to run 100 examples):
tr_split_len = 100
dataset = torch.utils.data.random_split(dataset, [tr_split_len, len(dataset)-tr_split_len])[0]
Then, when hitting train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video), it throws the error:
AttributeError: 'Subset' object has no attribute 'video_clips'
So the type of the dataset changes from torchvision.datasets.kinetics.Kinetics400 to torch.utils.data.dataset.Subset.
I understand that. So how can I do it (hopefully not by using break in the dataloader loop)?
Thanks.
It seems that torchvision.datasets.kinetics.Kinetics400 internally uses an object of class VideoClips to store the information about the clips. It is stored in the member variable Kinetics400().video_clips.
The VideoClips class has a method called subset that takes a list of indices and returns a new VideoClips object containing only the clips with the specified indices. You could then just replace the old VideoClips object with the new one in your dataset, as sketched below.
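A rough sketch of that idea, assuming the dataset from the script has already been constructed (this replaces the random_split approach above):
# keep only the first 100 clips by swapping in a reduced VideoClips object
tr_split_len = 100
dataset.video_clips = dataset.video_clips.subset(list(range(tr_split_len)))
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)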

Separate tensorflow dataset to different outputs in tensorflow2

I have a dataset with 3 tensor outputs of data, label and path:
import tensorflow as tf  # tensorflow version 2.1
data=tf.constant([[0,1],[1,2],[2,3],[3,4],[4,5],[5,6],[6,7],[7,8],[8,9],[9,0]],name='data')
labels=tf.constant([0,1,0,1,0,1,0,1,0,1],name='label')
path=tf.constant(['p0','p1','p2','p3','p4','p5','p6','p7','p8','p9'],name='path')
my_dataset=tf.data.Dataset.from_tensor_slices((data,labels,path))
I want to separate my_dataset back into 3 datasets of data, labels and paths (or 3 tensors) without iterating over it and without converting it to numpy.
In tensorflow 1.x this was done simply with
d,l,p=my_dataset.make_one_shot_iterator().get_next()
and then converting the tensors to datasets. How can I do this in tensorflow 2?
Thanks!
The solution I found does not look very "pythonic" but it works.
I used the map() method:
data= my_dataset.map(lambda x,y,z:x)
labels= my_dataset.map(lambda x,y,z:y)
paths= my_dataset.map(lambda x,y,z:z)
After this separation, the order of the labels stays the same.
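For example, to sanity-check the split (iterating only for inspection, not as part of the separation):
for d, l, p in zip(data.take(3), labels.take(3), paths.take(3)):
    print(d.numpy(), l.numpy(), p.numpy())
# [0 1] 0 b'p0'
# [1 2] 1 b'p1'
# [2 3] 0 b'p2'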

I want to understand Python prediction model

I am trying to code a predictive model and I found this code somewhere and wanted to know what it means, please. Here it is: "X_train.reset_index(inplace = True)"
I think it would help if you provided more context to your question. But in the meanwhile, the line of code you have shown is resetting the index of the training dataset of whatever model you are working with (usually X denotes the data and y denotes the labels).
The dataset is a pandas DataFrame object, and the reset_index method replaces the existing index with a default integer index (0, 1, 2, ...), moving the old index into a regular column; with inplace=True the DataFrame is modified in place. You can find more information in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
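A small illustrative example of what that call does (the values here are made up):
import pandas as pd
df = pd.DataFrame({'value': [10, 20, 30]}, index=['a', 'b', 'c'])
df.reset_index(inplace=True)  # the old index becomes an 'index' column; rows are renumbered 0, 1, 2
print(df)
#   index  value
# 0     a     10
# 1     b     20
# 2     c     30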

Using to_pickle for very large dataframes

I have a very large dataframe with shape (16 million, 147). I am trying to persist it using to_pickle, but the program neither serializes the dataframe nor throws any exception. Does anyone have any idea about this? Alternatively, suggest another format for persisting the dataframe. I have tried HDF5, but I have mixed-type object columns, so I cannot use that format.
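A hedged suggestion, not from the original thread: make sure a recent pickle protocol is used (protocol 4 or higher is required for objects over 4 GB), or try parquet as an alternative format; the file names and the cast of object columns below are illustrative.
import pickle
# df is the 16-million-row dataframe from the question
df.to_pickle('big_frame.pkl', protocol=pickle.HIGHEST_PROTOCOL)
# alternative: parquet (needs pyarrow or fastparquet); mixed-type object columns
# may first need to be cast to a single type, e.g. str
df.astype({col: str for col in df.select_dtypes('object').columns}).to_parquet('big_frame.parquet')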
