How to output CoordinateMatrix in tabular format? - apache-spark

I need to produce an output table of a subset of movielens rating data. I have converted my dataframe to a CoordinateMatrix:
from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix
mat = CoordinateMatrix(ratings.map(
lambda r: MatrixEntry(r.user, r.product, r.rating)))
However, I can't see how I can print the output in a tabular format. I can print the entries:
mat.entries.collect()
Which outputs:
[MatrixEntry(1, 1, 5.0),
MatrixEntry(5, 6, 2.0),
MatrixEntry(6, 1, 4.0),
MatrixEntry(7, 6, 4.0),
MatrixEntry(8, 1, 4.0),
MatrixEntry(8, 4, 3.0),
MatrixEntry(9, 1, 5.0)]
However, I'm looking to output:
1 2 3 4 5 6 7 8 9
------------------------------------- ...
1 | 5
2 |
3 |
4 |
5 | 2
...
Update
The pandas equivalent is pivot_table, e.g.
import pandas as pd
import numpy as np
import os
import requests
import zipfile
np.set_printoptions(precision=4)
filename = 'ml-1m.zip'
if not os.path.exists(filename):
r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
if r.status_code == 200:
with open(filename, 'wb') as f:
for chunk in r:
f.write(chunk)
else:
raise 'Could not save dataset'
zip_ref = zipfile.ZipFile('ml-1m.zip', 'r')
zip_ref.extractall('.')
zip_ref.close()
ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')
ratingsMatrix = ratings.pivot_table(columns=['movieId'], index =['userId'], values='rating', dropna = False)
ratingsMatrix = ratingsMatrix.fillna(0)
# we don't have space to print the full matrix, just show the first few cells
print(ratingsMatrix.ix[:9, :9])
Which outputs:
movieId 1 2 3 4 5 6 7 8 9
userId
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
6 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0
8 4.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Related

Calculate mutual information in columns

I have been set a sample exercise by my teacher. It is to reduce dimensionality by writing a function that uses sklearn(mutual information).I am not that good in it but I tried many ways. Its not giving me any reliable answer even. I am unable to find out the mistake.
The data consists of 19 columns that i got with one hot encoding. And i named it as dummy. whenever i run the code it does not give me any output. neither error nor result.
first i am not sure what to set the threshold.
2nd how to call the mutual information source from sklearn and iterate every column in a pair, to drop one out of the highly correlated columns pair.
Address_A Address_B Address_C Address_D Address_E Address_F Address_G Address_H DoW_0 DoW_1 DoW_2 DoW_3 DoW_4 DoW_5 DoW_6 Month_1 Month_11 Month_12 Month_2
0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
252199 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252200 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252201 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252202 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
252203 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
from sklearn.metrics import mutual_info_score
def reduce_dimentionality(dummy, threshold):
df_cols = dummy[['Address_A','Address_B','Address_C','Address_D','Address_E','Address_F','Address_G','Address_H',
'DoW_0','DoW_1','DoW_2','DoW_3','DoW_4','DoW_5','DoW_6','Month_1','Month_11','Month_12','Month_2']]
to_remove = []
for col_ix, Address_A in enumerate(df_cols):
for address_B in df_cols:
calc_MI=sklearn.metrics.mutual_info_score
mu_info = calc_MI(dummy['Address_A'],dummy['Address_B'], bins=20)
if mu_info <1:
d=to_remove.append(Address_A)
new_data_frame = pd.DataFrame.drop(d)
return new_data_frame

ValueError with MinMaxScaler inverse_transform

I am trying to fit an LSTM network to a dataset.
I have the following dataset:
0 17.6 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 38.2 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 39.4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
3 38.7 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
4 39.7 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17539 56.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17540 51.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17541 46.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17542 44.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17543 40.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0
27 28 29 30 31 32 33
0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1 0.0 0.0 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 0.0 1.0 0.0 0.0 1.0 0.0
4 0.0 0.0 1.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ...
17539 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17540 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17541 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17542 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17543 0.0 0.0 0.0 0.0 1.0 0.0 1.0
with shape:
[17544 rows x 34 columns]
Then I scale it with MinMaxScaler as follows:
scaler = MinMaxScaler(feature_range=(0,1))
data = scaler.fit_transform(data)
Then I am using a function to create my train, test dataset with shapes:
X_train : (12232, 24, 34)
Y_train : (12232, 24)
X_test : (1708, 24, 34)
Y_test : (1708, 24)
After I fit the model and I predict the values for the test set, I need to scale back to the original values and I do the following:
test_predict = model.predict(X_test)
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform(Y_test)
But I am getting the following error:
ValueError: operands could not be broadcast together with shapes (1708,24) (34,) (1708,24)
How can I resolve it?
The inverse transformation expects the data in the same shape with the one produced after the transform, i.e with 34 columns. This is not the case with your test_predict, neither with your y_test.
Additionally, although irrelevant to your error, you are committing the mistake of scaling first and splitting to train/test afterwards, which is not the correct methodology as it leads to data leakage.
Here are the necessary steps to resolve this:
Split first to train & test sets
Transform your X_train and y_train using two different scalers for the features and output respectively, as I show in this answer of mine; you should use .fit_transform here.
Fit your model with the transformed X_train and y_train (side note: it is good practice to use different names for different versions of the data, instead of overwriting the existing ones).
To evaluate your model with the test data X_test & y_test, first transform them using the respective scalers from step #2; you should use .transform here (not .fit_transform again).
In order to get your predictions y_pred back to the scale of your original y_test, you should use .inverse_transform of the respective scaler on them. There is of course no need to inverse transform your transformed X_test and y_test - you already have these values!

How to handle correctly sparse features to avoid poor performance of classification neural network?

I'm trying to understand how sparse neural networks work. I have a very sparse data of about 40k rows for two classes. The dataset looks like this:
RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 RB0 RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9
50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
53 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
54 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
55 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
62 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
63 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
As you can see, some rows have only 0's on it. The columns with name RA are the features of a class 0 and the columns with name RB are the features of class 1, so the same dataset with the actual labels looks like this:
RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 ... RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9 label
50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
53 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
54 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
55 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
58 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
I did a simple neural network model using Keras, but the model isn't learning and accuracy rarely goes beyond 52% on train dataset. I tried two variations of the same model:
Variation 1:
def build_nn(n_features,lr = 0.001):
_input = Input(shape = (n_features,),name = 'input',sparse = True)
x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(_input)
x = Dropout(0.5)(x)
x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
x = Dropout(0.5)(x)
x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
nn = Model(inputs = [_input],outputs = [x])
nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
return nn
Variation 2:
def build_nn(feature_layer,lr = 0.001):
feature_inputs = {}
for feature in feature_layer:
feature_inputs[feature.key] = Input(shape = (1,),name = feature.key)
feature_layer = tf.keras.layers.DenseFeatures(feature_layer)
feature_inputs_n = feature_layer(feature_inputs)
x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(feature_inputs_n)
x = Dropout(0.5)(x)
x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
x = Dropout(0.5)(x)
x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
nn = Model(inputs = [v for v in feature_inputs.values()],outputs = [x])
nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
return nn
The motivation behind doing the variation 2 is because the features are sparse and I thought that this could have an impact on the model's performance, so I followed this tensorflow guide.
Also, the labels are converted to a categorical label using to_categorical function, provided by the keras api:
y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)
My questions are:
Is my model wrong (especially the variation 2) or if I'm doing the wrong representation of the sparse features and how this features should be handled?
The RA and RB are the features of two different classes and since there are rows full of 0, should I add a third class representing an unknown class or remove the rows that contains only 0?
Since RA and RB map two different classes, should I do two separate model, one for columns RA and class 0 and the other for columns RB and class 1?
I'm also posting an image of the train/test model's accuracy:
I can also provide any other part of the code if needed.
EDIT:
I didn't put this part because I felt it doesn't has a relation to what I was asking, but it seems I was wrong.
Each feature is an individual branch from a sklearn decision tree. The class that the decision tree looks for is an up or down for the next candle in a trading enviroment (a candle is a price aggregation of an instrument in time that has an open, low, high and close price). Then, the idea is to grab those branches, that are valuated in the price time series, and evaluate if the condition is met, so if the branch is active the value is 1.
For example, branch RA0 at index 55 is active, so the value is 1. The labels are calculated as np.sign(close - open). So, the idea is that by using multiple branches the classification of the label can be improved, by having a neural network that can see if which branch is active and which one has more weight in order to make a classification.
The use of sparse_categorical_crossentropy is wrong here; the sparsity in sparse_categorical_crossentropy refers to the label representation, and not to the features. Since you are using one-hot encoded labels:
y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)
and a final layer of 2 nodes with activation = 'softmax' (which I take it to mean that you have only 2 classes), you should switch to loss='categorical_crossentropy' irrespectively of the sparsity in your features.
Other general remarks:
Remove dropout, which should never be used by default. Dropout is used to help against overfitting if such a thing is detected; used uncritically (even worse, with such high values), it is well-known to prevent training altogether (i.e. something very similar to what you report here).
Remove kernel_initializer = 'he_uniform' from all layers, thus leaving the default glorot_uniform one (useful hint: default values are there for a reason, and it is not advisable to play with them unless you have a specific reason to do so and you know exactly what you are doing).

unpivot a dataframe with many columns into one with only 3

If I have a Dataframe which looks like:
clientid CLNT1 CLNT2 CLNT3 CLNT4 ... CLNTN
tradedate ...
2019-07-01 0.0 0.0 0.0 0.0 ... 12.0
2019-07-02 0.0 0.0 0.0 0.0 ... 0.0
2019-07-03 0.0 0.0 0.0 0.0 ... 0.0
2019-07-05 0.0 0.0 0.0 0.0 ... 0.0
2019-07-08 0.0 0.0 0.0 0.0 ... 0.0
... ... ... ... ... ... ...
2020-01-31 0.0 0.0 0.0 0.0 ... 0.0
2020-02-03 0.0 0.0 0.0 0.0 ... 0.0
2020-02-04 0.0 0.0 0.0 0.0 ... 0.0
2020-02-05 0.0 0.0 0.0 0.0 ... 0.0
2020-02-06 0.0 0.0 0.0 0.0 ... 0.0
How can I collapse it into something like:
clientid count
tradedate
2019-07-01 CLNT1 0.0
2019-07-01 CLNT2 0.0
2019-07-01 CLNT3 0.0
2019-07-01 CLNT4 0.0
... ... ...
2019-07-01 CLNTN 12.0
Apologies if this has been answered already. Rather new to pandas...

How to substitute None values in python pandas

I'm trying to work with a dataset that has None values:
My uploading code is the following:
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c = pd.DataFrame(s_rows_cols, columns = header_row)
and
the output from c is :
But it seems that there are some columns that has None values.
How do I replace this None values by zeros?
Thanks
I think it is not necessary, if use read_csv with sep=\s+ for whitespace separator and also parameter names for specify new columns names:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
cols = ['age','sex','chestpain','restBP','chol','sugar','ecg',
'maxhr','angina','dep','exercise','fluor','thal','diagnosis']
df = pd.read_csv(url, sep='\s+', names=cols)
print (df)
age sex chestpain restBP chol sugar ecg maxhr angina dep \
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2
.. ... ... ... ... ... ... ... ... ... ...
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
exercise fluor thal diagnosis
0 2.0 3.0 3.0 2
1 2.0 0.0 7.0 1
2 1.0 0.0 7.0 2
3 2.0 1.0 7.0 1
4 1.0 1.0 3.0 1
.. ... ... ... ...
265 1.0 0.0 7.0 1
266 1.0 0.0 7.0 1
267 2.0 0.0 3.0 1
268 2.0 0.0 6.0 1
269 2.0 3.0 3.0 2
[270 rows x 14 columns]
Then in data are not Nones and no missing values:
print (df.isna().any(1).any())
False
EDIT:
If need replace missing values or Nones to scalar use fillna:
c = c.fillna(0)

Resources