I'm trying to work with a dataset that has None values:
The code I use to load it is the following:
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
c = pd.DataFrame(s_rows_cols, columns = header_row)
The output of c shows that some columns have None values.
How do I replace these None values with zeros?
Thanks
I think that is not necessary if you use read_csv with sep=r'\s+' (a regex for whitespace separators) and the names parameter to specify the new column names:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
cols = ['age','sex','chestpain','restBP','chol','sugar','ecg',
'maxhr','angina','dep','exercise','fluor','thal','diagnosis']
df = pd.read_csv(url, sep=r'\s+', names=cols)
print (df)
age sex chestpain restBP chol sugar ecg maxhr angina dep \
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2
.. ... ... ... ... ... ... ... ... ... ...
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
exercise fluor thal diagnosis
0 2.0 3.0 3.0 2
1 2.0 0.0 7.0 1
2 1.0 0.0 7.0 2
3 2.0 1.0 7.0 1
4 1.0 1.0 3.0 1
.. ... ... ... ...
265 1.0 0.0 7.0 1
266 1.0 0.0 7.0 1
267 2.0 0.0 3.0 1
268 2.0 0.0 6.0 1
269 2.0 3.0 3.0 2
[270 rows x 14 columns]
Then there are no Nones or missing values in the data:
print (df.isna().any(axis=1).any())
False
EDIT:
If you need to replace missing values or Nones with a scalar, use fillna:
c = c.fillna(0)
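For context, the Nones in the original approach most likely come from the trailing empty string produced by s.split('\n'): an empty line splits into an empty list, and pandas pads short rows with None. A quick sketch of the effect on toy data (not the heart dataset):

import pandas as pd

# a short (empty) row, as produced by splitting a trailing newline
rows = [['1', '2', '3'], []]
df = pd.DataFrame(rows, columns=['a', 'b', 'c'])
print (df)
#       a     b     c
# 0     1     2     3
# 1  None  None  None
df = df.fillna(0)  # replaces the Nones with 0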
Related
I am trying to fit an LSTM network to the following dataset:
0 17.6 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 38.2 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 39.4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
3 38.7 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
4 39.7 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17539 56.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17540 51.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17541 46.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17542 44.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
17543 40.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0
27 28 29 30 31 32 33
0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1 0.0 0.0 1.0 0.0 0.0 1.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 0.0 1.0 0.0 0.0 1.0 0.0
4 0.0 0.0 1.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ...
17539 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17540 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17541 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17542 0.0 0.0 0.0 0.0 1.0 0.0 1.0
17543 0.0 0.0 0.0 0.0 1.0 0.0 1.0
with shape:
[17544 rows x 34 columns]
Then I scale it with MinMaxScaler as follows:
scaler = MinMaxScaler(feature_range=(0,1))
data = scaler.fit_transform(data)
Then I use a function to create my train and test datasets, with shapes:
X_train : (12232, 24, 34)
Y_train : (12232, 24)
X_test : (1708, 24, 34)
Y_test : (1708, 24)
After fitting the model and predicting the values for the test set, I need to scale back to the original values, so I do the following:
test_predict = model.predict(X_test)
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform(Y_test)
But I am getting the following error:
ValueError: operands could not be broadcast together with shapes (1708,24) (34,) (1708,24)
How can I resolve it?
The inverse transformation expects data in the same shape as the one produced by the transform, i.e. with 34 columns. This is not the case for your test_predict, nor for your Y_test.
Additionally, although irrelevant to this error, you are making the mistake of scaling first and splitting into train/test afterwards, which is not the correct methodology, as it leads to data leakage.
Here are the necessary steps to resolve this (a minimal sketch follows the list):
1. Split into train and test sets first.
2. Transform your X_train and y_train using two different scalers for the features and the output respectively, as I show in this answer of mine; use .fit_transform here.
3. Fit your model with the transformed X_train and y_train (side note: it is good practice to use different names for different versions of the data, instead of overwriting the existing ones).
4. To evaluate your model on the test data X_test and y_test, first transform them using the respective scalers from step 2; use .transform here (not .fit_transform again).
5. To get your predictions y_pred back to the scale of the original y_test, use the .inverse_transform of the respective scaler on them. There is of course no need to inverse-transform the transformed X_test and y_test - you already have these values!
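A minimal sketch of this workflow, using synthetic 2D data for illustration (adapt the windowing/reshaping to your actual LSTM inputs):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(1000, 34)  # synthetic features
y = np.random.rand(1000, 1)   # synthetic target

# 1. split first
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

# 2. fit the scalers on the training data only
x_scaler = MinMaxScaler(feature_range=(0, 1))
y_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_s = x_scaler.fit_transform(X_train)
y_train_s = y_scaler.fit_transform(y_train)

# 3. fit the model on X_train_s / y_train_s ...

# 4. transform (not fit_transform) the test data with the same scalers
X_test_s = x_scaler.transform(X_test)

# 5. inverse-transform only the predictions, with the y scaler
# y_pred = y_scaler.inverse_transform(model.predict(X_test_s))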
I have two DataFrames that I want to multiply together without looping through.
For each row of the first table below, I want to take each of the arm angles
sample pos arm1_angle arm2_angle arm3_angle arm4_angle
0 0 0.000000 0.000000 0.250000 0.500000 0.750000
1 1 0.134438 0.134438 0.384438 0.634438 0.884438
2 2 0.838681 0.838681 0.088681 0.338681 0.588681
3 3 1.755019 0.755019 0.005019 0.255019 0.505019
4 4 3.007274 0.007274 0.257274 0.507274 0.757274
5 5 4.186825 0.186825 0.436825 0.686825 0.936825
6 6 3.455513 0.455513 0.705513 0.955513 0.205513
7 7 4.916564 0.916564 0.166564 0.416564 0.666564
8 8 2.876257 0.876257 0.126257 0.376257 0.626257
9 9 2.549585 0.549585 0.799585 0.049585 0.299585
10 10 1.034488 0.034488 0.284488 0.534488 0.784488
and multiply the entire second table by them, then concatenate. For example, if there are 10k rows above, the result will be 10k x 27 = 270,000 rows.
So for index 0, multiply the entire table below by 0 for arm1, 0.25 for arm2, 0.5 for arm3, and 0.75 for arm4.
I can easily loop through, multiply, and concatenate. Is there a more efficient way?
id radius bag_count arm1 arm2 arm3 arm4
0 1 0.440 4 1.0 0.0 0.0 0.0
1 2 0.562 8 0.0 1.0 0.0 0.0
2 3 0.666 12 0.0 0.0 1.0 0.0
3 4 0.818 16 1.0 0.0 0.0 0.0
4 5 0.912 16 0.0 1.0 0.0 0.0
5 6 1.022 20 0.0 0.0 1.0 0.0
6 7 1.120 24 1.0 0.0 0.0 0.0
7 8 1.220 28 0.0 1.0 0.0 0.0
8 9 1.350 32 0.0 0.0 1.0 0.0
9 10 1.460 36 1.0 0.0 1.0 0.0
10 11 1.570 40 0.0 1.0 0.0 1.0
11 12 1.680 44 1.0 0.0 1.0 0.0
12 13 1.800 44 0.0 1.0 0.0 1.0
13 14 1.920 48 1.0 0.0 1.0 0.0
14 15 2.030 52 0.0 1.0 0.0 1.0
15 16 2.140 56 1.0 0.0 1.0 0.0
16 17 2.250 60 0.0 1.0 1.0 1.0
17 18 2.360 64 1.0 0.0 1.0 1.0
18 19 2.470 68 1.0 1.0 0.0 1.0
19 20 2.580 72 1.0 1.0 1.0 0.0
20 21 2.700 72 0.0 1.0 1.0 1.0
21 22 2.810 76 1.0 0.0 1.0 1.0
22 23 2.940 80 1.0 1.0 0.0 1.0
23 24 3.060 84 1.0 1.0 1.0 0.0
24 25 3.180 88 1.0 1.0 1.0 1.0
25 26 3.300 92 1.0 1.0 1.0 1.0
26 27 3.420 96 1.0 1.0 1.0 1.0
Use a cross join to combine all rows, then select the arm columns and multiply:
# arm indicator columns from df2 and arm angle columns from df1
df22 = df2.filter(like='arm')
cols = df1.filter(like='arm').columns
# cartesian product: every row of df1 paired with every row of df22
df = df1.merge(df22, how='cross')
# multiply the angle columns by the matching indicator columns, then drop them
df[cols] = df[cols].mul(df[df22.columns].to_numpy())
df = df.drop(df22.columns, axis=1)
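A small self-contained demo of the same pattern on toy frames (how='cross' requires pandas 1.2+):

import pandas as pd

df1 = pd.DataFrame({'pos': [0, 1],
                    'arm1_angle': [0.00, 0.25],
                    'arm2_angle': [0.50, 0.75]})
df2 = pd.DataFrame({'radius': [0.44, 0.56],
                    'arm1': [1.0, 0.0],
                    'arm2': [0.0, 1.0]})

df22 = df2.filter(like='arm')
cols = df1.filter(like='arm').columns
out = df1.merge(df22, how='cross')  # 2 x 2 = 4 rows
out[cols] = out[cols].mul(out[df22.columns].to_numpy())
out = out.drop(df22.columns, axis=1)
print (out)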
My credit_scoring.csv looks like the sample below. How can I load it in an organised way, with 14 columns where each column gets its corresponding value?
Seniority;Home;Time;Age;Marital;Records;Job;Expenses;Income;Assets;Debt;Amount;Price;Status
0 9.0;1.0;60.0;30.0;0.0;1.0;1.0;73.0;129.0;0.0;0...
1 17.0;1.0;60.0;58.0;1.0;1.0;0.0;48.0;131.0;0.0;...
2 10.0;0.0;36.0;46.0;0.0;2.0;1.0;90.0;200.0;3000...
3 0.0;1.0;60.0;24.0;1.0;1.0;0.0;63.0;182.0;2500....
4 0.0;1.0;36.0;26.0;1.0;1.0;0.0;46.0;107.0;0.0;0...
. .................................................
. .................................................
. .................................................
. .................................................
You can simply use read_csv() with sep=';'.
Your example data isn't great, but I made the most of it.
I saved it as a.csv; here is the code:
In [1]: import pandas as pd
In [2]: pd.read_csv('a.csv', sep=';')
Out[2]:
Seniority Home Time Age Marital Records Job Expenses Income Assets Debt Amount Price Status
0 9.0 1.0 60.0 30.0 0.0 1.0 1.0 73.0 129.0 0.0 0.0 NaN NaN NaN
1 17.0 1.0 60.0 58.0 1.0 1.0 0.0 48.0 131.0 0.0 NaN NaN NaN NaN
2 10.0 0.0 36.0 46.0 0.0 2.0 1.0 90.0 200.0 3000.0 NaN NaN NaN NaN
3 0.0 1.0 60.0 24.0 1.0 1.0 0.0 63.0 182.0 2500.0 NaN NaN NaN NaN
4 0.0 1.0 36.0 26.0 1.0 1.0 0.0 46.0 107.0 0.0 0.0 NaN NaN NaN
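If you want to test this without a file on disk, the same call works on an inline string (a sketch with a shortened header):

import io
import pandas as pd

data = """Seniority;Home;Time;Age
9.0;1.0;60.0;30.0
17.0;1.0;60.0;58.0"""

df = pd.read_csv(io.StringIO(data), sep=';')
print (df)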
I am trying to derive new columns a and b from the following DataFrame:
a_x b_x a_y b_y
0 13.67 0.0 13.67 0.0
1 13.42 0.0 13.42 0.0
2 13.52 1.0 13.17 1.0
3 13.61 1.0 13.11 1.0
4 12.68 1.0 13.06 1.0
5 12.70 1.0 12.93 1.0
6 13.60 1.0 NaN NaN
7 12.89 1.0 NaN NaN
8 11.68 1.0 NaN NaN
9 NaN NaN 8.87 0.0
10 NaN NaN 8.77 0.0
11 NaN NaN 7.97 0.0
If b_x or b_y is 0.0 (in this case they have the same value when both exist), then a_x and a_y share the same value, so I take either of them for the new columns a and b; if b_x or b_y is 1.0, the values differ, so I take the mean of a_x and a_y as the value of a, and either b_x or b_y as b.
If only one of the pairs a_x, b_x or a_y, b_y is non-null, I take the existing values as a and b.
My expected results will like this:
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0
1 13.42 0.0 13.42 0.0 13.420 0
2 13.52 1.0 13.17 1.0 13.345 1
3 13.61 1.0 13.11 1.0 13.360 1
4 12.68 1.0 13.06 1.0 12.870 1
5 12.70 1.0 12.93 1.0 12.815 1
6 13.60 1.0 NaN NaN 13.600 1
7 12.89 1.0 NaN NaN 12.890 1
8 11.68 1.0 NaN NaN 11.680 1
9 NaN NaN 8.87 0.0 8.870 0
10 NaN NaN 8.77 0.0 8.770 0
11 NaN NaN 7.97 0.0 7.970 0
How can I get the result above? Thank you.
Use:
import numpy as np
import pandas as pd

#filter all a and b columns
b = df.filter(like='b')
a = df.filter(like='a')
#test if at least one 0 or 1 value
m1 = b.eq(0).any(axis=1)
m2 = b.eq(1).any(axis=1)
#get means of a columns
a1 = a.mean(axis=1)
#forward fill missing values and select the last column
b1 = b.ffill(axis=1).iloc[:, -1]
a2 = a.ffill(axis=1).iloc[:, -1]
#new Dataframe with 2 conditions
df1 = pd.DataFrame(np.select([m1, m2], [[a2, b1], [a1, b1]]), index=['a','b']).T
#join to original
df = df.join(df1)
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
But I think the solution can be simplified, because the mean can be used for both conditions (the mean of identical values is the same as the first value), which makes the masks and np.select unnecessary:
b = df.filter(like='b')
a = df.filter(like='a')
# the mean covers both cases: identical values average to themselves
df['a'] = a.mean(axis=1)
# forward fill across columns and take the last one to get the non-NaN value
df['b'] = b.ffill(axis=1).iloc[:, -1]
print (df)
a_x b_x a_y b_y a b
0 13.67 0.0 13.67 0.0 13.670 0.0
1 13.42 0.0 13.42 0.0 13.420 0.0
2 13.52 1.0 13.17 1.0 13.345 1.0
3 13.61 1.0 13.11 1.0 13.360 1.0
4 12.68 1.0 13.06 1.0 12.870 1.0
5 12.70 1.0 12.93 1.0 12.815 1.0
6 13.60 1.0 NaN NaN 13.600 1.0
7 12.89 1.0 NaN NaN 12.890 1.0
8 11.68 1.0 NaN NaN 11.680 1.0
9 NaN NaN 8.87 0.0 8.870 0.0
10 NaN NaN 8.77 0.0 8.770 0.0
11 NaN NaN 7.97 0.0 7.970 0.0
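For reference, here is a toy frame with the same structure (a few rows from the sample above), so the snippet can be run end to end:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a_x': [13.67, 13.52, 13.60, np.nan],
                   'b_x': [0.0, 1.0, 1.0, np.nan],
                   'a_y': [13.67, 13.17, np.nan, 8.87],
                   'b_y': [0.0, 1.0, np.nan, 0.0]})

df['a'] = df.filter(like='a').mean(axis=1)
df['b'] = df.filter(like='b').ffill(axis=1).iloc[:, -1]
print (df)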
I need to produce an output table of a subset of the MovieLens rating data. I have converted my DataFrame to a CoordinateMatrix:
from pyspark.mllib.linalg.distributed import MatrixEntry, CoordinateMatrix
mat = CoordinateMatrix(ratings.map(
lambda r: MatrixEntry(r.user, r.product, r.rating)))
However, I can't see how I can print the output in a tabular format. I can print the entries:
mat.entries.collect()
Which outputs:
[MatrixEntry(1, 1, 5.0),
MatrixEntry(5, 6, 2.0),
MatrixEntry(6, 1, 4.0),
MatrixEntry(7, 6, 4.0),
MatrixEntry(8, 1, 4.0),
MatrixEntry(8, 4, 3.0),
MatrixEntry(9, 1, 5.0)]
However, I'm looking to output:
1 2 3 4 5 6 7 8 9
------------------------------------- ...
1 | 5
2 |
3 |
4 |
5 | 2
...
Update
The pandas equivalent is pivot_table, e.g.
import pandas as pd
import numpy as np
import os
import requests
import zipfile
np.set_printoptions(precision=4)
filename = 'ml-1m.zip'
if not os.path.exists(filename):
    r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip', stream=True)
    if r.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    else:
        raise RuntimeError('Could not save dataset')

zip_ref = zipfile.ZipFile('ml-1m.zip', 'r')
zip_ref.extractall('.')
zip_ref.close()
ratingsNames = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_table("./ml-1m/ratings.dat", header=None, sep="::", names=ratingsNames, engine='python')
ratingsMatrix = ratings.pivot_table(columns=['movieId'], index =['userId'], values='rating', dropna = False)
ratingsMatrix = ratingsMatrix.fillna(0)
# we don't have space to print the full matrix, just show the first few cells
print(ratingsMatrix.iloc[:9, :9])
Which outputs:
movieId 1 2 3 4 5 6 7 8 9
userId
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
6 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0
8 4.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
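In Spark itself, one way to get a similar pivot is to turn the matrix entries into a DataFrame and use groupBy/pivot (a sketch, assuming an active SparkSession and the mat from above; pivoting over a very wide set of movie ids can be expensive):

df = mat.entries.map(lambda e: (e.i, e.j, e.value)).toDF(["user", "product", "rating"])
pivoted = df.groupBy("user").pivot("product").sum("rating").na.fill(0)
pivoted.orderBy("user").show()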