Divide pandas data frame into test and train based on unique ID - python-3.x

I want to split it into two data frames (train and test) using the values in the id column. The split should be such that the first data frame has 70% of the (unique) ids and the second data frame has the remaining 30%. The ids should be split randomly.
There are multiple rows corresponding to each id.
This is the script I was trying:
Training_data, Test_data = sklearn.model_selection.train_test_split(data, data['ID_sample'].unique(), train_size=0.30, test_size=0.70, random_state=5)

I sorted the issue out in the following way:
# collect the unique sample IDs
samplelist = data["ID_sample"].unique()
# split the IDs 70/30, shuffled with a fixed random seed
training_samp, test_samp = sklearn.model_selection.train_test_split(samplelist, train_size=0.7, test_size=0.3, random_state=5, shuffle=True)
# keep every row whose ID fell into the corresponding split
training_data = data[data['ID_sample'].isin(training_samp)]
test_data = data[data['ID_sample'].isin(test_samp)]
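A quick optional sanity check, using the variables above, confirms that the unique IDs are split roughly 70/30 and that no ID ends up in both frames:
print(len(training_samp) / len(samplelist), len(test_samp) / len(samplelist))
print(set(training_samp) & set(test_samp))  # should print an empty set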

I am no sklearn expert, but I know a little about it, and this is a question newcomers have been asking ever since train_test_split came into existence.
Anyway, here is how you can solve it: import train_test_split from sklearn.model_selection and apply it. I have just created some random data and applied the same.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(100, 2))
>>> df
0 1
0 -1.214487 0.455726
1 -0.898623 0.268778
2 0.262315 -0.009964
3 0.612664 0.786531
4 1.249646 -1.020366
.. ... ...
95 -0.171218 1.083018
96 0.122685 -2.214143
97 -1.420504 0.469372
98 0.061177 0.465881
99 -0.262667 -0.406031
[100 rows x 2 columns]
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.3)
Here is your first dataframe, train:
>>> train
0 1
26 -2.651343 -0.864565
17 0.106959 -0.763388
78 -0.398269 -0.501073
25 1.452795 1.290227
47 0.158705 -1.123697
.. ... ...
29 -1.909144 -0.732514
7 0.641331 -1.336896
43 0.769139 2.816528
59 -0.683185 0.442875
11 -0.543988 -0.183677
[70 rows x 2 columns]
Here is the second dataframe, test:
>>> test
0 1
30 -1.562427 -1.448936
24 0.638780 1.868500
70 -0.572035 1.615093
72 0.660455 -0.331350
82 0.295644 -0.403386
22 0.942676 -0.814718
15 -0.208757 -0.112564
45 1.069752 -1.894040
18 0.600265 0.599571
93 -0.853163 1.646843
91 -1.172471 -1.488513
10 0.728550 1.637609
36 -0.040357 2.050128
4 1.249646 -1.020366
60 -0.907925 -0.290945
34 0.029384 0.452658
38 1.566204 -1.171910
33 -1.009491 0.105400
62 0.930651 -0.124938
42 0.401900 -0.472175
80 1.266980 -0.221378
95 -0.171218 1.083018
74 -0.160058 -1.383118
28 1.257940 0.604513
87 -0.136468 -0.109718
27 1.909935 -0.712136
81 -1.449828 -1.823526
61 0.176301 -0.885127
53 -0.593061 1.547997
57 -0.527212 0.781028
In your case, it should ideally work as below; note that you don't need to define train_size if you are defining test_size, or vice versa.
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3)
OR
>>> train, test = train_test_split(data['ID_sample'], test_size=0.3, random_state=5)
OR
This returns two arrays of unique IDs rather than dataframes ...
>>> train, test = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
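These calls split only the IDs, so to get the two data frames back you still have to select the matching rows, just as in the asker's own solution above. A minimal sketch (assuming data and its ID_sample column from the question):
>>> train_ids, test_ids = train_test_split(data['ID_sample'].unique(), test_size=0.30, random_state=5)
>>> training_data = data[data['ID_sample'].isin(train_ids)]
>>> test_data = data[data['ID_sample'].isin(test_ids)]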

Related

How to use multiprocessing pool for Pandas apply function

I want to use a multiprocessing pool with Pandas data frames.
I tried the following, but the error below occurs.
Can't I use a pool with a Series?
from multiprocessing import pool
split = np.array_split(split,4)
pool = Pool(processes=4)
df = pd.concat(pool.map(split['Test'].apply(lambda x : test(x)), split))
pool.close()
pool.join()
The error message is as follows.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not str
Try the following instead. The TypeError comes from split['Test']: after np.array_split, split is a plain list of chunks, so it cannot be indexed with a string; pool.map also expects a callable as its first argument and the chunks as the iterable:
import pandas as pd
import numpy as np
import multiprocessing as mp

def test(x):
    return x * 2

if __name__ == '__main__':
    # Demo dataframe
    df = pd.DataFrame({'Test': range(100)})
    # Extract the Series and split it into chunks
    split = np.array_split(df['Test'], 4)
    # Parallel processing
    with mp.Pool(4) as pool:
        data = pool.map(test, split)
    # Concatenate results
    out = pd.concat(data)
Output:
>>> df
Test
0 0
1 1
2 2
3 3
4 4
.. ...
95 95
96 96
97 97
98 98
99 99
[100 rows x 1 columns]
>>> out
0 0
1 2
2 4
3 6
4 8
...
95 190
96 192
97 194
98 196
99 198
Name: Test, Length: 100, dtype: int64

Create linear model to check correlation tokenize error

I have data like the sample below, which has 4 continuous columns [x0 to x3] and a binary column y with the two values 1.0 and 0.0. I'm trying to check for correlation between the binary column y and one of the continuous columns, x0, using the CatConCor function below, but I'm getting the error message below. The function creates a linear regression model and calculates the p-value for the residuals with and without the categorical variable. If anyone can point out the issue or how to fix it, it would be very much appreciated.
Data:
x_r x0 x1 x2 x3 y
0 0 0.466726 0.030126 0.998330 0.892770 0.0
1 1 0.173168 0.525810 -0.079341 -0.112151 0.0
2 2 -0.854467 0.770712 0.929614 -0.224779 0.0
3 3 -0.370574 0.568183 -0.928269 0.843253 0.0
4 4 -0.659431 -0.948491 -0.091534 0.706157 0.0
Code:
import numpy as np
import pandas as pd
from time import time
import scipy.stats as stats
from IPython.display import display  # Allows the use of display() for DataFrames
# Pretty display for notebooks
%matplotlib inline
###########################################
# Suppress matplotlib user warnings
# Necessary for newer version of matplotlib
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="matplotlib")
#
# Display inline matplotlib plots with IPython
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
###########################################
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# correlation between categorical variable and continuous variable
def CatConCor(df, catVar, conVar):
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    # subsetting data for one categorical column and one continuous column
    data2 = df.copy()[[catVar, conVar]]
    data2[catVar] = data2[catVar].astype('category')
    mod = ols(conVar + '~' + catVar,
              data=data2).fit()
    aov_table = sm.stats.anova_lm(mod, typ=2)
    if aov_table['PR(>F)'][0] < 0.05:
        print('Correlated p=' + str(aov_table['PR(>F)'][0]))
    else:
        print('Uncorrelated p=' + str(aov_table['PR(>F)'][0]))

# checking for correlation between categorical and continuous variables
CatConCor(df=train_df, catVar='y', conVar='x0')
Error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-6-80f83b8c8e14> in <module>()
1 # checking for correlation between categorical and continuous variables
2
----> 3 CatConCor(df=train_df,catVar='y',conVar='x0')
<ipython-input-2-35404ba1d697> in CatConCor(df, catVar, conVar)
103
104 mod = ols(conVar+'~'+catVar,
--> 105 data=data2).fit()
106
107 aov_table = sm.stats.anova_lm(mod, typ=2)
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
153
154 tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155 missing=missing)
156 ((endog, exog), missing_idx, design_info) = tmp
157
~/anaconda2/envs/py36/lib/python3.6/site-packages/statsmodels/formula/formulatools.py in handle_formula_data(Y, X, formula, depth, missing)
63 if data_util._is_using_pandas(Y, None):
64 result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65 NA_action=na_action)
66 else:
67 result = dmatrices(formula, Y, depth, return_type='dataframe',
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in dmatrices(formula_like, data, eval_env, NA_action, return_type)
308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310 NA_action, return_type)
311 if lhs.shape[1] == 0:
312 raise PatsyError("model is missing required outcome variables")
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/highlevel.py in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
60 "ascii-only, or else upgrade to Python 3.")
61 if isinstance(formula_like, str):
---> 62 formula_like = ModelDesc.from_formula(formula_like)
63 # fallthrough
64 if isinstance(formula_like, ModelDesc):
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/desc.py in from_formula(cls, tree_or_string)
162 tree = tree_or_string
163 else:
--> 164 tree = parse_formula(tree_or_string)
165 value = Evaluator().eval(tree, require_evalexpr=False)
166 assert isinstance(value, cls)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in parse_formula(code, extra_operators)
146 tree = infix_parse(_tokenize_formula(code, operator_strings),
147 operators,
--> 148 _atomic_token_types)
149 if not isinstance(tree, ParseNode) or tree.type != "~":
150 tree = ParseNode("~", None, [tree], tree.origin)
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/infix_parser.py in infix_parse(tokens, operators, atomic_types, trace)
208
209 want_noun = True
--> 210 for token in token_source:
211 if c.trace:
212 print("Reading next token (want_noun=%r)" % (want_noun,))
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _tokenize_formula(code, operator_strings)
92 else:
93 it.push_back((pytype, token_string, origin))
---> 94 yield _read_python_expr(it, end_tokens)
95
96 def test__tokenize_formula():
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/parse_formula.py in _read_python_expr(it, end_tokens)
42 origins = []
43 bracket_level = 0
---> 44 for pytype, token_string, origin in it:
45 assert bracket_level >= 0
46 if bracket_level == 0 and token_string in end_tokens:
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/util.py in next(self)
330 else:
331 # May raise StopIteration
--> 332 return six.advance_iterator(self._it)
333 __next__ = next
334
~/anaconda2/envs/py36/lib/python3.6/site-packages/patsy/tokens.py in python_tokenize(code)
33 break
34 origin = Origin(code, start, end)
---> 35 assert pytype not in (tokenize.NL, tokenize.NEWLINE)
36 if pytype == tokenize.ERRORTOKEN:
37 raise PatsyError("error tokenizing input "
AssertionError:
Upgrading patsy to 0.5.1 fixed the issue. I found the tip here:
https://github.com/statsmodels/statsmodels/issues/5343

variance_scaling_initializer() got an unexpected keyword argument 'distribution'

I want to predict these values over time (a regression neural network) using Python. I have two outputs and three inputs. When I run the code it gives me the error "variance_scaling_initializer() got an unexpected keyword argument 'distribution'". Can you help me solve the problem?
Here is my code:
n_neurons_1 = 24
n_neurons_2 = 12
n_target = 2
softmax = 2
weight_initializer = tf.contrib.layers.variance_scaling_initializer(mode= "FAN_AVG", distribution ="uniform", scale = softmax)
bias_initializer = tf.zeros_initializer()
w_hidden_1 = tf.Variable(weight_initializer([n_time_dimensions,n_neurons_1]))
bias_hidden_1= tf.Variable(bias_initializer([n_neurons_1]))
w_hidden_2= tf.Variable(weight_initializer([n_neurons_1,n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
w_out = tf.Variable(weight_initializer([n_neurons_2,2]))
bias_out = tf.Variable(bias_initializer([2]))
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, w_hidden_1),bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(X, w_hidden_2),bias_hidden_2))
out = tf.transpose(tf.add(tf.matmul(hidden_2, w_out),bias_out))
My dataset is,
date time g p c apparentg
6/8/2018 0:06:15 141 131 136 141
6/8/2018 0:09:25 95 117 95 95
6/8/2018 0:11:00 149 109 139 149
6/8/2018 0:13:50 120 103 95 120
6/8/2018 0:16:20 135 97 105 135
6/8/2018 0:19:00 63 NaN 97 63
6/8/2018 0:20:00 111 NaN 100 111
6/8/2018 0:22:10 115 NaN 115 115
6/8/2018 0:23:40 287 NaN NaN 287
error is,
TypeError Traceback (most recent call last)
<ipython-input-26-9ceeb97429b1> in <module>()
31 n_target = 2
32 softmax = 2
---> 33 weight_initializer = tf.contrib.layers.variance_scaling_initializer(mode= "FAN_AVG", distribution ="uniform", scale = softmax)
34 bias_initializer = tf.zeros_initializer()
35 w_hidden_1 = tf.Variable(weight_initializer([n_time_dimensions,n_neurons_1]))
TypeError: variance_scaling_initializer() got an unexpected keyword argument 'distribution'
Looking into Documentation https://www.tensorflow.org/api_docs/python/tf/contrib/layers/variance_scaling_initializer
tf.contrib.layers.variance_scaling_initializer(
factor=2.0,
mode='FAN_IN',
uniform=False,
seed=None,
dtype=tf.float32
)
and
uniform: Whether to use uniform or normal distributed random initialization.
So try
uniform=True
instead of
distribution="uniform"
in your function call
tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG", distribution="uniform", scale=softmax)
Also, there is no scale= argument in that function; the closest equivalent in the signature above is factor=.
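Putting both fixes together, a minimal sketch of the corrected call could look like the line below (assuming your softmax value is meant to be the scaling factor; rename or drop it as needed):
weight_initializer = tf.contrib.layers.variance_scaling_initializer(factor=softmax, mode="FAN_AVG", uniform=True)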

How to get the batch size inside lambda layer

I'm trying to implement a layer (via lambda layer) which is doing the following numpy procedure:
def func(x, n):
return np.concatenate((x[:, :n], np.tile(x[:, n:].mean(axis = 0), (x.shape[0], 1))), axis = 1)
I'm stuck because I don't know how to get the size of the first dimension of x (which is the batch size). The backend function int_shape(x) returns (None, ...).
So, if I know the batch_size, the corresponding Keras procedure would be:
def func(x, n):
return K.concatenate([x[:, :n], K.tile(K.mean(x[:, n:], axis=0), [batch_size, 1])], axis = 1)
Just as @pitfall says, the second argument of K.tile should be a tensor.
According to the Keras backend docs, K.shape returns a tensor while K.int_shape returns a tuple of int or None entries. So the correct way is to use K.shape. Following is the MWE:
import keras.backend as K
from keras.layers import Input, Lambda
from keras.models import Model
import numpy as np
batch_size = 8
op_len = ip_len = 10
def func(X):
    return K.tile(K.mean(X, axis=0, keepdims=True), (K.shape(X)[0], 1))
ip = Input((ip_len,))
lbd = Lambda(lambda x:func(x))(ip)
model = Model(ip, lbd)
model.summary()
model.compile('adam', loss='mse')
X = np.random.randn(batch_size*100, ip_len)
Y = np.random.randn(batch_size*100, op_len)
#no parameters to train!
#model.fit(X,Y,batch_size=batch_size)
#prediction
np_result = np.tile(np.mean(X[:batch_size], axis=0, keepdims=True),
(batch_size,1))
pred_result = model.predict(X[:batch_size])
print(np.allclose(np_result, pred_result))
You should not use K.int_shape, but something like tf.shape, which will give you a dynamic shape.
New update
Here is a solution without using tile:
from keras.layers import *
from keras.models import *
# define the lambda layer
n = 5
MyConcat = Lambda(lambda x: K.concatenate([x[:, :n],
                                           K.ones_like(x[:, n:]) * K.mean(x[:, n:], axis=0)],
                                          axis=1))
# make a dummy testing model
x = Input(shape=(10,))
y = MyConcat(x)
mm = Model(inputs=x, outputs=y)
# test code
a = np.arange(40).reshape([-1,10])
print(a)
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]]
b = mm.predict(a)
print(b)
[[ 0. 1. 2. 3. 4. 20. 21. 22. 23. 24.]
[10. 11. 12. 13. 14. 20. 21. 22. 23. 24.]
[20. 21. 22. 23. 24. 20. 21. 22. 23. 24.]
[30. 31. 32. 33. 34. 20. 21. 22. 23. 24.]]
One last thing worth mentioning: in Keras you are not allowed to change the batch size within a layer, i.e. the output and input batch size of a layer MUST be the same.
Create a functor and give it the batch size at initialization.
class SuperLoss:
    def __init__(self, batch_size):
        self.batch_size = batch_size
    def __call__(self, y_true, y_pred):
        self.batch_size ....
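The body of __call__ is elided in the original, so as a hedged usage sketch only: an instance of the functor can be passed wherever Keras expects a loss callable, and the stored batch size is then available inside __call__ (model, X and Y as in the MWE above):
loss_fn = SuperLoss(batch_size=batch_size)
model.compile(optimizer='adam', loss=loss_fn)
model.fit(X, Y, batch_size=batch_size)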

Adding Predictors to random forest classifier (Pandas, Python3,Sklearn)

I'm trying to learn how to use a random forest classifier in Python by adapting the code found here: http://blog.yhathq.com/posts/random-forests-in-python.html to a fake data set.
I'm trying to predict whether a person is male (0) or female (1) based on their weight and height.
Data looks like so:
Weight Height Gender
150 60 1
250 85 0
175 75 0
100 62 1
90 58 1
200 80 0
... ... ...
165 66 0
Now I'm trying to classify the test set into male and female
Here is the code:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
xl = pd.ExcelFile('fakedata.xlsx')
df = xl.parse()
df.head()
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
features = df.columns[:2]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['Gender'])
clf.fit(train[features], y)
I understand what this code accomplishes up to here but I run into problems after:
preds = train['Gender'][clf.predict(test[features])]
print(pd.crosstab(test['Gender'], preds, rownames=['actual'], colnames=['preds']))
gives me the error
ValueError: cannot reindex from a duplicate axis
What exactly am I missing?
You shouldn't index by the predictions in your line preds = train['Gender'][clf.predict(test[features])]. Your predictions should simply be
preds = clf.predict(test[features])
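For completeness, a small sketch of the corrected evaluation (using the train/test split and clf from the question; note that the predicted values are the integer codes returned by pd.factorize, not necessarily the original 0/1 labels):
preds = clf.predict(test[features])
print(pd.crosstab(test['Gender'], preds, rownames=['actual'], colnames=['preds']))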
