I have a function that produces an output like so when I pass it a name:
W2V('aamir')
array([ 0.12135 , -0.99132 , 0.32347 , 0.31334 , 0.97446 , -0.67629 ,
0.88606 , -0.11043 , 0.79434 , 1.4788 , 0.53169 , 0.95331 ,
-1.1883 , 0.82438 , -0.027177, 0.70081 , 0.87467 , -0.095825,
-0.5937 , 1.4262 , 0.2187 , 1.1763 , 1.6294 , 0.91717 ,
-0.086697, 0.16529 , 0.19095 , -0.39362 , -0.40367 , 0.83966 ,
-0.25251 , 0.46286 , 0.82748 , 0.93061 , 1.136 , 0.85616 ,
0.34705 , 0.65946 , -0.7143 , 0.26379 , 0.64717 , 1.5633 ,
-0.81238 , -0.44516 , -0.2979 , 0.52601 , -0.41725 , 0.086686,
0.68263 , -0.15688 ], dtype=float32)
I have a data frame that has an index Name and a single column Y:
df1
Y
Name
aamir 0
aaron 0
... ...
zulema 1
zuzana 1
I wish to run my function on each value of Name and have it create columns like so:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
Name
aamir 0.12135 -0.99132 0.32347 0.31334 0.97446 -0.67629 0.88606 -0.11043 0.794340 1.47880 ... 0.647170 1.56330 -0.81238 -0.445160 -0.29790 0.52601 -0.41725 0.086686 0.68263 -0.15688
aaron -1.01850 0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250 -0.17397 -0.061715 0.55292 ... -0.144960 0.82696 -0.51106 -0.072066 0.43069 0.32686 -0.00886 -0.850310 -1.31530 0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema 0.56547 0.30961 0.48725 1.41000 -0.76790 0.39908 0.86915 0.68361 -0.019467 0.55199 ... 0.062091 0.62614 0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900 0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853 0.77911 0.44446 0.95019 0.40513 0.26643 0.075182 -1.34340 ... 1.102800 0.51495 1.06230 -1.587600 -0.44667 1.04600 -0.38978 0.741240 0.39457 0.22857
What I have done is really messy, but it works:
names = df1.index.to_list()
Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)
I am hoping there is a way to use .apply() or similar but I have not found how to do this. I am looking for an efficient way.
Update:
I modified my function as follows:
if isinstance(w, pd.core.series.Series):
    w = w.to_string()
Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply the numbers are totally different:
df1
Name Y
0 aamir 0
1 aaron 0
... ... ...
7942 zulema 1
7943 zuzana 1
df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.075014 0.824769 0.580976 0.493415 0.409894 0.142214 0.202602 -0.599501 -0.213184 -0.142188 ... 0.627784 0.136511 -0.162938 0.095707 -0.257638 0.396822 0.208624 -0.454204 0.153140 0.803400
1 0.073664 0.868665 0.574581 0.538951 0.394502 0.134773 0.233070 -0.639365 -0.194892 -0.110557 ... 0.722513 0.147112 -0.239356 -0.046832 -0.237434 0.321494 0.206583 -0.454038 0.251605 0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942 -0.002117 0.894570 0.834724 0.602266 0.327858 -0.003092 0.197389 -0.675813 -0.311369 -0.174356 ... 0.690172 -0.085517 -0.000235 -0.214937 -0.290900 0.361734 0.290184 -0.497177 0.285071 0.711388
7943 -0.047621 0.850352 0.729225 0.515870 0.439999 0.060711 0.226026 -0.604846 -0.344891 -0.128396 ... 0.557035 -0.048322 -0.070075 -0.265775 -0.330709 0.281492 0.304157 -0.552191 0.281502 0.750304
7944 rows × 50 columns
You can see that the first row is aamir, and the first value (column 0) my function returns for it is 0.12135 (see the top of my post). Yet with apply it appears to be 0.075014.
EDIT:
It appears that it passes in the string "Name aamir" rather than "aamir". How can I get it to send just the name itself, aamir?
Let's say we have some function which transforms a string into a vector of a fixed size, for example:
import numpy as np
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
Also a data frame is given with a meaningful index and junk data:
import pandas as pd
names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)
When we apply a function to a DataFrame with axis=1, the function is called once per row, and its argument is that row as a Series whose name attribute is the row's index label. So we could do something like this:
output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')
With result_type='expand', returned vectors will be transformed into columns, which is the required output.
P.S. As an option:
df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')
P.P.S. IMO The behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In case of the code:
df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.
This first section generates toy versions of your W2V function and your dataframe.
# substitute your W2V function for this:
from random import random
import pandas as pd
n = 5
def W2V(name: str):
    return [random() for i in range(n)]
# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name':['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))
df1 is
Name Y
0 aamir 0
1 aaron 0
2 zulema 1
3 zuzana 1
You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.
df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                    index=[i for i in range(n)]),
                axis=1)
My random-valued df2 is
0 1 2 3 4
0 0.242761 0.415253 0.940213 0.074455 0.444372
1 0.935781 0.968155 0.850091 0.064548 0.737655
2 0.204053 0.845252 0.967767 0.352254 0.028609
3 0.853164 0.698195 0.292238 0.982009 0.402736
Then concatenate the new dataframe to the old one:
df3 = pd.concat([df1, df2], axis=1)
df3 is
Name Y 0 1 2 3 4
0 aamir 0 0.242761 0.415253 0.940213 0.074455 0.444372
1 aaron 0 0.935781 0.968155 0.850091 0.064548 0.737655
2 zulema 1 0.204053 0.845252 0.967767 0.352254 0.028609
3 zuzana 1 0.853164 0.698195 0.292238 0.982009 0.402736
Alternatively, you could do both steps in one line as:
df1 = pd.concat([df1,
                 df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                               index=[i for i in range(n)]),
                           axis=1)],
                axis=1)
You can try something like this using map and np.vstack with a dataframe constructor then join:
df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))
Output:
Y 0 1 2 3 4 5 6 7 8 9
A 0 4 0 2 1 0 0 0 0 3 3
B 1 4 0 0 4 4 3 4 3 4 3
C 2 1 5 5 5 3 3 1 3 5 0
D 3 3 5 1 3 4 2 3 1 0 1
E 4 4 0 2 4 4 0 3 3 4 2
F 5 4 3 5 1 0 2 3 2 5 2
G 6 4 5 2 0 0 2 4 3 4 3
H 7 0 2 5 2 3 4 3 5 3 1
I 8 2 2 0 1 4 2 4 1 0 4
J 9 0 2 3 5 0 3 0 2 4 0
Using @Vitalizzare's function:
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
df = pd.DataFrame({'Y': np.arange(10)}, index = [*'ABCDEFGHIJ'])
I am assuming the names are the index and that there is a useless column called 0. I think this may be the solution, but there is no way to know without your function or the names:
df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')
I would do simply:
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
Example
To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random and draw a random vector of size 50:
import numpy as np
import pandas as pd
from contextlib import contextmanager
@contextmanager
def temp_seed(seed):
    # temporarily seed np.random, then restore the previous global state
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)
mask = (1 << 32) - 1
def w2v(name, size=50):
    # take the single hash value explicitly before converting it to int
    fingerprint = int(pd.util.hash_array(np.array([name]))[0])
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)
For instance:
>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
0.31315156, 0.22802972, -0.96076167, 0.62577993, -0.59024811,
0.76365736, 0.93033898, -0.56155296, 0.4760905 , -0.92760642,
0.00177959, -0.22761559, 0.81929959, 0.21138229, -0.49882747,
-0.97637984, -0.19452496, -0.91354933, 0.70473533, -0.30394358,
-0.47092087, -0.0329302 , -0.93178517, 0.79118799, 0.98286834,
-0.16024194, -0.02793147, -0.52251214, -0.70732759, 0.10098142,
-0.24880249, 0.28930319, -0.53444863, 0.37887522, 0.58544068,
0.85804119, 0.67048213, 0.58389158, -0.19889071, -0.04281131,
-0.62506126, 0.42872395, -0.12821543, -0.52458052, -0.35493892])
Now, we use the expression given as solution:
df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
>>> newdf
0 1 2 3 4 5 6 ...
aamir 0.654469 -0.927651 -0.781886 -0.626838 -0.239468 0.313152 0.228030 ...
aaron -0.380524 -0.850608 -0.914642 -0.578885 0.177975 -0.633761 -0.736234 ...
zulema -0.250957 0.882491 -0.197833 -0.707652 0.754575 0.731236 -0.770831 ...
zuzana -0.641296 0.065898 0.466784 0.652776 0.391865 0.918761 0.022798 ...
I want to replace append() with pd.concat() from pandas, but when I make the replacement my output is different. Thank you.
old with append():
def gettrigger(self):
    dfx = pd.DataFrame()
    for i in range(self.lags + 1):
        mask = (self.df["%K"].shift(i) < 20) & (self.df["%D"].shift(i) < 20)
        dfx = dfx.append(mask, ignore_index=True)
    return dfx.sum(axis=0)
output with append()
new with pd.concat():
def gettrigger(self):
    dfx = pd.DataFrame()
    for i in range(self.lags + 1):
        mask = (self.df["%K"].shift(i) < 20) & (self.df["%D"].shift(i) < 20)
        #dfx = dfx.append(mask, ignore_index=True)
        dfx = pd.concat([dfx, mask], ignore_index=True)
    return dfx.sum(axis=0)
output with pd.concat()
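DataFrame.append adds a Series to your DataFrame as a new row, whereas pd.concat treats a Series as a column. The two loops below build the same frame: the first appends each column as a row, the second concatenates each row as a column with axis=1. (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.)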
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,4))
df
0 1 2 3
0 0.403637 -0.204563 -2.070799 0.759681
1 -0.684890 -0.651969 1.028248 0.129635
2 -1.011895 1.877984 -0.724938 0.869389
3 -1.344031 0.106898 1.762096 0.088816
4 -0.382107 1.822279 0.435752 -2.573199
5 -1.173345 -1.224242 0.887549 -0.816519
6 -1.269713 0.201384 1.576388 -1.355996
7 -0.459533 0.961355 -0.280733 -0.026496
8 0.522513 0.266246 1.066807 -0.232884
9 -0.688417 0.181908 0.356574 -0.245040
dfx = pd.DataFrame()
for i in range(df.shape[1]): # Iterating through columns
    df_new = df.loc[:, i] # loc column
    dfx = dfx.append(df_new) # append column as new row
dfx
0 1 2 3 4 5 6 7 8 9
0 0.403637 -0.684890 -1.011895 -1.344031 -0.382107 -1.173345 -1.269713 -0.459533 0.522513 -0.688417
1 -0.204563 -0.651969 1.877984 0.106898 1.822279 -1.224242 0.201384 0.961355 0.266246 0.181908
2 -2.070799 1.028248 -0.724938 1.762096 0.435752 0.887549 1.576388 -0.280733 1.066807 0.356574
3 0.759681 0.129635 0.869389 0.088816 -2.573199 -0.816519 -1.355996 -0.026496 -0.232884 -0.245040
dfx = pd.DataFrame()
for i in range(df.shape[0]): # Iterating through rows
    df_new = df.loc[i, :] # loc row
    dfx = pd.concat([dfx, df_new], axis=1) # append row as column (axis=1)
dfx
0 1 2 3 4 5 6 7 8 9
0 0.403637 -0.684890 -1.011895 -1.344031 -0.382107 -1.173345 -1.269713 -0.459533 0.522513 -0.688417
1 -0.204563 -0.651969 1.877984 0.106898 1.822279 -1.224242 0.201384 0.961355 0.266246 0.181908
2 -2.070799 1.028248 -0.724938 1.762096 0.435752 0.887549 1.576388 -0.280733 1.066807 0.356574
3 0.759681 0.129635 0.869389 0.088816 -2.573199 -0.816519 -1.355996 -0.026496 -0.232884 -0.245040
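Applied to the function in the question, one way to get the same counts without append might look like the sketch below. It is written as a free function rather than a method, and the names get_trigger and level as well as the sample prices are illustrative assumptions, not the asker's real code or data.

```python
import pandas as pd

def get_trigger(df: pd.DataFrame, lags: int, level: float = 20) -> pd.Series:
    """Count, per row, how many of the current and previous `lags` rows
    had both %K and %D below `level`."""
    masks = [
        (df["%K"].shift(i) < level) & (df["%D"].shift(i) < level)
        for i in range(lags + 1)
    ]
    # Each mask is a Series, so axis=1 places them side by side as columns.
    # Summing across columns mirrors the old row-wise append + sum(axis=0).
    return pd.concat(masks, axis=1).sum(axis=1)

prices = pd.DataFrame({"%K": [10, 30, 10, 10, 50], "%D": [10, 10, 10, 30, 10]})
trigger = get_trigger(prices, lags=2)
print(trigger.tolist())  # [1, 1, 2, 1, 1]
```

Collecting the masks in a list and concatenating once also avoids growing a DataFrame inside the loop.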
I have a dataframe with four columns and want to pass the elements in one column (e.g. column b) through a function, and then add the result back into the dataframe as a new, fifth column. The problem is that the output is only one number, for example [6.782468322626846], but I would expect four numbers. What am I doing wrong?
The DataFrame df_prices looks something like this:
a b c d
0 354.24 322.02 10 3.729
1 352.04 320.04 10 3.906
2 349.98 318.17 10 4.072
3 347.82 316.20 10 4.246
Column b needs to be passed to this function:
interest = 0.02
mean_end_price = 400
T = 1
exchange = 1.1199
def func(b):
for i in b:
b = b * (1 + interest) * np.sqrt(T)
nvp_price = ((mean_end_price - b)/(1 + interest) * np.sqrt(T))/(ratio * exchange)
output.append(nvp_price)
df_prices['e'] = func(df_prices['b'])
This can be solved using numpy and vectorization, see example below:
import pandas
import numpy as np
# a b c d
# 0 354.24 322.02 10 3.729
# 1 352.04 320.04 10 3.906
# 2 349.98 318.17 10 4.072
# 3 347.82 316.20 10 4.246
df_prices = pandas.DataFrame(columns=['a','b','c','d'], index=['0','1','2','3'])
df_prices.loc['0'] = [354.24, 322.02, 10, 3.729]
df_prices.loc['1'] = [352.04 , 320.04 , 10 , 3.906]
df_prices.loc['2'] = [349.98 , 318.17 , 10 , 4.072]
df_prices.loc['3'] = [347.82 , 316.20 , 10 ,4.246]
print(df_prices)
interest = 0.02
mean_end_price = 400
T = 1
exchange = 1.1199
ratio = .5
def func(b):
    b = np.asarray(b)
    b = b * (1 + interest) * np.sqrt(T)
    nvp_price = ((mean_end_price - b)/(1 + interest) * np.sqrt(T))/(ratio * exchange)
    return nvp_price
df_prices['e'] = list(func(df_prices['b']))
print(df_prices)
Try using apply, which calls func once per element (this assumes func accepts a single number and returns the result):
df_prices['e'] = df_prices['b'].apply(func)
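The vectorized call and the element-wise apply should produce the same column, provided the function returns its result and contains no loop. A minimal sketch, assuming ratio = 0.5 as in the first answer (the question never defines ratio) and using only column b from the sample data:

```python
import numpy as np
import pandas as pd

interest = 0.02
mean_end_price = 400
T = 1
exchange = 1.1199
ratio = 0.5  # assumption: not defined in the question, value borrowed from the answer

def func(b):
    # Works for one number or a whole Series, because every operation broadcasts.
    b = b * (1 + interest) * np.sqrt(T)
    return ((mean_end_price - b) / (1 + interest) * np.sqrt(T)) / (ratio * exchange)

df_prices = pd.DataFrame({"b": [322.02, 320.04, 318.17, 316.20]})
df_prices["e_vectorized"] = func(df_prices["b"])   # one call on the whole column
df_prices["e_apply"] = df_prices["b"].apply(func)  # one call per element
print(df_prices)
```

On large frames the vectorized call is usually the faster of the two, since apply invokes the Python function once per row.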
I have a dataframe, referenced as df in the code, and I'm applying aggregate functions to multiple columns of each group. I also apply the user-defined lambda functions f4, f5, f6 and f7. Some of these are very similar, such as f4, f6 and f7, where only the parameter value differs. Can I pass these parameters from the dictionary d, so that I only have to write one function instead of several?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f6,
'acc_rate':f7,
'bearing':['sum', f4],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write a function like
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(0.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
The csv file of dataframe df is available at given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
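It is possible, but not easy; see the solution by neilaronson.
The solution can also be simplified by summing the True values of the boolean mask.
Note that this first version always counts values below p; the EDIT further down adds a parameter for the direction of the comparison.
def f4(p):
    def ipf(x):
        return (x < p).sum()
        #your solution
        #return len(x[x < p])
    ipf.__name__ = 'Frequency'
    return ipf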
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter to choose between greater and less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4, 'less'), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2, 'greater'),
'acc_rate':f4(.25, 'greater'),
'bearing':['sum', f4(10, 'greater')],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
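For readers without the CSV, the same closure idea can be tried on a tiny made-up frame. In this sketch the factory name count_where, its name argument and the sample values are illustrative assumptions; only the pattern (a function that returns a named aggregator) comes from the answer above.

```python
import pandas as pd

def count_where(threshold, op="greater", name="Frequency"):
    """Return an aggregator that counts values above or below threshold."""
    def counter(x):
        if op == "greater":
            return int((x > threshold).sum())
        if op == "less":
            return int((x < threshold).sum())
        raise ValueError("op has to be 'greater' or 'less'")
    # groupby.agg uses __name__ for the resulting column label.
    counter.__name__ = name
    return counter

df = pd.DataFrame({
    "trip_id":  [1, 1, 1, 2, 2],
    "velocity": [1.0, 4.0, 5.0, 2.0, 3.0],
    "bearing":  [5, 20, 30, 11, 2],
})

agg_spec = {
    "velocity": [count_where(3.4, "less", "stop_frequency"), "sum"],
    "bearing":  [count_where(10, "greater")],
}
out = df.groupby("trip_id", sort=False).agg(agg_spec)
out.columns = out.columns.map("_".join)
print(out)
```

Giving each aggregator its own name is worth the extra argument when two counters are applied to the same column, since the name becomes the column label.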
I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.
For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.
How would I achieve that?
Use numpy.split:
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
A B C D E
0 0.543405 0.278369 0.424518 0.844776 0.004719
1 0.121569 0.670749 0.825853 0.136707 0.575093
2 0.891322 0.209202 0.185328 0.108377 0.219697
3 0.978624 0.811683 0.171941 0.816225 0.274074
print (b)
A B C D E
4 0.431704 0.940030 0.817649 0.336112 0.175410
5 0.372832 0.005689 0.252426 0.795663 0.015255
6 0.598843 0.603805 0.105148 0.381943 0.036476
7 0.890412 0.980921 0.059942 0.890546 0.576901
8 0.742480 0.630184 0.581842 0.020439 0.210027
9 0.544685 0.769115 0.250695 0.285896 0.852395
print (c)
A B C D E
10 0.975006 0.884853 0.359508 0.598859 0.354796
11 0.340190 0.178081 0.237694 0.044862 0.505431
12 0.376252 0.592805 0.629942 0.142600 0.933841
13 0.946380 0.602297 0.387766 0.363188 0.204345
14 0.276765 0.246536 0.173608 0.966610 0.957013
15 0.597974 0.731301 0.340385 0.092056 0.463498
16 0.508699 0.088460 0.528035 0.992158 0.395036
17 0.335596 0.805451 0.754349 0.313066 0.634037
18 0.540405 0.296794 0.110788 0.312640 0.456979
19 0.658940 0.254258 0.641101 0.200124 0.657625
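Calling np.split on a DataFrame may emit a FutureWarning in newer pandas versions, so a variant that uses only positional slicing can be handy. The following is a minimal sketch of the same ordered 20% / 30% / 50% split; the helper name split_in_order and the sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_in_order(df: pd.DataFrame, fracs):
    """Split df into consecutive blocks whose sizes follow fracs, keeping row order."""
    # Cumulative fractions give the cut points; the last block takes the remainder.
    cuts = (np.cumsum(fracs[:-1]) * len(df)).astype(int)
    bounds = [0, *cuts, len(df)]
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list("ABCDE"))
a, b, c = split_in_order(df, [0.2, 0.3, 0.5])
print(len(a), len(b), len(c))  # 4 6 10
```

As with int(.2*len(df)) above, the cut points are truncated toward zero, so block sizes can be off by one row when the fractions do not divide the length evenly.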
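Creating a dataframe with a random 70% of the rows of the original dataframe (note that sample picks rows at random rather than taking the first 70%):
part_1 = df.sample(frac = 0.7)
Creating a dataframe with the remaining 30% of the rows:
part_2 = df.drop(part_1.index)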
I've written a simple function that does the job.
Maybe it will help you.
P.S.:
The sum of the fractions must be 1.
It returns len(fracs) new dataframes, so the list of fractions can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).
np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))
def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
    # compare with a tolerance, since floating-point sums are not always exactly 1.0
    assert np.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        # rescale the fraction relative to the rows that are still left
        fractions_sum=sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain=remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]
train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]
print(train.shape, test.shape, val.shape)
outputs:
(79, 4) (10, 4) (10, 4)