I have a function that produces an output like so when I pass it a name:
W2V('aamir')
array([ 0.12135 , -0.99132 , 0.32347 , 0.31334 , 0.97446 , -0.67629 ,
0.88606 , -0.11043 , 0.79434 , 1.4788 , 0.53169 , 0.95331 ,
-1.1883 , 0.82438 , -0.027177, 0.70081 , 0.87467 , -0.095825,
-0.5937 , 1.4262 , 0.2187 , 1.1763 , 1.6294 , 0.91717 ,
-0.086697, 0.16529 , 0.19095 , -0.39362 , -0.40367 , 0.83966 ,
-0.25251 , 0.46286 , 0.82748 , 0.93061 , 1.136 , 0.85616 ,
0.34705 , 0.65946 , -0.7143 , 0.26379 , 0.64717 , 1.5633 ,
-0.81238 , -0.44516 , -0.2979 , 0.52601 , -0.41725 , 0.086686,
0.68263 , -0.15688 ], dtype=float32)
I have a data frame that has an index Name and a single column Y:
df1
Y
Name
aamir 0
aaron 0
... ...
zulema 1
zuzana 1
I wish to run my function on each value of Name and have it create columns like so:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
Name
aamir 0.12135 -0.99132 0.32347 0.31334 0.97446 -0.67629 0.88606 -0.11043 0.794340 1.47880 ... 0.647170 1.56330 -0.81238 -0.445160 -0.29790 0.52601 -0.41725 0.086686 0.68263 -0.15688
aaron -1.01850 0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250 -0.17397 -0.061715 0.55292 ... -0.144960 0.82696 -0.51106 -0.072066 0.43069 0.32686 -0.00886 -0.850310 -1.31530 0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema 0.56547 0.30961 0.48725 1.41000 -0.76790 0.39908 0.86915 0.68361 -0.019467 0.55199 ... 0.062091 0.62614 0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900 0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853 0.77911 0.44446 0.95019 0.40513 0.26643 0.075182 -1.34340 ... 1.102800 0.51495 1.06230 -1.587600 -0.44667 1.04600 -0.38978 0.741240 0.39457 0.22857
What I have done is real messy, but works:
names = df1.index.to_list()
Lst = []
for name in names:
Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)
I am hoping there is a way to use .apply() or similar but I have not found how to do this. I am looking for an efficient way.
Update:
I modified my function to do like so:
if isinstance(w, pd.core.series.Series):
w = w.to_string()
Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply the numbers are totally different:
df1
Name Y
0 aamir 0
1 aaron 0
... ... ...
7942 zulema 1
7943 zuzana 1
df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.075014 0.824769 0.580976 0.493415 0.409894 0.142214 0.202602 -0.599501 -0.213184 -0.142188 ... 0.627784 0.136511 -0.162938 0.095707 -0.257638 0.396822 0.208624 -0.454204 0.153140 0.803400
1 0.073664 0.868665 0.574581 0.538951 0.394502 0.134773 0.233070 -0.639365 -0.194892 -0.110557 ... 0.722513 0.147112 -0.239356 -0.046832 -0.237434 0.321494 0.206583 -0.454038 0.251605 0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942 -0.002117 0.894570 0.834724 0.602266 0.327858 -0.003092 0.197389 -0.675813 -0.311369 -0.174356 ... 0.690172 -0.085517 -0.000235 -0.214937 -0.290900 0.361734 0.290184 -0.497177 0.285071 0.711388
7943 -0.047621 0.850352 0.729225 0.515870 0.439999 0.060711 0.226026 -0.604846 -0.344891 -0.128396 ... 0.557035 -0.048322 -0.070075 -0.265775 -0.330709 0.281492 0.304157 -0.552191 0.281502 0.750304
7944 rows × 50 columns
You can see that the first row is aamir and the first value (column 0) my function returns is 0.1213 (You can see this at the top of my post). Yet with apply that appears to be 0.075014
EDIT:
It appears it passes in Name aamir rather than aamir. How can I get it to just send the Name itself aamir?
Let's say we have some function which transforms a string into a vector of a fixed size, for example:
import numpy as np
def W2V(name: str) -> np.ndarray:
low, high, size = 0, 5, 10
rng = np.random.default_rng(abs(hash(name)))
return rng.integers(low, high, size, endpoint=True)
Also a data frame is given with a meaningful index and junk data:
import pandas as pd
names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)
When we apply some function to a DataFrame along columns, i.e. axis=1, its argument is gonna be a row as Series wich name is an index of the row. So we could do something like this:
output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')
With result_type='expand', returned vectors will be transformed into columns, which is the required output.
P.S. As an option:
df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')
P.P.S. IMO The behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In case of the code:
df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.
This first section generates toy versions of your W2V function and your dataframe.
# substitute your W2V function for this:
n = 5
def W2V(name: str):
return [random() for i in range(n)]
# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name':['aamir', 'aaron', 'zulema', 'zuzana'],
'Y': [0, 0, 1, 1]},
index=list(range(4)))
df1 is
Name Y
0 aamir 0
1 aaron 0
2 zulema 1
3 zuzana 1
You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ... n], where n is the length of the array returned by W2V.
df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
index=[i for i in range(n)]),
axis=1)
My random-valued df2 is
0 1 2 3 4
0 0.242761 0.415253 0.940213 0.074455 0.444372
1 0.935781 0.968155 0.850091 0.064548 0.737655
2 0.204053 0.845252 0.967767 0.352254 0.028609
3 0.853164 0.698195 0.292238 0.982009 0.402736
Then concatenate the new dataframe to the old one:
df3 = pd.concat([df1, df2], axis=1)
df3 is
Name Y 0 1 2 3 4
0 aamir 0 0.242761 0.415253 0.940213 0.074455 0.444372
1 aaron 0 0.935781 0.968155 0.850091 0.064548 0.737655
2 zulema 1 0.204053 0.845252 0.967767 0.352254 0.028609
3 zuzana 1 0.853164 0.698195 0.292238 0.982009 0.402736
Alternatively, you could do both steps in one line as:
df1 = pd.concat([df1,
df1.apply(lambda x: pd.Series(W2V(x['Name']),
index=[i for i in range(n)]),
axis=1)],
axis=1)
You can try something like this using map and np.vstack with a dataframe constructor then join:
df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))
Output:
Y 0 1 2 3 4 5 6 7 8 9
A 0 4 0 2 1 0 0 0 0 3 3
B 1 4 0 0 4 4 3 4 3 4 3
C 2 1 5 5 5 3 3 1 3 5 0
D 3 3 5 1 3 4 2 3 1 0 1
E 4 4 0 2 4 4 0 3 3 4 2
F 5 4 3 5 1 0 2 3 2 5 2
G 6 4 5 2 0 0 2 4 3 4 3
H 7 0 2 5 2 3 4 3 5 3 1
I 8 2 2 0 1 4 2 4 1 0 4
J 9 0 2 3 5 0 3 0 2 4 0
Using #Vitalizzare function:
def W2V(name: str) -> np.ndarray:
low, high, size = 0, 5, 10
rng = np.random.default_rng(abs(hash(name)))
return rng.integers(low, high, size, endpoint=True)
df = pd.DataFrame({'Y': np.arange(10)}, index = [*'ABCDEFGHIJ'])
I am going off the names being the axis, and there being a useless column called 0. I think this may be the solution, no way to know without your function or the names
df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')
I would do simply:
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
Example
To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random, and then draw a random vector size=50:
import numpy as np
import pandas as pd
from contextlib import contextmanager
#contextmanager
def temp_seed(seed):
state = np.random.get_state()
np.random.seed(seed)
try:
yield
finally:
np.random.set_state(state)
mask = (1 << 32) - 1
def w2v(name, size=50):
fingerprint = int(pd.util.hash_array(np.array([name])))
with temp_seed(fingerprint & mask):
return np.random.uniform(-1, 1, size)
For instance:
>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
0.31315156, 0.22802972, -0.96076167, 0.62577993, -0.59024811,
0.76365736, 0.93033898, -0.56155296, 0.4760905 , -0.92760642,
0.00177959, -0.22761559, 0.81929959, 0.21138229, -0.49882747,
-0.97637984, -0.19452496, -0.91354933, 0.70473533, -0.30394358,
-0.47092087, -0.0329302 , -0.93178517, 0.79118799, 0.98286834,
-0.16024194, -0.02793147, -0.52251214, -0.70732759, 0.10098142,
-0.24880249, 0.28930319, -0.53444863, 0.37887522, 0.58544068,
0.85804119, 0.67048213, 0.58389158, -0.19889071, -0.04281131,
-0.62506126, 0.42872395, -0.12821543, -0.52458052, -0.35493892])
Now, we use the expression given as solution:
df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
>>> newdf
0 1 2 3 4 5 6 ...
aamir 0.654469 -0.927651 -0.781886 -0.626838 -0.239468 0.313152 0.228030 ...
aaron -0.380524 -0.850608 -0.914642 -0.578885 0.177975 -0.633761 -0.736234 ...
zulema -0.250957 0.882491 -0.197833 -0.707652 0.754575 0.731236 -0.770831 ...
zuzana -0.641296 0.065898 0.466784 0.652776 0.391865 0.918761 0.022798 ...
I have a df describing transactions like
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
1 1.457423e+09 1821.0 1732
2 1.457389e+09 35577.0 18397
3 1.457425e+09 2.0 0
[...]
I assume the charged_energy is linear through the transaction. I would like to transform it to a time series with the granularity of a day. charged_energy within a day should be summed up as well as duration.
day sum_duration_in_s sum_charged_energy_in_wh
2016-03-16 00:00 123 456
2016-03-17 00:00 456 789
2016-03-18 00:00 789 012
[...]
Any idea? I am struggling with the borders between days. This transaction with
transaction start_in_s_since_epoch duration_in_s charged_energy_in_wh
500 1620777300 600 1000
should be equally divided to
day sum_duration_in_s sum_charged_energy_in_wh
2021-05-11 00:00 300 500
2021-05-11 00:00 300 500
This did it for me. Slow af but works:
from datetime import datetime
from datetime_truncate import truncate
df_tmp = pd.DataFrame()
for index, row in df.iterrows():
day_in_s = 60*60*24
start = row.start_in_s_since_epoch
time = row.duration_in_s
energy_per_s = row.charged_energy_in_wh / row.duration_in_s
till_midnight_in_s = truncate(pd.to_datetime(start + day_in_s, unit='s'), 'day').timestamp() - start
rest_in_s = time - till_midnight_in_s
data = {'day':truncate(pd.to_datetime(start, unit='s'), 'day'),
'sum_duration_in_s':min(time, till_midnight_in_s),
'sum_charged_energy_in_wh':min(time, till_midnight_in_s) * energy_per_s}
df_tmp = df_tmp.append(data, ignore_index=True)
while rest_in_s > 0:
start += day_in_s
data = {'day':truncate(pd.to_datetime(start, unit='s'), 'day'),
'sum_duration_in_s':min(rest_in_s, day_in_s),
'sum_charged_energy_in_wh':min(rest_in_s, day_in_s) * energy_per_s}
df_tmp = df_tmp.append(data, ignore_index=True)
rest_in_s = rest_in_s - day_in_s
df_ts = df_tmp.groupby(['date']).agg({'sum_charged_energy_in_wh':sum,
'sum_duration_in_s':sum}).sort_values('date')
df_ts = df_ts.asfreq('D', fill_value=0)
Im trying to create a comparison between my predicted and actual values.
Here is my try:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['Op1', 'Op2', 'S2', 'S3', 'S4', 'S7', 'S8', 'S9', 'S11', 'S12','S13', 'S14', 'S15', 'S17', 'S20', 'S21']], df.unit)
predicted = []
actual = []
for i in range(1,len(df.unit.unique())):
xp = df[(df.unit == i) & (df.cycles == len(df[df.unit == i].cycles))]
xa = xp.cycles.values
xp = xp.values[0,2:].reshape(1,-2)
predicted.append(reg.predict(xp))
actual.append(xa)
and to display the dataframe:
data = {'Actual cycles': actual, 'Predicted cycles': predicted }
df_2 = pd.DataFrame(data)
df_2.head()
I will get an output:
Actual cycles Predicted cycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
ignoring the values that are way off, how do I remove the square brackets in the dataframe? and is there a neater way to write my code? Thank you!
print(df_2)
Actualcycles Predictedcycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
df=df_2.apply(lambda x:x.str.strip('[]'))
print(df)
Actualcycles Predictedcycles
0 192 56.7530579842869
1 287 50.76877712361329
2 179 42.72575900074571
3 189 42.876506912637524
4 269 47.40087182743173
Below is a minimal example of your cycles column with brackets:
import pandas as pd
df = pd.DataFrame({
'cycles' : [[192], [287], [179], [189], [269]]
})
This code gets you the column without brackets:
df['cycles'] = df['cycles'].str[0]
The output looks like this:
print(df)
cycles
0 192
1 287
2 179
3 189
4 269
I am trying to build a function which transform a dataframe based on certain conditions but I am getting a Systax Error. I am not sure what I am doing wrong. Any help will be appreciated. Thank you!
import pandas as pd
from datetime import datetime
from datetime import timedelta
df=pd.read_csv('example1.csv')
df.columns =(['dtime','kW'])
df['dtime'] = pd.to_datetime(df['dtime'])
df.head(5)
dtime kW
0 2019-08-27 23:30:00 0.016
1 2019-08-27 23:00:00 0
2 2019-08-27 22:30:00 0.016
3 2019-08-27 22:00:00 0.016
4 2019-08-27 21:30:00 0
def transdf(df):
a=df.loc[0,'dtime']
b=df.loc[1,'dtime']
c=a-b
minutes = c.total_seconds() / 60
d=int(minutes)
#d can be only 15 ,30 or 60
if d==15:
return df=df.set_index('dtime').asfreq('-15T',fill_value='Missing')
elif d==30:
return df=df.set_index('dtime').asfreq('-30T',fill_value='Missing')
elif d==60:
return df=df.set_index('dtime').asfreq('-60T',fill_value='Missing')
else:
return None
first. It is more efficient to have the return statement after the else at the end of your code. Inside each of the cases just update the value for df. Return is part of your function, not the if statement that's why you are getting errors.
def transform(df):
a = df.loc[0, 'dtime']
b = df.loc[1, 'dtime']
c = a - b
minutes = c.total_seconds() / 60
d=int(minutes)
#d can be only 15 ,30 or 60
if d==15:
df= df.set_index('dtime').asfreq('-15T',fill_value='Missing')
elif d==30:
df= df.set_index('dtime').asfreq('-30T',fill_value='Missing')
elif d==60:
df= df.set_index('dtime').asfreq('-60T',fill_value='Missing')
else:
None
return dfere
I am using the following code to create a few new aggregated columns based on the column version. Then merged the 4 new data frames.
new_df = df[['version','duration']].groupby('version').mean().rename(columns=lambda x: ('mean_' + x)).reset_index().fillna(0)
new_df1 = df[['version','duration']].groupby('version').std().rename(columns=lambda x: ('std_' + x)).reset_index().fillna(0)
new_df2 = df[['version','ts']].groupby('version').min().rename(columns=lambda x: ('min_' + x)).reset_index().fillna(0)
new_df3 = df[['version','ts']].groupby('version').max().rename(columns=lambda x: ('max_' + x)).reset_index().fillna(0)
new_df3
import pandas
df_a = pandas.merge(new_df,new_df1, on = 'version')
df_b = pandas.merge(df_a,new_df2, on = 'version')
df_c = pandas.merge(df_b,new_df3, on = 'version')
df_c
The output looks like below:
version mean_duration std_duration min_ts max_ts
0 1400422 451 1 2018-02-28 09:42:15 2018-02-28 09:42:15
1 7626065 426 601 2018-01-25 11:01:58 2018-01-25 11:15:22
2 7689209 658 473 2018-01-30 11:09:31 2018-02-01 05:19:23
3 7702304 711 80 2018-01-30 17:49:18 2018-01-31 12:27:20
The code works fine, but I am wondering is there a more elegant/clean way to do this? Thank you!
Using functools reduce modify your result (merge)
import functools
l=[new_df1,new_df3,new_df3]
functools.reduce(lambda left,right: pd.merge(left,right,on=['version']), l)
Or let us using agg recreate what you need
s=df.groupby('version').agg({'duration':['mean','std'],'ts':['min','max']}).reset_index()
s.columns=s.columns.map('_'.join)