I want to separate a dataset in the following fashion:
import pandas as pd
import numpy as np
df = pd.read_csv("https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv")
sepal_length = df["sepal_length"]
sepal_length
0 5.1
1 4.9
2 4.7
3 4.6
4 5.0
...
145 6.7
146 6.3
147 6.5
148 6.2
149 5.9
Name: sepal_length, Length: 150, dtype: float64
I would like to create another dataset to try to predict those values, based on the 10 previous observations for instance (suppose this dataset is ordered and date-dependent).
So for my predictors, I would like another dataset holding the 10 previous values for each index, that is:
10 x0 x1 x2 x3 x4 x5 x6 x7 x8 x9
11 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
...
where $x_i$ is the sepal length at the i-th index.
This does what you want:
for i in range(1, 11):
    df[f'feature_{i}'] = df['sepal_length'].shift(i)
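If you then want a model-ready frame, a minimal follow-up sketch (assuming you simply drop the rows the shifts left incomplete; lag_cols, model_df, X and y are hypothetical names) would be:
# the first 10 rows contain NaNs introduced by the shifts; drop them to
# obtain an aligned predictor matrix X and target y
lag_cols = [f'feature_{i}' for i in range(1, 11)]
model_df = df.dropna(subset=lag_cols)
X = model_df[lag_cols]           # the 10 previous observations per row
y = model_df['sepal_length']     # the value to predict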
Assuming we have dataset df (which can be downloaded from this link), I want to create some features based on the average growth rate of y for the same month over the past several years, for example: y_agr_last2, y_agr_last3, y_agr_last4, etc.
The formula is:
$$ y\_agr\_lastN = \Big(\prod_{k=1}^{N}\big(1 + y_{t-12k}/100\big)\Big)^{1/N} - 1 $$
where $y_{t-12k}$ is the y value of the same month $k$ years earlier.
For example, for September 2022, y_agr_last2 = ((1 + 3.85/100)*(1 + 1.81/100))^(1/2) -1, y_agr_last3 = ((1 + 3.85/100)*(1 + 1.81/100)*(1 + 1.6/100))^(1/3) -1.
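As a quick sanity check of that arithmetic (plain Python, using only the numbers quoted above; the results match the last row of the output further below up to rounding):
# geometric mean of the yearly growth factors, minus 1
y_agr_last2 = ((1 + 3.85/100) * (1 + 1.81/100)) ** (1/2) - 1
y_agr_last3 = ((1 + 3.85/100) * (1 + 1.81/100) * (1 + 1.6/100)) ** (1/3) - 1
print(round(y_agr_last2, 6), round(y_agr_last3, 6))  # ~0.028249 ~0.024153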
The code I use is as follows; it is repetitive and tedious:
df['y_shift12'] = df['y'].shift(12)
df['y_shift24'] = df['y'].shift(24)
df['y_shift36'] = df['y'].shift(36)
df['y_agr_last2'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100)), 1/2) -1
df['y_agr_last3'] = pow(((1+df['y_shift12']/100) * (1+df['y_shift24']/100) * (1+df['y_shift36']/100)), 1/3) -1
df.drop(['y_shift12', 'y_shift24', 'y_shift36'], axis=1, inplace=True)
df
How can the desired result be achieved more concisely?
References:
Create some features based on the mean of y for the month over the past few years
Following is one way to generalise it:
import functools
import operator

num_yrs = 3
for n in range(1, num_yrs + 1):
    df[f"y_shift{n*12}"] = df["y"].shift(n * 12)
    df[f"y_agr_last{n}"] = pow(functools.reduce(operator.mul, [1 + df[f"y_shift{i*12}"] / 100 for i in range(1, n + 1)], 1), 1 / n) - 1
df = df.drop(["y_agr_last1"] + [f"y_shift{n*12}" for n in range(1, num_yrs + 1)], axis=1)
Output:
date y x1 x2 y_agr_last2 y_agr_last3
0 2018/1/31 -13.80 1.943216 3.135839 NaN NaN
1 2018/2/28 -14.50 0.732108 0.375121 NaN NaN
...
22 2019/11/30 4.00 -0.273262 -0.021146 NaN NaN
23 2019/12/31 7.60 1.538851 1.903968 NaN NaN
24 2020/1/31 -11.34 2.858537 3.268478 -0.077615 NaN
25 2020/2/29 -34.20 -1.246915 -0.883807 -0.249940 NaN
26 2020/3/31 46.50 -4.213756 -4.670146 0.221816 NaN
...
33 2020/10/31 -1.00 1.967062 1.860070 -0.035569 NaN
34 2020/11/30 12.99 2.302166 2.092842 0.041998 NaN
35 2020/12/31 5.54 3.814303 5.611199 0.030017 NaN
36 2021/1/31 -6.41 4.205601 4.948924 -0.064546 -0.089701
37 2021/2/28 -22.38 4.185913 3.569100 -0.342000 -0.281975
38 2021/3/31 17.64 5.370519 3.130884 0.465000 0.298025
...
54 2022/7/31 0.80 -6.259455 -6.716896 0.057217 0.052793
55 2022/8/31 -5.30 1.302754 1.412277 0.015121 -0.000492
56 2022/9/30 NaN -2.876968 -3.785964 0.028249 0.024150
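If you would rather avoid the helper columns entirely, here is a minimal alternative sketch (assuming the same 12-month spacing and that num_yrs is defined as above) that builds the product from shifted growth factors:
# growth factor observed one year back
factors = 1 + df['y'].shift(12) / 100
for n in range(2, num_yrs + 1):
    # multiply the last n yearly factors (12-month spacing), take the n-th
    # root and subtract 1; skipna=False keeps NaN where history is too short
    prod = pd.concat([factors.shift(12 * k) for k in range(n)], axis=1) \
             .prod(axis=1, skipna=False)
    df[f'y_agr_last{n}'] = prod ** (1 / n) - 1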
I have the following dataframe:
mydata
Out[117]:
author email ri oi
0 X1 NaN NaN 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com NaN 0000-0001-8437-498X
4 X5 ab#ma.com NaN 0000-0001-8437-498X
where column ri represents an author's ResearcherID, and oi the ORCID. One author may have more than one email address, so column email has duplicates.
Firstly, I'm trying to fill the NaNs in ri with a non-NaN ri value from rows that share the same oi. The result I want is:
author email ri oi
0 X1 NaN K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com K-5448-2012 0000-0001-8437-498X
Secondly, I want to merge the emails and use the merged value to fill the NaNs in column email where the values in ri (or oi) are identical. I want to get a dataframe like the following one:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
I've tried the following code:
final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
    fill_df = mydata[mydata.oi == i]
    fill_df.ri = fill_df.ri.fillna(method='ffill')
    fill_df.ri = fill_df.ri.fillna(method='bfill')
    final_df = pd.concat([final_df, fill_df])
final_df = pd.concat([final_df, na_df])
This code returns what I want for the first step, but is there a more elegant way to approach this? Furthermore, how can I get the merged value in email and then use it as an input when filling the NaNs?
Try two transforms, one for each column: on ri use 'first'; on email use a combination of dropna, unique, and join:
g = mydata.dropna(subset=['oi']).groupby('oi')
mydata['ri'] = g.ri.transform('first')
mydata['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))
Out[79]:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
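One caveat: the assignment aligns on the index, so rows whose oi is NaN (like X2) would get NaN in ri and email regardless of what they held before. That is harmless in this sample, but a defensive sketch using combine_first to keep pre-existing values could be:
# keep original values on rows that fall outside the 'oi' groups
g = mydata.dropna(subset=['oi']).groupby('oi')
mydata['ri'] = g.ri.transform('first').combine_first(mydata['ri'])
mydata['email'] = (g.email.transform(lambda x: ';'.join(x.dropna().unique()))
                          .combine_first(mydata['email']))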
I have a dataframe as below:
import pandas as pd
import numpy as np

a = {'b': ['category', 'categorical', 'cater pillar', 'coming and going', 'bat', 'No Data', 'calling', 'cal'],
     'c': ['strd1', 'strd2', 'strd3', 'strd4', 'strd5', 'strd6', 'strd7', 'strd8']}
df11 = pd.DataFrame(a, index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
I want to replace each value in column b whose string length is three with NaN.
I expect results to be like:
b c
category strd1
categorical strd2
cater pillar strd3
coming and going strd4
NaN strd5
No Data strd6
calling strd7
NaN strd8
Use Series.str.len() to get the length of each string in the series, compare it with Series.eq(3), and then with df.loc[] assign np.nan to b where the condition matches:
df11.loc[df11.b.str.len().eq(3), 'b'] = np.nan
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
Use str.len to get the length of each string and then conditionally replace it with NaN using np.where if the length equals 3:
df11['b'] = np.where(df11['b'].str.len().eq(3), np.nan, df11['b'])
b c
0 category strd1
1 categorical strd2
2 cater pillar strd3
3 coming and going strd4
4 NaN strd5
5 No Data strd6
6 calling strd7
7 NaN strd8
Maybe check mask:
df11['b'] = df11['b'].mask(df11['b'].str.len() <= 3)
df11
Out[16]:
b c
x1 category strd1
x2 categorical strd2
x3 cater pillar strd3
x4 coming and going strd4
x5 NaN strd5
x6 No Data strd6
x7 calling strd7
x8 NaN strd8
You could use a where conditional:
df11['b'] = df11['b'].where(df11.b.map(len) != 3, np.nan)
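Since where fills non-matching positions with NaN by default, the np.nan argument can be dropped; using str.len() instead of map(len) also tolerates missing values:
# equivalent: the 'other' argument of where defaults to NaN
df11['b'] = df11['b'].where(df11['b'].str.len() != 3)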
Something like:
for i, ele in enumerate(df11['b']):
    if len(ele) == 3:
        df11.loc[df11.index[i], 'b'] = np.nan
I am concatenating two dataframes along axis=1 (columns) and trying to use "keys" to later be able to distinguish between the columns of the two dataframes that have the same name.
df1 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'dd': ['z1', 'z2', 'z3', 'z4']},
                   index=['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'])
df2 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
                    'ff': ['y1', 'y2', 'y3', 'y4'],
                    'rf': ['z1', 'z2', 'z3', 'z4']},
                   index=['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'])
data = pd.concat([df1, df2], keys=['snow', 'wind'], axis=1, ignore_index=True)
However, when trying to print all the columns belonging to one of the keys, as suggested by @YashTD in Pandas add keys while concatenating dataframes at column level:
print(data.snow.tl)
I get the following error message:
AttributeError: 'DataFrame' object has no attribute 'snow'
I think the keys are just not being added to the dataframe, but I don't know why. They also don't show up when printing the dataframe head(), as they should according to
Pandas add keys while concatenating dataframes at column level
Do you know how to add the keys to the dataframe?
First remove the parameter ignore_index=True so that the keys form a MultiIndex in the columns, and then select by tuple:
data = pd.concat([df1, df2], keys=['snow', 'wind'], axis=1)
print (data)
snow wind
tl ff dd tl ff rf
2016-01-01 x1 y1 z1 x1 y1 z1
2016-01-02 x2 y2 z2 x2 y2 z2
2016-01-03 x3 y3 z3 x3 y3 z3
2016-01-04 x4 y4 z4 x4 y4 z4
print (data[('snow','tl')])
2016-01-01 x1
2016-01-02 x2
2016-01-03 x3
2016-01-04 x4
Name: (snow, tl), dtype: object
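To pull out every column under one key at once (a level-0 slice of the column MultiIndex), either of these should also work:
print(data['snow'])                       # all columns under the 'snow' key
print(data.xs('wind', axis=1, level=0))   # the same idea via xs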
I have a dataset which looks like this:
time channel min sd mag. frequency
12:00 X 12.0 2.3 x11 fx11
12:00 X 12.0 2.3 x12 fx12
12:00 X 12.0 2.3 x13 fx13
12:00 X 12.0 2.3 x14 fx14
12:00 X 12.0 2.3 x15 fx15
12:00 Y 17.0 2.7 y11 fy11
12:00 Y 17.0 2.7 y12 fy12
12:00 Y 17.0 2.7 y13 fy13
12:00 Y 17.0 2.7 y14 fy14
12:00 Y 17.0 2.7 y15 fy15
12:00 Z 15.0 4.3 z11 fz11
12:00 Z 15.0 4.3 z12 fz12
12:00 Z 15.0 4.3 z13 fz13
12:00 Z 15.0 4.3 z14 fz14
12:00 Z 15.0 4.3 z15 fz15
12:01 X 13.0 4.9 x21 fx21
.... ... ... ... ... .....
.... ..... .... ... .... ..... ....
As you can see, for channels X, Y and Z the time, min and sd entries repeat 5 times, while mag. and frequency change in every row. The shape of this dataset is (740231, 6), and these 15 rows for channels X, Y, Z keep repeating as described above.
I would like to get rid of this repetition and would like to transform this dataset like this:
time channel min sd m1 f1 m2 f2 m3 f3 m4 f4 m5 f5
12:00 X 12.0 2.3 x11 fx11 x12 fx12 x13 fx13 x14 fx14 x15 fx15
12:00 Y 17.0 2.7 y11 fy11 y12 fy12 y13 fy13 y14 fy14 y15 fy15
12:00 Z 15.0 4.3 z11 fz11 z12 fz12 z13 fz13 z14 fz14 z15 fz15
12:01 X 13.0 4.9 x21 fx21 x22 fx22 x23 fx23 x24 fx24 x25 fx25
.... ... ..... ... .... ..... .... ..... .... .... ....
.... ..... .... .... .... ... .... ..... .... .... ... ... ...
which means that 15 rows x 6 columns are now transformed into 3 rows x 14 columns.
Any suggestions are appreciated. Many thanks for your time.
If the ordering of the output columns may be swapped (first the f and then the m columns):
cols = ['time','channel','min', 'sd']
d = {'frequency':'f','mag.':'m'}
g = df.groupby(cols).cumcount().add(1).astype(str)
df = df.rename(columns=d).set_index(cols + [g]).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map(''.join)
df = df.reset_index()
print (df)
time channel min sd f1 m1 f2 m2 f3 m3 f4 m4 f5 \
0 12:00 X 12.0 2.3 fx11 x11 fx12 x12 fx13 x13 fx14 x14 fx15
1 12:00 Y 17.0 2.7 fy11 y11 fy12 y12 fy13 y13 fy14 y14 fy15
2 12:00 Z 15.0 4.3 fz11 z11 fz12 z12 fz13 z13 fz14 z14 fz15
3 12:01 X 13.0 4.9 fx21 x21 NaN NaN NaN NaN NaN NaN NaN
m5
0 x15
1 y15
2 z15
3 NaN
Explanation:
First rename columns by the dictionary
Then set_index by the counter Series created by cumcount, with 1 added and converted to strings
Reshape by unstack
Sort the second level of the MultiIndex by sort_index
Flatten the MultiIndex columns by map and join
Last, reset_index to turn the index back into columns
If the ordering of the output columns is important, it is possible to use a double rename of columns:
cols = ['time','channel','min', 'sd']
d = {'frequency':2,'mag.':1}
g = df.groupby(cols).cumcount().add(1).astype(str)
df = (df.rename(columns=d)
        .set_index(cols + [g])
        .unstack()
        .sort_index(axis=1, level=1)
        .rename(columns={2: 'f', 1: 'm'}))
df.columns = df.columns.map(''.join)
df = df.reset_index()
print (df)
time channel min sd m1 f1 m2 f2 m3 f3 m4 f4 m5 \
0 12:00 X 12.0 2.3 x11 fx11 x12 fx12 x13 fx13 x14 fx14 x15
1 12:00 Y 17.0 2.7 y11 fy11 y12 fy12 y13 fy13 y14 fy14 y15
2 12:00 Z 15.0 4.3 z11 fz11 z12 fz12 z13 fz13 z14 fz14 z15
3 12:01 X 13.0 4.9 x21 fx21 NaN NaN NaN NaN NaN NaN NaN
f5
0 fx15
1 fy15
2 fz15
3 NaN
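For comparison, a roughly equivalent sketch with pivot (assuming pandas >= 1.1, where index and values accept lists) reaches the same wide shape; the m/f ordering can be tuned as in the answers above:
cols = ['time', 'channel', 'min', 'sd']
d = {'mag.': 'm', 'frequency': 'f'}
out = (df.rename(columns=d)
         .assign(n=lambda x: x.groupby(cols).cumcount().add(1).astype(str))
         .pivot(index=cols, columns='n', values=['m', 'f'])
         .sort_index(axis=1, level=1))          # interleave the ...1, ...2 pairs
out.columns = out.columns.map(''.join)          # e.g. ('m', '1') -> 'm1'
out = out.reset_index()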