I would like to concatenate two dataframes, and I can do so.
However, while doing so the date format changes automatically, which is unintended and needs to be resolved. I have a column called EVENT_DATE in 'YYYY-MM-DD' format, but it is being changed.
Here I load a sample of pipe-delimited data into a dataframe:
>>>df1 = pd.read_csv('detail_trend_analysis_data.csv',delimiter='|', parse_dates=[0])
>>>df1.head()
EVENT_DATE EVENT_HOUR PRODUCT ... BONUS_VOLUME BONUS_COST RECORD_COUNT
0 2019-11-08 0 1 ... 0.0 220152.426342 287516
1 2019-11-08 0 1 ... 0.0 0.000000 3104
2 2019-11-08 0 1 ... 0.0 226544.777596 254965
3 2019-11-08 0 1 ... 0.0 0.000000 2449
4 2019-11-08 0 1 ... 0.0 0.000000 35085
[5 rows x 18 columns]
Doing the same thing:
>>>df2 = pd.read_csv('detail_trend_analysis_data.csv',delimiter='|', parse_dates=[0])
Changing the date
>>>df2['EVENT_DATE']='2019-11-09'
>>>df2.head()
EVENT_DATE EVENT_HOUR PRODUCT ... BONUS_VOLUME BONUS_COST RECORD_COUNT
0 2019-11-09 0 1 ... 0.0 220152.426342 287516
1 2019-11-09 0 1 ... 0.0 0.000000 3104
2 2019-11-09 0 1 ... 0.0 226544.777596 254965
3 2019-11-09 0 1 ... 0.0 0.000000 2449
4 2019-11-09 0 1 ... 0.0 0.000000 35085
[5 rows x 18 columns]
Concatenating
>>>frames=[df1,df2]
>>>df=pd.concat(frames)
>>>df.head()
EVENT_DATE EVENT_HOUR ... BONUS_COST RECORD_COUNT
0 2019-11-08 00:00:00 0 ... 220152.426342 287516
1 2019-11-08 00:00:00 0 ... 0.000000 3104
2 2019-11-08 00:00:00 0 ... 226544.777596 254965
3 2019-11-08 00:00:00 0 ... 0.000000 2449
4 2019-11-08 00:00:00 0 ... 0.000000 35085
[5 rows x 18 columns]
But after the concatenation the date format changes to 'YYYY-MM-DD HH24:MI:SS', which I don't want.
How can I resolve this?
What if you set df['EVENT_DATE'] = df['EVENT_DATE'].dt.date on both dataframes?
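A minimal sketch of that suggestion, assuming df1 and df2 are loaded as above (note that df2['EVENT_DATE'] was overwritten with a plain string, so it is parsed back to datetime first):
import pandas as pd

# df1['EVENT_DATE'] is already datetime64 thanks to parse_dates=[0]
df1['EVENT_DATE'] = df1['EVENT_DATE'].dt.date
# df2['EVENT_DATE'] is a plain string after the assignment above, so convert it first
df2['EVENT_DATE'] = pd.to_datetime(df2['EVENT_DATE']).dt.date

df = pd.concat([df1, df2])
# EVENT_DATE now holds datetime.date objects and prints as 'YYYY-MM-DD' with no time part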
Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace all the values in the _rank columns with NaN if the corresponding flag is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
Which is fairly simple. This is my approach:
for k in variables:
    dt[k+'_rank'] = np.where(dt[k+'_flag']==0, np.nan, dt[k+'_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time to process a dataframe with a very high number of columns and entries. So is there an optimized way of achieving this without iteration?
P.S. There are other columns apart from the _rank and _flag ones in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag. Then use rstrip to strip the flag label and append the rank label, which gives the corresponding _rank column names. Finally, use np.where to fill NaN into the _rank columns wherever the corresponding value in the flag columns is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.rstrip('flag') + 'rank'
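# note: rstrip('flag') strips any trailing characters from the set {'f','l','a','g'}; the '_' in '_flag' stops it, so e.g. 'a_flag' -> 'a_' -> 'a_rank'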
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
OR, it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
I have a dataset as below.
building_id meter meter_reading primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed hour day weekend month
0 0 0 NaN 0 7432 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
1 1 0 NaN 0 2720 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
2 2 0 NaN 0 5376 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
3 3 0 NaN 0 23685 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
4 4 0 NaN 0 116607 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
You can see that the values under meter_reading are NaN, and I would like to fill them with that column's mean grouped by the "primary_use" and "square_feet" columns. Which API could I use to achieve this? I am currently using scikit-learn's imputer.
Thanks, your help is highly appreciated.
If you use a pandas DataFrame, it already brings everything you need.
Note that primary_use is a categorical feature while square_feet is continuous. So first you would want to split square_feet into categories (bins), so that you can calculate the mean meter_reading per group; see the sketch below.
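A minimal sketch of that idea, assuming the data is in a DataFrame called df (the bin count and the use of transform are assumptions, not from the original answer):
import pandas as pd

# bin the continuous square_feet column so it can serve as a grouping key
sqft_bins = pd.cut(df['square_feet'], bins=10)

# mean meter_reading per (primary_use, square_feet bin) group, aligned back to each row
group_mean = df.groupby(['primary_use', sqft_bins])['meter_reading'].transform('mean')

# fill the missing readings with their group mean
df['meter_reading'] = df['meter_reading'].fillna(group_mean)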
I have a dataframe like this
import pandas as pd
raw_data = {'ID':['101','101','101','101','101','102','102','103'],
            'Week':['W01','W02','W03','W07','W08','W01','W02','W01'],
            'Orders':[15,15,10,15,15,5,10,10]}
df2 = pd.DataFrame(raw_data, columns = ['ID','Week','Orders'])
I want row-by-row percentage changes within each group.
How can I achieve this?
Using pct_change:
df2['New'] = df2.groupby('ID').Orders.pct_change().add(1).fillna(0)
I find it weird that in my pandas version pct_change cannot be applied to a groupby object, so we have to do it as follows:
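The lines below flatten a list l of per-group pct_change results; how l is built is not shown here, but a minimal sketch (an assumption) would be:
l = [g.pct_change().tolist() for _, g in df2.groupby('ID').Orders]  # assumed: one pct_change list per ID group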
df2['New']=sum(l,[])
df2.New=(df2.New+1).fillna(0)
df2
Out[606]:
ID Week Orders New
0 101 W01 15 0.000000
1 101 W02 15 1.000000
2 101 W03 10 0.666667
3 101 W07 15 1.500000
4 101 W08 15 1.000000
5 102 W01 5 0.000000
6 102 W02 10 2.000000
7 103 W01 10 0.000000
Carry out a window operation, shifting the value by one position:
df2['prev']=df2.groupby(by='ID').Orders.shift(1).fillna(0)
Calculate the % change row by row using apply():
df2['pct'] = df2.apply(lambda x : ((x['Orders'] - x['prev']) / x['prev']) if x['prev'] != 0 else 0,axis=1)
I am not sure if there is any default pd.pct_change() within a window.
ID Week Orders prev pct
0 101 W01 15 0.0 0.000000
1 101 W02 15 15.0 0.000000
2 101 W03 10 15.0 -0.333333
3 101 W07 15 10.0 0.500000
4 101 W08 15 15.0 0.000000
5 102 W01 5 0.0 0.000000
6 102 W02 10 5.0 1.000000
7 103 W01 10 0.0 0.000000
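As a side note, a grouped pct_change (as in the first answer above) should give the same pct column without apply; a minimal sketch:
df2['pct'] = df2.groupby('ID').Orders.pct_change().fillna(0)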
I have a dataframe, df, as below:
Index DateTimestamp a b c
0 2017-08-03 00:00:00 ta bc tt
1 2017-08-03 00:00:00 re
3 2017-08-03 00:00:00 cv ma
4 2017-08-04 00:00:00
5 2017-09-04 00:00:00 cv
: : : : :
: : : : :
I want to group by day and count the values in each column, not counting the empty values. So the output would be:
Index a b c
2017-08-03 00:00:00 2 2 2
2017-08-04 00:00:00 0 1 0
I have tried this, but it is not what I want:
df2=df.groupby([pd.Grouper(key='DateTimestamp', freq='1D')])['a','b','c'].apply(pd.Series.count)
Use dt.floor or dt.date to remove the time component, together with GroupBy.count, which excludes missing values from the count:
print (df)
Index DateTimestamp a b c
0 0 2017-08-03 00:00:00 ta bc tt
1 1 2017-08-03 00:00:00 re NaN NaN
2 3 2017-08-03 00:00:00 NaN cv ma
3 4 2017-08-04 00:00:00 NaN NaN NaN
4 5 2017-09-04 00:00:00 NaN cv NaN
df2=df.groupby(df['DateTimestamp'].dt.floor('d'))[['a','b','c']].count()
#another solution
#df2=df.groupby(df['DateTimestamp'].dt.date)[['a','b','c']].count()
print (df2)
a b c
DateTimestamp
2017-08-03 2 2 2
2017-08-04 0 0 0
2017-09-04 0 1 0
EDIT:
print (df)
Index DateTimestamp a b c
0 0 2017-08-03 00:00:00 ta bc tt
1 1 2017-08-03 00:00:00 re
2 3 2017-08-03 00:00:00 cv ma
3 4 2017-08-04 00:00:00
4 5 2017-09-04 00:00:00 cv
Or, if the cells are empty strings rather than NaN, or the a, b, c columns may contain numeric values, convert to string and count the non-empty values:
c = ['a','b','c']
df2=df[c].astype(str).ne('').groupby(df['DateTimestamp'].dt.floor('d')).sum().astype(int)
print (df2)
a b c
DateTimestamp
2017-08-03 2 2 2
2017-08-04 0 0 0
2017-09-04 0 1 0
I have a dataframe as below
0 1 2 3 4 5
0 0.428519 0.000000 0.0 0.541096 0.250099 0.345604
1 0.056650 0.000000 0.0 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
3 0.849066 0.559117 0.0 0.374447 0.424247 0.586254
4 0.317644 0.000000 0.0 0.271171 0.586686 0.424560
I would like to modify it as below
0 0 0.428519
0 1 0.000000
0 2 0.0
0 3 0.541096
0 4 0.250099
0 5 0.345604
1 0 0.056650
1 1 0.000000
........
Use stack with reset_index:
df1 = df.stack().reset_index()
df1.columns = ['col1','col2','col3']
print (df1)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560
A NumPy solution with numpy.repeat and numpy.tile; flattening is done with numpy.ravel:
df2 = pd.DataFrame({
    "col1": np.repeat(df.index, len(df.columns)),   # each row index repeated once per column
    "col2": np.tile(df.columns, len(df.index)),     # the column labels repeated for every row
    "col3": df.values.ravel()})                     # the values flattened row by row
print (df2)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560