Pandas: Remove aggfunc np.sum labels and value labels from pivot table - excel

I have created a pandas pivot table, and I am trying to remove the 'sum' and 'LENGTH' rows from the output xlsx.
So far I have tried to remove the two rows upon exporting the pivot table to xlsx.
I have tried to read in the exported pivot table and DataFrame.drop the two rows and re-export.
I am not having much luck. Thanks all in advance!
Link to pic:
http://i.stack.imgur.com/AmjFy.png

You can use droplevel:
df.columns = df.columns.droplevel([0,1])
print(df)
STATUS X Y Z
CODE
A 13.0 6 20
B NaN 472 472
C NaN 105 105
D 13.0 584 598
And then maybe reset_index with rename_axis (new in pandas 0.18.0):
df = df.reset_index().rename_axis(None, axis=1)
print(df)
CODE X Y Z
0 A 13.0 6 20
1 B NaN 472 472
2 C NaN 105 105
3 D 13.0 584 598
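To get this back into Excel without the extra label rows, a minimal sketch (the output file name output.xlsx is just an assumption):
# write the flattened table; index=False keeps the default row numbers out of the sheet
df.to_excel('output.xlsx', index=False)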

Related

Subtract two ECDF time series

Hi, I have an ECDF plot from seaborn, which looks like the following.
I can obtain this by doing sns.ecdfplot(data=df2, x='time', hue='seg_oper', stat='count').
My dataframe is very simple:
In [174]: df2
Out[174]:
time seg_oper
265 18475 1->0:ADD['TX']
2342 78007 0->1:ADD['RX']
2399 78613 1->0:DELETE['TX']
2961 87097 0->1:ADD['RX']
2994 87210 0->1:ADD['RX']
... ... ...
330823 1002281 1->0:DELETE['TX']
331256 1003545 1->0:DELETE['TX']
331629 1004961 1->0:DELETE['TX']
332375 1006663 1->0:DELETE['TX']
333083 1008644 1->0:DELETE['TX']
[834 rows x 2 columns]
How can I subtract series 0->1:ADD['RX'] from 1->0:DELETE['TX']?
I like seaborn because most of this data wrangling is done inside the library, but in this case I need to subtract these two series ...
Thanks.
So the first thing is to obtain what seaborn does, but manually. After that (because I need to) I can subtract one series from the other.
Cumulative Count
First we need to obtain a cumulative count for each series.
In [304]: df2['cum'] = df2.groupby(['seg_oper']).cumcount()
In [305]: df2
Out[305]:
time seg_oper cum
265 18475 1->0:ADD['TX'] 0
2961 87097 0->1:ADD['RX'] 1
2994 87210 0->1:ADD['RX'] 2
... ... ... ...
332375 1006663 1->0:DELETE['TX'] 413
333083 1008644 1->0:DELETE['TX'] 414
Pivot the data
Rearrange the DF.
In [307]: df3 = df2.pivot(index='time', columns='seg_oper',values='cum').reset_index()
In [308]: df3
Out[308]:
seg_oper time 0->1:ADD['RX'] 1->0:ADD['TX'] 1->0:DELETE['TX']
0 18475 NaN 0.0 NaN
1 78007 0.0 NaN NaN
2 78613 NaN NaN 0.0
3 87097 1.0 NaN NaN
4 87210 2.0 NaN NaN
.. ... ... ... ...
828 1002281 NaN NaN 410.0
829 1003545 NaN NaN 411.0
830 1004961 NaN NaN 412.0
831 1006663 NaN NaN 413.0
832 1008644 NaN NaN 414.0
[833 rows x 4 columns]
Fill the gaps
I'm assuming that the NaN values can be forward-filled with the previous value in each column until the next observation.
df3=df3.fillna(method='ffill')
At this point, if you plot df3, you'll obtain the same as doing sns.ecdfplot(df2) with seaborn.
I still want to subtract one series from the other.
df3['diff'] = df3["0->1:ADD['RX']"] - df3["1->0:DELETE['TX']"]
df3.plot(x='time')
The following plot is the result.
PS: I don't understand the negative vote on the question. If someone can explain, I'd appreciate it.

TypeError: '(slice(None, 59, None), slice(None, None, None))' is an invalid key

I have the below table, where I want to remove the rows with NaN values.
date Open ... Real Lower Band Real Upper Band
0 2020-07-08 08:05:00 2.1200 ... NaN NaN
1 2020-07-08 09:00:00 2.1400 ... NaN NaN
2 2020-07-08 09:30:00 2.1800 ... NaN NaN
3 2020-07-08 09:35:00 2.2000 ... NaN NaN
4 2020-07-08 09:40:00 2.1710 ... NaN NaN
5 2020-07-08 09:45:00 2.1550 ... NaN NaN
These NaN values go up to row no. 58.
For this, I wrote the following code, but the above error occurred.
data.drop(data[:59,:],inplace= True)
print(data)
Please help me!
There are many options to choose from:
Drop rows by index label.
df.drop(list(range(59)), axis=0, inplace=True)
Drop if nans in selected columns.
df.dropna(axis=0, subset=['Real Upper Band'], inplace=True)
Select rows to keep by index label slice
df = df.loc[59:, :] # 59 is the label in index, if index was date then replace 59 with corresponding datetime
Select rows to keep by integer index slice (similar to slicing a list)
df = df.iloc[59:, :] # 59 is the 0-index row number, regardless of what index is set on df
Filter with .loc and boolean array returned by .isna()
df = df.loc[~df['Real Upper Band'].isna(), :]
Remember that loc and iloc work with two dimensions when applied to dataframes; it is recommended to use the full slice : to avoid ambiguity and improve performance, according to the docs: https://pandas.pydata.org/docs/user_guide/indexing.html
You want to keep rows from the 59th on, so the shortest code you can run is:
data = data[59:]

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output that I need is to take the quantity difference between two consecutive dates, spread it evenly over 24 hours, and create 23 columns where each hourly difference is added to the previous column, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate with a loop but it's not working; my code is as below:
for i in df.index:
    diff = (df.get_value(i+1,'Quantity') - df.get_value(i,'Quantity'))/24
    for j in range(24):
        df[i,[1+j]] = df.[i,[j]]*(1+diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then we pivot the output:
s=df.set_index('Date').resample('1 H').interpolate()
s=pd.pivot_table(s,index=s.index.date,columns=s.groupby(s.index.date).cumcount(),values=s,aggfunc='mean')
s.columns=s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly.
for loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24   # hourly step between this row and the next
        list_of_values.append(diff)
    else:
        list_of_values.append(0)    # last row has no next row, so no difference
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required.
i.e.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
... and so on, up to df['Hour-23'].
There are other approaches which will work way better.
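For completeness, a minimal sketch that builds all the hourly columns in one loop, assuming the diff column computed above and the Hour-1 ... Hour-23 names from the question:
# each hour adds one more step of the per-hour difference
for h in range(1, 24):
    df[f'Hour-{h}'] = df['Quantity'] + h * df['diff']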

Reshape Pandas dataframe based on values in two columns

In Python, I would like to search through all rows in the dataframe (populated from csv files), with two possible paths. If the 'Group' column for a given row is zero, move that row's data to the next row of a new dataframe, using the 'Channel_1' and 'Data_1' columns. If the 'Group' column for a given row is non-zero, then get all three rows with the same 'Group' value (also identified by the 'Sub_Group' column as 1, 2 or 3) and add them to the next row of the new dataframe.
Code to generate dataframe from csv file:
for name in glob.glob(search_string):
    r_file = pd.read_csv(name)
Current Data Format:
Channel_Num Group Sub_Group Data
1000 1 1 100
1001 1 2 105
1002 1 3 110
1003 0 0 200
1004 2 1 400
1005 2 2 405
1006 2 3 410
1007 0 0 500
Desired Data Format:
Group Channel_1 Data_1 Channel_2 Data_2 Channel_3 Data_3
1 1000 100 1001 105 1002 110
0 1003 200 NaN NaN NaN NaN
2 1004 400 1005 405 1006 410
0 1007 500 NaN NaN NaN NaN
I've tried the GroupBy and pivot_table methods but without success. After the data is in the desired format, there are other calculations that need to be run against the newly organized data, but getting it in the desired format is the key.
This is more like a pivot problem after creating an additional key using diff and cumsum, so I am using pivot_table and then flattening the multi-level columns.
df.loc[df.Sub_Group==0,'Sub_Group']=1
df['newkey']=df.Group.diff().ne(0).cumsum()
s=df.pivot_table(index=['Group','newkey'],columns=['Sub_Group'],values=['Channel_Num','Data'],aggfunc='first').sort_index(level=1,axis=1)
s.columns=s.columns.map('{0[0]}_{0[1]}'.format)
s.reset_index(level=0).sort_index()
Out[25]:
Group Channel_Num_1 Data_1 ... Data_2 Channel_Num_3 Data_3
newkey ...
1 1 1000.0 100.0 ... 105.0 1002.0 110.0
2 0 1003.0 200.0 ... NaN NaN NaN
3 2 1004.0 400.0 ... 405.0 1006.0 410.0
4 0 1007.0 500.0 ... NaN NaN NaN
[4 rows x 7 columns]

Creating sqlite table from csv files with different column names

I have a large number of .csv files that I would like to put in a sqlite database. Most of the files contain the same column names, but there are some files that have extra columns.
The code that I've tried is (altered to be generic):
import os
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.db')
cur = conn.cursor()
os.chdir(dir)
for file in os.listdir(dir):
    df = pd.read_csv(file)
    df.to_sql('X', conn, if_exists='append')
When it encounters a file with a column that is not in table X, I get the error:
OperationalError: table X has no column named ColumnZ
How can I alter my code to append the table with the new column and fill previous rows with NaN?
If all DataFrames can fit into RAM, you can do this:
import glob
files = glob.glob(r'/path/to/csv_files/*.csv')
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
df.to_sql('X', conn, if_exists = 'replace')
Demo:
In [22]: d1
Out[22]:
a b
0 0 1
1 2 3
In [23]: d2
Out[23]:
a b c
0 1 2 3
1 4 5 6
In [24]: d3
Out[24]:
x b
0 11 12
1 13 14
In [25]: pd.concat([d1,d2,d3], ignore_index=True)
Out[25]:
a b c x
0 0.0 1 NaN NaN
1 2.0 3 NaN NaN
2 1.0 2 3.0 NaN
3 4.0 5 6.0 NaN
4 NaN 12 NaN 11.0
5 NaN 14 NaN 13.0
Alternatively, you can keep track of the columns already in the table and, in a loop, check whether a new DF has additional columns; add those columns to the SQLite DB using the SQLite ALTER TABLE statement, as sketched below:
ALTER TABLE tab_name ADD COLUMN ...
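A minimal sketch of that approach, assuming the table X and the conn/cur objects from the question, and a hypothetical csv directory (new columns are added untyped, which SQLite allows; this is illustrative rather than production-ready):
import glob
import pandas as pd

seen_cols = set()
for f in glob.glob(r'/path/to/csv_files/*.csv'):  # hypothetical path
    df = pd.read_csv(f)
    # add any columns the table has not seen yet; existing rows keep NULL (NaN when read back)
    for col in df.columns:
        if seen_cols and col not in seen_cols:
            cur.execute(f'ALTER TABLE X ADD COLUMN "{col}"')
    seen_cols.update(df.columns)
    conn.commit()
    df.to_sql('X', conn, if_exists='append', index=False)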
