Working with result of pandas.pivot_table - python-3.x

I am having trouble using reshaped data with pandas. Imagine I have a dataframe in long format like:
town year type var1 var2
a 2010 a 100 200
b 2010 a 100 200
c 2010 a 100 200
a 2011 a 100 200
b 2011 a 100 200
c 2011 a 100 200
a 2010 b 100 200
b 2010 b 100 200
c 2010 b 100 200
a 2011 b 100 200
b 2011 b 100 200
c 2011 b 100 200
I then reshape it into wide format like so:
df = pd.pivot_table(df, index="town", columns=["year", "type"], values=["var1", "var2"])
var1 var2
year 2010 2011 2010 2011
type a b a b a b a b
town
a 100 200 100 200 100 200 100 200
b 100 200 100 200 100 200 100 200
c 100 200 100 200 100 200 100 200
How do I then access the resulting dataframe? For instance if I wanted to get data for all the towns, but only for the year 2010 and type b? I have tried using df.query but that results in a buffer type mismatch. I have tried using:
df[df["year"] == 2010]
But that results in a key error. Any help would be gratefully received. Thanks

Use slicers:
idx = pd.IndexSlice
df = df.loc[:, idx[:, 2010, 'b']]
print (df)
var1 var2
year 2010 2010
type b b
town
a 100 200
b 100 200
c 100 200
Or DataFrame.xs:
df = df.xs((2010, 'b'), axis=1, level=[1,2])
print (df)
var1 var2
town
a 100 200
b 100 200
c 100 200
Another solution filters with Index.get_level_values and chains the two boolean masks with & (bitwise AND). Because the masks select columns, DataFrame.loc is needed (the first : means all rows):
m1 = df.columns.get_level_values('year') == 2010
m2 = df.columns.get_level_values('type') == 'b'
df = df.loc[:, m1 & m2]
print (df)
var1 var2
year 2010 2010
type b b
town
a 100 200
b 100 200
c 100 200
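To try the three selections side by side, here is a minimal self-contained sketch. It rebuilds the question's data inline; the names long_df, wide, by_slice, by_xs and by_mask are only illustrative and do not come from the original posts.

```python
import pandas as pd

# Rebuild the long-format frame from the question (var1=100, var2=200 everywhere).
rows = [
    (town, year, typ, 100, 200)
    for typ in ("a", "b")
    for year in (2010, 2011)
    for town in ("a", "b", "c")
]
long_df = pd.DataFrame(rows, columns=["town", "year", "type", "var1", "var2"])

wide = pd.pivot_table(
    long_df, index="town", columns=["year", "type"], values=["var1", "var2"]
)

# 1) IndexSlice: keeps all three column levels in the result.
idx = pd.IndexSlice
by_slice = wide.loc[:, idx[:, 2010, "b"]]

# 2) xs: drops the matched levels, leaving plain var1/var2 columns.
by_xs = wide.xs((2010, "b"), axis=1, level=["year", "type"])

# 3) Boolean masks built from the column levels.
m1 = wide.columns.get_level_values("year") == 2010
m2 = wide.columns.get_level_values("type") == "b"
by_mask = wide.loc[:, m1 & m2]

print(by_xs)
```

The slicer and the mask return the same two columns with the full MultiIndex, while xs returns them with a flat column index, which is usually the more convenient shape for further work.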

import pandas as pd
df = pd.read_csv('test.csv')
# select the numeric columns explicitly so the text column "town" is not summed
df1 = df.groupby(['year', 'type'])[['var1', 'var2']].sum()
df1
Once df holds the table, you can just use groupby, which I think is easier. Note that this sums over the towns rather than keeping one row per town.
What I get is:
var1 var2
year type
2010 a 300 600
b 300 600
2011 a 300 600
b 300 600
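A minimal sketch of how one (year, type) pair could then be read from the grouped result, assuming the same data as in the question built inline instead of read from test.csv. The name picked is only illustrative.

```python
import pandas as pd

rows = [
    (town, year, typ, 100, 200)
    for typ in ("a", "b")
    for year in (2010, 2011)
    for town in ("a", "b", "c")
]
df = pd.DataFrame(rows, columns=["town", "year", "type", "var1", "var2"])

# Sum var1/var2 over all towns for each (year, type) pair.
df1 = df.groupby(["year", "type"])[["var1", "var2"]].sum()

# The row index is now a MultiIndex, so a tuple selects one pair.
picked = df1.loc[(2010, "b")]
print(picked)
```

This gives totals across towns, so it only fits when per-town values are not needed.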

Related

Pandas: Add the value based on certain conditions

I'm new to Pandas. I have a data frame that looks something like this.
Name  Storage Location  Total Quantity
a     S1                100
a     S2                200
a     S3                300
a     S4                110
a     S5                200
b     S1                200
b     S2                300
b     S4                400
b     S5                150
c     S1                400
c     S5                500
I want to sum "Total Quantity" grouped by Name, counting only the storage locations S1, S2 and S3.
Name  Total Quantity
a     600
b     500
c     400
My desired output would be something like the above.
I'd appreciate your help. Thank you in advance!
You could use where to replace the unwanted Locations with NaN and use groupby + sum (since sum skips NaN by default):
out = df.where(df['Storage Location'].isin(['S1','S2','S3'])).groupby('Name', as_index=False)['Total Quantity'].sum()
Output:
Name Total Quantity
0 a 600.0
1 b 500.0
2 c 400.0
Use:
In [2378]: out = df[df['Storage Location'].isin(['S1', 'S2', 'S3'])].groupby('Name')['Total Quantity'].sum().reset_index()
In [2379]: out
Out[2379]:
Name Total Quantity
0 a 600
1 b 500
2 c 400
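For comparison, a minimal self-contained sketch of the filtering approach next to a variant that masks only the quantity column. The frame is rebuilt from the question's table; wanted, out_filter and out_keep_all are illustrative names, and the second variant is an addition rather than something from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("aaaaabbbbcc"),
    "Storage Location": ["S1", "S2", "S3", "S4", "S5",
                         "S1", "S2", "S4", "S5",
                         "S1", "S5"],
    "Total Quantity": [100, 200, 300, 110, 200, 200, 300, 400, 150, 400, 500],
})

wanted = df["Storage Location"].isin(["S1", "S2", "S3"])

# Drop the unwanted rows first, then sum per name (integer result).
out_filter = df[wanted].groupby("Name", as_index=False)["Total Quantity"].sum()

# Mask only the quantity column: every name stays in the grouping, and the
# NaN values introduced by where turn the sums into floats.
out_keep_all = (
    df["Total Quantity"].where(wanted).groupby(df["Name"]).sum().reset_index()
)

print(out_filter)
print(out_keep_all)
```

With this data both give the same totals. They differ only when some name has no rows in S1 to S3: the first version drops that name, the second reports it with 0.0.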

Mean imputation based on certain Conditions

I have the below dataframe,
Category Value
A 100
A -
B -
C 50
D 200
D 400
D -
As you can see, some values are the hyphen symbol '-'. I want to replace those hyphens with the mean of the corresponding category.
In the example, there are two entries for "A": one row with value 100 and the other with a hyphen, so the mean is 100 itself. For B, since there are no valid values, the mean is that of the entire column, which is (100+50+200+400)/4 = 187.5. For C, nothing changes, and for D the hyphen is replaced by 300 (same logic as for "A").
Output:
Category Value
A 100
A 100
B 187.5
C 50
D 200
D 400
D 300
Try:
import numpy as np
import pandas as pd

df = df.replace("-", np.nan)
df["Value"] = pd.to_numeric(df["Value"])
# overall mean, used for categories that have no valid value at all
avg = df["Value"].mean()
df["Value"] = df["Value"].fillna(
    df.groupby("Category")["Value"].transform(
        lambda x: avg if x.isna().all() else x.mean()
    )
)
print(df)
Prints:
Category Value
0 A 100.0
1 A 100.0
2 B 187.5
3 C 50.0
4 D 200.0
5 D 400.0
6 D 300.0
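A possible variation on the same idea, as a sketch only: fill with the per-category mean first and then fall back to the overall mean, which avoids the lambda. It assumes the hyphens are stored as text, and group_mean and overall are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B", "C", "D", "D", "D"],
    "Value": ["100", "-", "-", "50", "200", "400", "-"],
})

# Anything that is not a number (here the "-") becomes NaN.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

overall = df["Value"].mean()
# Per-category mean, aligned to the rows; NaN for a category with no valid value.
group_mean = df.groupby("Category")["Value"].transform("mean")

# First try the category mean, then fall back to the overall mean.
df["Value"] = df["Value"].fillna(group_mean).fillna(overall)
print(df)
```

The two fillna calls run in order, so the overall mean is only used where the category mean itself is missing, as for B here.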

How to sum with Null Values in group by statement using agg function in python

I have a dataframe which looks like:
A B C
a 100 200
a NA 100
a 200 NA
a 100 100
b 200 200
b 100 200
b 200 100
b 200 100
I use the aggregate function on column B and column C as:
ag=data.groupby(['A']).agg({'B':'sum','C':'sum'}).reset_index()
Output:
A B C
a NULL NULL
b 700 600
Expected Output:
A B C
a 400 400
b 700 600
How can I modify my aggregate function so that NULL values are ignored?
Maybe you have already thought about this and it is not possible in your case, but you can replace the NA values with 0 in the dataframe before this operation. If you don't want to change the original dataframe, you can apply the replacement to a copy, as below (replace returns a new dataframe).
ag=data.replace(np.nan,0).groupby(['A']).agg({'B':'sum','C':'sum'}).reset_index()
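If the NA entries are actually stored as text, which is an assumption since the question does not show the dtypes, converting the columns to numeric first may be enough, because sum skips NaN by default. A minimal sketch with the question's data rebuilt inline:

```python
import pandas as pd

data = pd.DataFrame({
    "A": list("aaaabbbb"),
    "B": [100, "NA", 200, 100, 200, 100, 200, 200],
    "C": [200, 100, "NA", 100, 200, 200, 100, 100],
})

# Coerce text such as "NA" to NaN so both columns become numeric.
for col in ["B", "C"]:
    data[col] = pd.to_numeric(data[col], errors="coerce")

# sum ignores NaN by default, so no replacement with 0 is needed.
ag = data.groupby("A", as_index=False)[["B", "C"]].sum()
print(ag)
```

The original frame keeps NaN for the missing entries this way, instead of having them overwritten with 0.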

pandas create a flag when merging two dataframes

I have two df - df_a and df_b,
# df_a
number cur
1000 USD
2000 USD
3000 USD
# df_b
number amount deletion
1000 0.0 L
1000 10.0 X
1000 10.0 X
2000 20.0 X
2000 20.0 X
3000 0.0 L
3000 0.0 L
I want to left merge df_a with df_b,
df_a = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df_a.fillna(value={'amount':0}, inplace=True)
but also create a flag called deleted in the result df_a, that has three possible values - full, partial and none;
full - if all rows associated with a particular number value, have deletion = L;
partial - if some rows associated with a particular number value, have deletion = L;
none - no rows associated with a particular number value, have deletion = L;
Also when doing the merge, rows from df_b with deletion = L should not be considered; so the result looks like,
number amount deletion deleted cur
1000 10.0 X partial USD
1000 10.0 X partial USD
2000 20.0 X none USD
2000 20.0 X none USD
3000 0.0 NaN full USD
I am wondering how to achieve that.
The idea is to compare the deletion column with 'L', aggregate per number with any and all,
build a helper dictionary from the results, and finally map it onto the new column:
g = df_b['deletion'].eq('L').groupby(df_b['number'])
m1 = g.any()
m2 = g.all()
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
#join dictionaries together
d = {**d1, **d2}
print (d)
{1000: 'partial', 3000: 'full'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d).fillna('none')
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
If you would rather set none explicitly instead of using fillna, create a dictionary for it as well:
d1 = dict.fromkeys(m1.index[m1 & ~m2], 'partial')
d2 = dict.fromkeys(m2.index[m2], 'full')
d3 = dict.fromkeys(m2.index[~m1], 'none')
d = {**d1, **d2, **d3}
print (d)
{1000: 'partial', 3000: 'full', 2000: 'none'}
df = df_a.merge(df_b.loc[df_b.deletion != 'L'], how='left', on='number')
df['deleted'] = df['number'].map(d)
print (df)
number cur amount deletion deleted
0 1000 USD 10.0 X partial
1 1000 USD 10.0 X partial
2 2000 USD 20.0 X none
3 2000 USD 20.0 X none
4 3000 USD NaN NaN full
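The same flag can also be built as a Series instead of dictionaries. This is a sketch under the question's sample data; any_l, all_l, flags and out are illustrative names.

```python
import pandas as pd

df_a = pd.DataFrame({"number": [1000, 2000, 3000], "cur": ["USD"] * 3})
df_b = pd.DataFrame({
    "number": [1000, 1000, 1000, 2000, 2000, 3000, 3000],
    "amount": [0.0, 10.0, 10.0, 20.0, 20.0, 0.0, 0.0],
    "deletion": ["L", "X", "X", "X", "X", "L", "L"],
})

# Per number: does any row / do all rows have deletion == 'L'?
grouped = df_b["deletion"].eq("L").groupby(df_b["number"])
any_l = grouped.any()
all_l = grouped.all()

# Start from 'none', then overwrite in order of increasing strictness.
flags = pd.Series("none", index=any_l.index)
flags[any_l] = "partial"
flags[all_l] = "full"

# Merge without the 'L' rows, then attach the flag by number.
out = df_a.merge(df_b[df_b["deletion"] != "L"], how="left", on="number")
out["amount"] = out["amount"].fillna(0)
out["deleted"] = out["number"].map(flags).fillna("none")
print(out)
```

The final fillna covers numbers that appear in df_a but not in df_b at all, which the sample data does not contain.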

summing up certain rows in a panda dataframe

I have a pandas dataframe with 1000 rows and 10 columns. I am looking to aggregate rows 100-1000 and replace them with just one row whose index value is '>100' and whose column values are the sums of rows 100-1000 for each column. Any ideas on a simple way of doing this? Thanks in advance
Say I have the below
a b c
0 1 10 100
1 2 20 100
2 3 60 100
3 5 80 100
and I want it replaced with
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
You could use loc (ix worked the same way in older pandas but has since been removed), but it shows SettingWithCopyWarning:
ind = 1
mask = df.index > ind
df1 = df[~mask]
df1.loc['>1', :] = df[mask].sum()
In [69]: df1
Out[69]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
To set it without the warning you could use pd.concat. It may not be elegant because of the two transposes, but it works:
ind = 1
mask = df.index > ind
df1 = pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
df1.index = df1.index.tolist()[:-1] + ['>{}'.format(ind)]
In [36]: df1
Out[36]:
a b c
0 1 10 100
1 2 20 100
>1 8 140 200
Some demonstrations:
In [37]: df.index > ind
Out[37]: array([False, False, True, True], dtype=bool)
In [38]: df[mask].sum()
Out[38]:
a 8
b 140
c 200
dtype: int64
In [40]: pd.concat([df[~mask].T, df[mask].sum()], axis=1).T
Out[40]:
a b c
0 1 10 100
1 2 20 100
0 8 140 200
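A variant without the double transpose, as a sketch: turn the summed Series into a one-row frame that already carries the new label, then concatenate along the rows. The names label and tail are illustrative, and the sample frame is the small one from the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 5], "b": [10, 20, 60, 80], "c": [100] * 4})

ind = 1
mask = df.index > ind
label = ">{}".format(ind)

# Sum the rows past the cutoff and make a one-row frame indexed by the label.
tail = df[mask].sum().to_frame(label).T

# Stack the untouched rows on top of the aggregated row.
df1 = pd.concat([df[~mask], tail])
print(df1)
```

Because concat builds a new frame, nothing is assigned into a slice of df, so there is no SettingWithCopyWarning to work around.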
