How to plot multiple charts using matplotlib from unstacked dataframe with Pandas - python-3.x

This is a sample of the dataset I have, produced with the following piece of code:
ComplaintCity = nyc_df.groupby(['City','Complaint Type']).size().sort_values().unstack()
top5CitiesByComplaints = ComplaintCity[top5Complaints].rename_axis(None, axis=1)
top5CitiesByComplaints
Blocked Driveway Illegal Parking Noise - Street/Sidewalk Noise - Commercial Derelict Vehicle
City
ARVERNE 35.0 58.0 29.0 2.0 27.0
ASTORIA 2734.0 1281.0 500.0 1554.0 363.0
BAYSIDE 377.0 514.0 15.0 40.0 198.0
BELLEROSE 95.0 106.0 13.0 37.0 89.0
BREEZY POINT 3.0 15.0 1.0 4.0 3.0
BRONX 12754.0 7859.0 8890.0 2433.0 1952.0
BROOKLYN 28147.0 27461.0 13354.0 11458.0 5179.0
CAMBRIA HEIGHTS 147.0 76.0 25.0 12.0 115.0
CENTRAL PARK NaN 2.0 95.0 NaN NaN
COLLEGE POINT 435.0 352.0 33.0 35.0 184.0
CORONA 2761.0 660.0 238.0 248.0
I want to plot this as a horizontal bar chart for each complaint, showing the cities with the highest count of complaints, something similar to the image below. I am not sure how to go about it.

You can create a list of axis instances with subplots and plot the columns one by one:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 2, figsize=(10, 6))   # 3x2 grid, one panel per complaint type
for c, ax in zip(df.columns, axes.ravel()):
    df[c].sort_values().plot.barh(ax=ax)          # horizontal bars, largest count at the top
fig.tight_layout()
Then you would get something like this:
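If each panel should show only the cities with the highest counts (as asked), a small variation of the loop might help; this is a sketch rather than part of the original answer, and the choice of 10 cities is arbitrary:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 2, figsize=(10, 8))
for c, ax in zip(df.columns, axes.ravel()):
    # keep only the 10 cities with the most complaints of this type
    df[c].nlargest(10).sort_values().plot.barh(ax=ax, title=c)
fig.tight_layout()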

Related

Stacked plot with spaced xticks

I have done an aggregation which resulted in the following dataframe
df2 = tweet.groupby(['Realdate', 'Type'])['Text'].count().unstack().fillna(0)
Type BLM Black
Realdate
2020-03-01 21.0 9.0
2020-03-02 20.0 13.0
2020-03-03 32.0 16.0
2020-03-04 3.0 9.0
2020-03-05 28.0 16.0
... ... ...
2020-07-10 4050.0 4474.0
2020-07-11 2815.0 3743.0
2020-07-12 3575.0 3863.0
2020-07-13 3435.0 4704.0
2020-07-14 3284.0 4352.0
I then created a stacked plot as follows:
df2[['BLM','Black']].plot(kind='bar', stacked=True, figsize=(20,10))
The output is:
I have too many days and I am struggling to space the xticks. Can someone help me please?
I was tempted to replace my xticks and generate new ones, but I have been unsuccessful so far.
Thanks very much
After a lot of research, I found this answer
import matplotlib.pyplot as plt

sum_df = tweet.groupby(['Realdate', 'Type'])['Text'].count().unstack().fillna(0)
sum_df = sum_df.reset_index()
print(sum_df)
fig, ax1 = plt.subplots(figsize=(15, 10))
ax1.set_xlabel('Dates')
ax1.set_ylabel('# of Tweets', color='b')
ax1.yaxis.tick_left()
sum_df[['Black', 'BLM']].plot(kind='bar', stacked=True, ax=ax1, figsize=(20, 10))
ax1.legend(loc='upper left', fontsize=8)
ax1.set_xticklabels(sum_df.Realdate, rotation=90)
# hide every other tick label so the dates remain readable
for label in ax1.xaxis.get_ticklabels()[::2]:
    label.set_visible(False)
plt.legend(['#BlackLivesMatter', '#BLM'])
plt.title('# of Tweets per Day')
plt.show()
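As an aside (not part of the original answer), another way to thin out the dates is to set the tick positions explicitly instead of hiding labels; the step of 7 below is just an illustrative choice:
fig, ax = plt.subplots(figsize=(15, 10))
sum_df[['Black', 'BLM']].plot(kind='bar', stacked=True, ax=ax)
step = 7   # roughly one label per week
ax.set_xticks(range(0, len(sum_df), step))
ax.set_xticklabels(sum_df['Realdate'].iloc[::step], rotation=90)
plt.tight_layout()
plt.show()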

Removing outliers based on column variables or multi-index in a dataframe

This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc). Not the whole row, just the cell. And would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there a way to do this without breaking the dataframe into a whole bunch of sub-dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If you need to replace outliers with missing values, use GroupBy.transform with DataFrame.quantile, then compare against the lower and upper bounds with DataFrame.lt and DataFrame.gt, chain the masks with | (bitwise OR), and set missing values with DataFrame.mask (NaN is the default replacement, so it is not specified):
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=('red', 'yellow', 'green'))
df.loc[0:49, 'Season'] = 'Spring'
df.loc[50:99, 'Season'] = 'Fall'
df.loc[0:24, 'Treatment'] = 'Placebo'
df.loc[25:49, 'Treatment'] = 'Drug'
df.loc[50:74, 'Treatment'] = 'Placebo'
df.loc[75:99, 'Treatment'] = 'Drug'
df = df[['Season', 'Treatment', 'red', 'yellow', 'green']]

g = df.groupby(['Season', 'Treatment'])
df1 = g.transform('quantile', 0.05)                  # per-group 5th percentile
df2 = g.transform('quantile', 0.95)                  # per-group 95th percentile
c = df.columns.difference(['Season', 'Treatment'])   # only the numeric columns
mask = df[c].lt(df1) | df[c].gt(df2)                 # True where a value falls outside the bounds
df[c] = df[c].mask(mask)                             # replace outliers with NaN
print(df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]
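Since the question mentions IQR specifically, the same pattern can be adapted to the 1.5*IQR rule; this is a sketch under that assumption, not part of the original answer:
g = df.groupby(['Season', 'Treatment'])
q1 = g.transform('quantile', 0.25)   # per-group first quartile
q3 = g.transform('quantile', 0.75)   # per-group third quartile
iqr = q3 - q1
c = df.columns.difference(['Season', 'Treatment'])
mask = df[c].lt(q1 - 1.5 * iqr) | df[c].gt(q3 + 1.5 * iqr)
df[c] = df[c].mask(mask)             # values outside the IQR fences become NaN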

Pandas pivot with multiple items per column, how to avoid aggregating them?

Follow up to this question, in particular this comment.
Consider following dataframe:
df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0],
})
Which looks like this:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
Using a pivot_table() is a nice way to reshape this data: it allows querying by Person and seeing all of their Belongings in a single row, making it really easy to answer queries such as "What is the Value of a Person's Car, if they have a House valued at more than 400.0?"
A pivot_table() can be easily built for this data set with:
df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
)
Which will look like:
Belonging Bike Car House
Person
Adam NaN 10.0 300.0
Cesar NaN 12.0 NaN
Diana 2.0 15.0 450.0
Erika NaN 11.0 600.0
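For instance, the query mentioned above can already be answered against this pivot (a quick sketch using the sample data):
# Value of each Person's Car, for Persons with a House worth more than 400.0
df_pivot.loc[df_pivot['House'] > 400.0, 'Car']
# Person
# Diana    15.0
# Erika    11.0
# Name: Car, dtype: float64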
But this gets limited when a Person has more than one of the same type of Belonging, for example two Cars, two Houses or two Bikes.
Consider the updated data:
df = pd.DataFrame({
    'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika', 'Diana', 'Adam'],
    'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car', 'Car', 'House'],
    'Value': [300.0, 10.0, 12.0, 450.0, 15.0, 2.0, 600.0, 11.0, 21.0, 180.0],
})
Which looks like:
Person Belonging Value
0 Adam House 300.0
1 Adam Car 10.0
2 Cesar Car 12.0
3 Diana House 450.0
4 Diana Car 15.0
5 Diana Bike 2.0
6 Erika House 600.0
7 Erika Car 11.0
8 Diana Car 21.0
9 Adam House 180.0
Now that same pivot_table() will return the average of Diana's two cars, or Adam's two houses:
Belonging Bike Car House
Person
Adam NaN 10.0 240.0
Cesar NaN 12.0 NaN
Diana 2.0 18.0 450.0
Erika NaN 11.0 600.0
So we can pass pivot_table() an aggfunc='sum' or aggfunc=np.sum to get the sum rather than the average, which will give us 480.0 and 36.0 and is probably a better representation of the total value a Person owns in Belongings of a certain type. But we're missing details.
We can use aggfunc=list which will preserve them:
df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
    aggfunc=list,
)
Belonging Bike Car House
Person
Adam NaN [10.0] [300.0, 180.0]
Cesar NaN [12.0] NaN
Diana [2.0] [15.0, 21.0] [450.0]
Erika NaN [11.0] [600.0]
This keeps the detail about multiple Belongings per Person, but on the other hand it is quite inconvenient: it uses Python lists rather than native Pandas types and columns, which makes queries such as the total Value of all Houses difficult to answer.
With aggfunc=np.sum, we could simply use df_pivot['House'].sum() to get the total of 1530.0. Now even questions such as the one above, the Value of Cars for Persons with a House worth more than 400.0, are harder to answer.
What's a better way to reshape this data that will:
Allow easily querying a Person's Belongings in a single row, like the pivot_table() does;
Preserve the details of Persons who have multiple Belongings of a certain type;
Use native Pandas columns and data types that make it possible to use Pandas methods for querying and summarizing the data.
I thought of updating the Belonging descriptions to include a counter, such as "House 1", "Car 2", etc. Perhaps sorting so that the most valuable one comes first (to help answer questions such as "has a house worth more than 400.0" looking at "House 1" only.)
Or perhaps using a pd.MultiIndex to still be able to access all "House" columns together.
But unsure how to actually reshape the data in such a way.
Or are there better suggestions on how to reshape it (other than adding a count per belonging) that would preserve the features described above? How would you reshape it and how would you answer all these queries I mentioned above?
Perhaps something like this:
Given your pivot table in the following dataframe:
pv = df_pivot = df.pivot_table(
    values='Value',
    index='Person',
    columns='Belonging',
    aggfunc=list,
)
Then apply pd.Series to each column.
For proper naming of the columns, calculate the maximum list length in each column and then use set_axis for renaming:
new_pv = pd.DataFrame(index=pv.index)
for col in pv:
    n = int(pv[col].str.len().max())
    new_pv = pd.concat(
        [new_pv,
         pv[col].apply(pd.Series)
                .set_axis([f'{col}_{i}' for i in range(n)], axis=1)],
        axis=1,
    )
# Bike_0 Car_0 Car_1 House_0 House_1
# Person
# Adam NaN 10.0 NaN 300.0 180.0
# Cesar NaN 12.0 NaN NaN NaN
# Diana 2.0 15.0 21.0 450.0 NaN
# Erika NaN 11.0 NaN 600.0 NaN
Counting Houses:
new_pv.filter(like='House').count(1)
# Person
# Adam 2
# Cesar 0
# Diana 1
# Erika 1
# dtype: int64
Sum of all House values:
new_pv.filter(like='House').sum().sum()
# 1530.0
Using groupby, you could achieve something like this.
df_new = df.groupby(['Person', 'Belonging']).agg(('sum', 'count', 'min', 'max'))
which would give:
Value
sum count min max
Person Belonging
Adam Car 10.0 1 10.0 10.0
House 480.0 2 180.0 300.0
Cesar Car 12.0 1 12.0 12.0
Diana Bike 2.0 1 2.0 2.0
Car 36.0 2 15.0 21.0
House 450.0 1 450.0 450.0
Erika Car 11.0 1 11.0 11.0
House 600.0 1 600.0 600.0
You could define your own functions in the .agg method to provide more suitable descriptions also.
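For example (a sketch, where the value_range function is just illustrative and not part of the original answer):
def value_range(s):
    # spread between the most and least valuable Belonging of this type
    return s.max() - s.min()

df_new = df.groupby(['Person', 'Belonging'])['Value'].agg(['sum', 'count', value_range])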
Edit
Alternatively, you could try
df['Belonging'] = df["Belonging"] + "_" + df.groupby(['Person','Belonging']).cumcount().add(1).astype(str)
Person Belonging Value
0 Adam House_1 300.0
1 Adam Car_1 10.0
2 Cesar Car_1 12.0
3 Diana House_1 450.0
4 Diana Car_1 15.0
5 Diana Bike_1 2.0
6 Erika House_1 600.0
7 Erika Car_1 11.0
8 Diana Car_2 21.0
9 Adam House_2 180.0
Then you can just use pivot
df.pivot('Person', 'Belonging')
Value
Belonging Bike_1 Car_1 Car_2 House_1 House_2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 15.0 21.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
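As a side note (not in the original answer), recent pandas versions require keyword arguments for pivot, so the equivalent call there would be:
df.pivot(index='Person', columns='Belonging')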
I ended up working out a solution to this one, inspired by the excellent answers by @SpghttCd and @Josmoor98, but with a couple of differences:
Using a MultiIndex, so I have a really easy way to get all Houses or all Cars.
Sorting values, so looking at the first House or Car can be used to tell who has one worth more than X.
Code for the pivot table:
df_pivot = (df
    .assign(BelongingNo=df
        .sort_values(by='Value', ascending=False)
        .groupby(['Person', 'Belonging'])
        .cumcount() + 1
    )
    .pivot_table(
        values='Value',
        index='Person',
        columns=['Belonging', 'BelongingNo'],
    )
)
Resulting DataFrame:
Belonging Bike Car House
BelongingNo 1 1 2 1 2
Person
Adam NaN 10.0 NaN 300.0 180.0
Cesar NaN 12.0 NaN NaN NaN
Diana 2.0 21.0 15.0 450.0 NaN
Erika NaN 11.0 NaN 600.0 NaN
Queries are pretty straightforward.
For example, finding the Value of a Person's Cars, if they have a House valued at more than 400.0:
df_pivot.loc[
    df_pivot[('House', 1)] > 400.0,
    'Car'
]
Result:
BelongingNo 1 2
Person
Diana 21.0 15.0
Erika 11.0 NaN
The average Car price for them:
df_pivot.loc[
    df_pivot[('House', 1)] > 400.0,
    'Car'
].stack().mean()
Result: 15.6666
Here, using stack() is a powerful way to flatten the second level of the MultiIndex, after having used the top-level to select a Belonging column.
The same is useful to get the total Value of all Houses:
df_pivot['House'].stack().sum()
Results in the expected 1530.0.
Finally, looking at all Belongings of a single Person:
df_pivot.loc['Adam'].dropna()
Returns the expected two Houses and the one Car, with their respective Values.
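For reference, with the sample data above that lookup would produce something like:
Belonging  BelongingNo
Car        1               10.0
House      1              300.0
           2              180.0
Name: Adam, dtype: float64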
I tried doing this with the lists in the dataframe, so that they get converted to ndarrays.
pd_df_pivot = df_pivot.copy(deep=True)
for row in range(0, df_pivot.shape[0]):
    for col in range(0, df_pivot.shape[1]):
        if type(df_pivot.iloc[row, col]) is list:
            pd_df_pivot.iloc[row, col] = np.array(df_pivot.iloc[row, col])
        else:
            pd_df_pivot.iloc[row, col] = df_pivot.iloc[row, col]

Replace missing values based on another column

I am trying to replace the missing values in a dataframe based on filtering of another column, "Country"
>>> data.head()
Country Advanced skiers, freeriders Snow parks
0 Greece NaN NaN
1 Switzerland 5.0 5.0
2 USA NaN NaN
3 Norway NaN NaN
4 Norway 3.0 4.0
Obviously this is just a small snippet of the data, but I am looking to replace all the NaN values with the average value for each feature.
I have tried grouping the data by country and then calculating the mean of each column. When I print the resulting array, it shows the expected values. However, when I pass it into the .fillna() method, the data appears unchanged.
I've tried @DSM's solution from this similar post, but I am not sure how to apply it to multiple columns.
listOfRatings = ['Advanced skiers, freeriders', 'Snow parks']
print (data.groupby('Country')[listOfRatings].mean().fillna(0))
-> displays the expected results
data[listOfRatings] = data[listOfRatings].fillna(data.groupby('Country')[listOfRatings].mean().fillna(0))
-> appears to do nothing to the dataframe
Assuming this is the complete dataset, this is what I would expect the results to be.
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
Can anyone explain what I am doing wrong, and how to fix the code?
You can use transform to return a new DataFrame with the same size as the original, filled with the aggregated values:
print (data.groupby('Country')[listOfRatings].transform('mean').fillna(0))
Advanced skiers, freeriders Snow parks
0 0.0 0.0
1 5.0 5.0
2 0.0 0.0
3 3.0 4.0
4 3.0 4.0
# dynamically generate all column names except Country
listOfRatings = data.columns.difference(['Country'])
df1 = data.groupby('Country')[listOfRatings].transform('mean').fillna(0)
data[listOfRatings] = data[listOfRatings].fillna(df1)
print (data)
Country Advanced skiers, freeriders Snow parks
0 Greece 0.0 0.0
1 Switzerland 5.0 5.0
2 USA 0.0 0.0
3 Norway 3.0 4.0
4 Norway 3.0 4.0
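As for what went wrong in the original attempt (an aside; the reasoning is not spelled out in the answer above): fillna aligns on the index, and a plain groupby().mean() result is indexed by Country while data has a RangeIndex, so nothing matches and nothing is filled. transform('mean') returns a frame aligned to data's own index, which is why it works. A small sketch:
means_by_country = data.groupby('Country')[listOfRatings].mean()
print(means_by_country.index)   # Index(['Greece', 'Norway', 'Switzerland', 'USA'], name='Country')
print(data.index)               # RangeIndex(start=0, stop=5, step=1) for the snippet shown

# transform broadcasts the group means back onto the original rows
aligned = data.groupby('Country')[listOfRatings].transform('mean')
print(aligned.index)            # RangeIndex matching data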

How do I fill these `NaN` values properly?

Here's my original dataframe with NaN values which I'm trying to fill;
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, column 1981, row 3, which had a NaN value, has been filled in properly with a value other than NaN. I want to fill the rest of the NaN values like that as well. Any idea how I do that?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left, and therefore the interpolate method doesn't know what to interpolate it to: missing_value = (left_value + right_value)/2
So you could, for example, insert a column to the left with all 0's (if you would like to impute your missing values on the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you could interpolate as you are doing and then remove the column.
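A sketch of that full round-trip (assuming df is the original frame with the year columns):
df.insert(loc=0, column='allZeroes', value=0)   # temporary left-most anchor column
df = df.interpolate(axis=1)                     # row-wise interpolation now has a left neighbour
df = df.drop(columns='allZeroes')               # drop the helper column again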
General missing value imputation
Either use df.fillna('DEFAULT-VALUE') as Alex mentioned in the comments to the question. Docs here
or do something like:
df.my_col[df.my_col.isnull()] = 'DEFAULT-VALUE'
I'd recommend using fillna, as you can use methods such as forward fill (ffill) -- imputing the missing values with the previous value -- and other similar methods.
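A minimal sketch of those options ('my_col' is just a placeholder column name):
df['my_col'] = df['my_col'].fillna('DEFAULT-VALUE')   # fixed default value
df['my_col'] = df['my_col'].ffill()                   # forward fill: carry the previous value down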
It seems like you might want to interpolate on axis=0, column-wise:
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5,7),
columns=[1951, 1961, 1971, 1981, 1991, 2001, 2001],
index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
1951 1961 1971 1981 1991 2001 2001
0 0.0 1.0 2.0 3.0 4.0 5.0 6.0
1 7.0 8.0 9.0 10.0 11.0 12.0 13.0
2 14.0 15.0 16.0 17.0 18.0 19.0 20.0
3 21.0 22.0 23.0 24.0 25.0 26.0 27.0
4 28.0 29.0 30.0 31.0 32.0 33.0 34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't padded by a value on either side, making interpolation impossible for them.
Update: pandas is adding some more optionality for this in v 0.23.0.
