How to plot based upon unique column values? - python-3.x

I'm a beginner learning to use python to do data visualizations.
I found a really cool data set from the UN; it is formatted like this:
Afghanistan 1975 2127
Afghanistan 1985 3509
Afghanistan 1995 1243
Afghanistan 2005 1327
Albania 1975 4595
Albania 1985 7880
Albania 1995 2087
Albania 2005 4254
etc...
Up until now, I've been parsing out individual countries with statements like this:
china = data[data.area == 'China']
This is fine for picking individual countries but now, I want to plot all of them. How could I go about that?
So far I've tried this but couldn't figure out how to make it work:
old_value = data.iloc[0].area
for i in range(len(data)):
    if data.iloc[i].area == old_value:
        pass  # add to current set
    else:
        pass  # create new set
Any help would be much appreciated!

Given your data
Setup imports and dataframe
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plot parameters
plt.style.use('seaborn')
plt.rcParams['figure.figsize'] = (16.0, 10.0)
data = {'country': ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Albania', 'Albania', 'Albania', 'Albania'],
        'year': [1975, 1985, 1995, 2005, 1975, 1985, 1995, 2005],
        'value': [2127, 3509, 1243, 1327, 4595, 7880, 2087, 4254]}
df = pd.DataFrame(data)
country year value
0 Afghanistan 1975 2127
1 Afghanistan 1985 3509
2 Afghanistan 1995 1243
3 Afghanistan 2005 1327
4 Albania 1975 4595
5 Albania 1985 7880
6 Albania 1995 2087
7 Albania 2005 4254
Use seaborn.barplot with the hue parameter
p = sns.barplot(x='year', y='value', hue='country', data=df)
Horizontally
p = sns.barplot(x='value', y='year', hue='country', data=df, orient='h')
A separate plot for each country
Using plt.subplot(1, 2, i), the rows times the columns should equal the number of unique countries (or that number plus 1 if it is odd).
max_value = df.value.max() + 100 # + 100 to add padding at the top of the plot; 100 is an arbitrary value and can be removed
for i, country in enumerate(df.country.unique(), 1):  # iterate through each unique country
    data = df[df.country == country]  # filter by country
    plt.subplot(1, 2, i)  # rows, columns, i: plot index beginning at 1
    sns.barplot(x='year', y='value', data=data)
    plt.ylim(0, max_value)  # shared y-limit with the max of the value column; makes it easier to compare countries
    plt.title(country)
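Alternatively, seaborn can build the per-country grid in a single call with catplot, whose col parameter creates one subplot per unique value (a sketch using the same made-up data as above; the axes share a y-scale by default, which gives the same easy comparison as the manual ylim):

```python
import matplotlib
matplotlib.use('Agg')  # optional: non-interactive backend so this runs headless
import pandas as pd
import seaborn as sns

data = {'country': ['Afghanistan'] * 4 + ['Albania'] * 4,
        'year': [1975, 1985, 1995, 2005] * 2,
        'value': [2127, 3509, 1243, 1327, 4595, 7880, 2087, 4254]}
df = pd.DataFrame(data)

# col='country' creates one subplot per unique country;
# col_wrap controls how many subplots appear per row
g = sns.catplot(x='year', y='value', col='country', col_wrap=2,
                data=df, kind='bar')
```

This avoids computing the grid shape and the shared y-limit by hand.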

Related

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to groupby id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and take the mean of ratio over rows where status is finished and start_date falls within 2019-09 or 2019-10
result_count: group by id and address and count the rows where status is either finished or failed and start_date falls within 2019-09 or 2019-10
The desired output will like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter start_date isin the range of 2019-09 and 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter row status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some groups (like id=1) lose all their rows after filtering, add them back with DataFrame.join against all unique id and address pairs in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
Some helpers (these assume import numpy as np in addition to pandas):
def mean_ratio(idf):
    # filter to rows in the date window that have a computed mean_ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    # count finished/failed rows in the date window
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]

# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
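The two passes over the same groupby can also be collapsed into one apply that returns both metrics as a Series (a sketch using the question's column names and conditions; the sample rows here are made up, and it assumes start_date is already datetime and ratio is precomputed):

```python
import numpy as np
import pandas as pd

def both_metrics(idf):
    # rows inside the 2019-09 / 2019-10 window
    in_window = idf['start_date'].between('2019-09-01', '2019-10-31')
    finished = idf['status'] == 'finished'
    done_or_failed = idf['status'].isin(['finished', 'failed'])
    return pd.Series({
        'mean_ratio': np.round(idf.loc[in_window & finished, 'ratio'].mean(), 2),
        'result_count': int((in_window & done_or_failed).sum()),
    })

# tiny made-up sample to show the shape of the result
df = pd.DataFrame({
    'id': [1, 1, 2],
    'address': ['7552 Atlantic Lane', '7552 Atlantic Lane', '888 Foster Street'],
    'status': ['finished', 'failed', 'current'],
    'start_date': pd.to_datetime(['2019-09-05', '2019-10-10', '2019-11-01']),
    'ratio': [1.5, 2.0, 3.0],
})
res = df.groupby(['id', 'address']).apply(both_metrics)
```

One traversal of the groups produces both columns, at the cost of losing the per-column named-aggregation style.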

difference between two column of a dataframe

I am new to python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns, together with a respective third column. For example, I have a dataframe Soccer which contains all the teams playing soccer with the goals for and against their club. I want to find the goal difference together with the team name, i.e. Goals Diff = GoalsFor - GoalsAgainst.
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
for index, row in football.iterrows():
    # pdb.set_trace()
    goalsFor = row['GoalsFor']
    goalsAgainst = row['GoalsAgainst']
    teamName = row['Team']
    if not total:
        totals = np.array(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
    else:
        total = total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
return total

def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]
However, I would like to know if there is a quickest way to get this, something like
dataframe['goalsFor'] - dataframe['goalsAgainst'] #along with the team name in the dataframe
Solution if the values in the Team column are unique - create an index from Team, take the difference, and select a team by index:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
Solution if possible duplicated values in Team column:
I believe you need grouping by Team and aggregate sum first and then get difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
df = df.groupby('Team')[['GoalsFor','GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786
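If you just want the difference as a new column next to the team name, with no index tricks, plain column arithmetic on the original dataframe is enough (a sketch with the question's column names; GoalDiff is a name chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Real Madrid', 'Barcelona', 'Atletico Madrid'],
                   'GoalsFor': [5947, 5900, 4534],
                   'GoalsAgainst': [3140, 3114, 3309]})

# column arithmetic is vectorized; no iterrows loop needed
df['GoalDiff'] = df['GoalsFor'] - df['GoalsAgainst']
```

The Team column stays in place, so each difference is already paired with its team name.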

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes in different country sites over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]

grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
According to pandas's documentation, agg applies functions to each column independently, so your function only ever sees a single column and group['Time_steps'] raises the KeyError.
It might be possible to use NamedAgg but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site','Species','Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
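groupby(...).apply also works here, since apply hands each function the whole group rather than one column at a time. A sketch below uses numpy.polyfit in place of statsmodels to keep it self-contained (the first coefficient of a degree-1 fit is the slope, which matches linmod.params[1] for a simple regression); the data values are abbreviated from the question:

```python
import numpy as np
import pandas as pd

# toy data shaped like the question's dat
dat = pd.DataFrame({
    'Site': ['Germany'] * 3 + ['Norway'] * 3,
    'Species': ['pine'] * 3 + ['spruce'] * 3,
    'Age': [1, 1, 1, 0, 0, 0],
    'Time_steps': [1, 2, 3, 1, 2, 3],
    'Concentration': [2.18, 5.19, 11.52, 16.64, 25.30, 31.20],
})

def slope(group):
    # np.polyfit with degree 1 returns [slope, intercept]
    return np.polyfit(group['Time_steps'], group['Concentration'], 1)[0]

# one regression slope per (Site, Species, Age) group
coeffs = dat.groupby(['Site', 'Species', 'Age']).apply(slope)
```

Unlike agg, apply receives the full sub-frame, so indexing both columns inside the function is fine.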

How to extract a keyword(string) from a column in pandas dataframe in python

I have a dataframe df and it looks like this:
id Type agent_id created_at
0 44525 Stunning 6 bedroom villa in New Delhi 184 2018-03-09
1 44859 Villa for sale in Amritsar 182 2017-02-19
2 45465 House in Faridabad 154 2017-04-17
3 50685 5 Hectre land near New Delhi 113 2017-09-01
4 130728 Duplex in Mumbai 157 2017-02-07
5 130856 Large plot with fantastic views in Mumbai 137 2018-01-16
6 130857 Modern Design Penthouse in Bangalore 199 2017-03-24
I have this tabular data and I'm trying to clean it by extracting keywords from the Type column, creating a new dataframe with new columns.
Apartment = ['apartment', 'penthouse', 'duplex']
House = ['house', 'villa', 'country estate']
Plot = ['plot', 'land']
Location = ['New Delhi','Mumbai','Bangalore','Amritsar']
So the desired dataframe should look like this:
id Type Location agent_id created_at
0 44525 House New Delhi 184 2018-03-09
1 44859 House Amritsar 182 2017-02-19
2 45465 House Faridabad 154 2017-04-17
3 50685 Plot New Delhi 113 2017-09-01
4 130728 Apartment Mumbai 157 2017-02-07
5 130856 Plot Mumbai 137 2018-01-16
6 130857 Apartment Bangalore 199 2017-03-24
So far I've tried this:
import pandas as pd
df = pd.read_csv('test_data.csv')

# I can extract these keywords one by one using for loops, but how
# can I do this in pandas with the minimum possible lines of code?
for index, values in df.Type.iteritems():
    for i in Apartment:
        if i in values:
            print(i)

df_new = pd.DataFrame(df['id'])
Can someone tell me how to solve this?
First create Location column by str.extract with | for regex OR:
pat = '|'.join(r"\b{}\b".format(x) for x in Location)
df['Location'] = df['Type'].str.extract('('+ pat + ')', expand=False)
Then create a dictionary from the other lists, swap keys with values, and in a loop set the values using a mask from str.contains with parameter case=False:
d = {'Apartment' : Apartment,
'House' : House,
'Plot' : Plot}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
for k, v in d1.items():
df.loc[df['Type'].str.contains(k, case=False), 'Type'] = v
print (df)
id Type agent_id created_at Location
0 44525 House 184 2018-03-09 New Delhi
1 44859 House 182 2017-02-19 Amritsar
2 45465 House 154 2017-04-17 NaN
3 50685 Plot 113 2017-09-01 New Delhi
4 130728 Apartment 157 2017-02-07 Mumbai
5 130856 Plot 137 2018-01-16 Mumbai
6 130857 Apartment 199 2017-03-24 Bangalore
When I ran this I got:
ValueError: cannot index with vector containing NA / NaN values
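That ValueError happens because str.contains returns NaN for rows where the column is missing, and a mask containing NaN cannot be used to index a dataframe. Passing na=False treats missing values as non-matches (a minimal sketch with made-up strings):

```python
import pandas as pd

s = pd.Series(['Villa for sale', None, 'Duplex in Mumbai'])

# na=False makes missing entries count as "no match" instead of NaN,
# so the result is a clean boolean mask
mask = s.str.contains('villa', case=False, na=False)
```

Adding na=False to the str.contains call in the loop above should make the masking work even when Type has missing values.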

Plotting line graph on the same figure using matplotlib [duplicate]

I have a temperature file with many years temperature records, in a format as below:
2012-04-12,16:13:09,20.6
2012-04-12,17:13:09,20.9
2012-04-12,18:13:09,20.6
2007-05-12,19:13:09,5.4
2007-05-12,20:13:09,20.6
2007-05-12,20:13:09,20.6
2005-08-11,11:13:09,20.6
2005-08-11,11:13:09,17.5
2005-08-13,07:13:09,20.6
2006-04-13,01:13:09,20.6
Every year has a different number of records taken at different times, so the pandas datetime indices are all different.
I want to plot each year's data in the same figure for comparison. The X-axis is Jan to Dec, the Y-axis is temperature. How should I go about doing this?
Try:
ax = df1.plot()
df2.plot(ax=ax)
If you are running a Jupyter/IPython notebook and having problems using
ax = df1.plot()
df2.plot(ax=ax)
run the commands inside the same cell! For some reason it won't work when they are separated into sequential cells. For me at least.
Chang's answer shows how to plot a different DataFrame on the same axes.
In this case, all of the data is in the same dataframe, so it's better to use groupby and unstack.
Alternatively, pandas.DataFrame.pivot_table can be used.
dfp = df.pivot_table(index='Month', columns='Year', values='value', aggfunc='mean')
When using pandas.read_csv, names= creates column headers when there are none in the file. The 'date' column must be parsed into datetime64[ns] Dtype so the .dt extractor can be used to extract the month and year.
import pandas as pd
# given the data in a file as shown in the op
df = pd.read_csv('temp.csv', names=['date', 'time', 'value'], parse_dates=['date'])
# create additional month and year columns for convenience
df['Year'] = df.date.dt.year
df['Month'] = df.date.dt.month
# group by month and year and aggregate the mean of the value column
dfg = df.groupby(['Month', 'Year'])['value'].mean().unstack()
# display(dfg)
Year 2005 2006 2007 2012
Month
4 NaN 20.6 NaN 20.7
5 NaN NaN 15.533333 NaN
8 19.566667 NaN NaN NaN
Now it's easy to plot each year as a separate line. The OP's sample has only one month of data per year, so each line collapses to a single point and only a marker is displayed.
ax = dfg.plot(figsize=(9, 7), marker='.', xticks=dfg.index)
To do this for multiple dataframes, you can do a for loop over them:
fig = plt.figure(num=None, figsize=(10, 8))
ax = dict_of_dfs['FOO'].column.plot()
for BAR in dict_of_dfs.keys():
    if BAR == 'FOO':
        pass
    else:
        dict_of_dfs[BAR].column.plot(ax=ax)
This can also be implemented without the if condition:
fig, ax = plt.subplots()
for BAR in dict_of_dfs.keys():
    dict_of_dfs[BAR].plot(ax=ax)
You can make use of the hue parameter in seaborn. For example:
import seaborn as sns
df = sns.load_dataset('flights')
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
.. ... ... ...
139 1960 Aug 606
140 1960 Sep 508
141 1960 Oct 461
142 1960 Nov 390
143 1960 Dec 432
sns.lineplot(x='month', y='passengers', hue='year', data=df)
