How to create conditional plots of groupby objects using matplotlib/seaborn? - python-3.x

I have data from a University where each entry is a student with the columns (first name, last name, major, sex, etc.)
I have created an aggregation of counts of male and females in each major:
gender_counts = (only_science.groupby(['no_concentration', 'sex'], as_index=False)
                 .size()
                 .unstack(fill_value=0)
                 .sort_values('Female', ascending=False)
                 )
Output:
DataFrame
Here is the plot that I created:
ax3 = gender_counts.plot(kind='bar', title='Gender Breakdown by Major')
ax3.set_xlabel("CoS Majors")
ax3.set_ylabel("Number of Applicants")
plt.show()
Output: Majors Plot by Gender
Goal: Create individual graphs of each major using the aggregated data so that the scale can be more meaningful and not be skewed by Biological Sciences.
I have tried sns.FacetGrid() with FacetGrid.map(), and also sns.catplot(), but I'm not sure what to use for the parameters, and I get a plethora of errors.
If I can create a bar chart for one of the majors then I can just create a for loop to iterate over gender_counts and make all of the bar charts.
Thank you for your help and I apologize if there are elements missing from this question. This is my first stack overflow question.

You can use sns.catplot with sharey=False:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'no_concentration': {0: 'Biological Sciences',1: 'Pre-Medicine',2: 'Biochemistry',3: 'Pre-Dentistry',4: 'Chemistry',
5: 'Mathematics',6: 'Physics',7: 'Microbiology',8: 'Geology',9: 'Biological Sciences',10: 'Pre-Medicine',
11: 'Biochemistry',12: 'Pre-Dentistry',13: 'Chemistry',14: 'Mathematics',15: 'Physics',16: 'Microbiology',17: 'Geology'},
'Sex': {0: 'Female',1: 'Female',2: 'Female',3: 'Female',4: 'Female',5: 'Female',6: 'Female',7: 'Female',8: 'Female',9: 'Male',10:
'Male',11: 'Male',12: 'Male',13: 'Male',14: 'Male',15: 'Male', 16: 'Male',17: 'Male'},
'value': {0: 1282,1: 1267, 2: 291, 3: 187, 4: 175, 5: 89, 6: 75, 7: 57,8: 18,9: 534,10: 445,11: 122,12: 76,13: 80,14: 76,15: 118,16: 29,17: 31}})
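sns.set_context('paper', font_scale=1.4)
# Optional: I use dark mode in Jupyter Notebook, so I set this style. It must be set before plotting to take effect.
plt.style.use('dark_background')
# sharey=False gives every major its own y-axis scale.
# catplot creates its own figure, so a separate plt.figure() call is not needed (it would only open an empty figure).
sns.catplot(data=df, x='Sex', y='value', col='no_concentration', kind='bar',
            col_wrap=3, palette=sns.color_palette("icefire"), sharey=False)
plt.show()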
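The answer builds a long-format frame by hand, while the question starts from a wide table (one row per major, one column per sex). A minimal sketch of converting between the two, assuming gender_counts looks like the small hypothetical table below (the three majors and their counts are copied from the answer's sample data; the column names of the result are chosen here for illustration):

```python
import pandas as pd

# Hypothetical wide table shaped like gender_counts:
# index = major, one column per sex.
gender_counts = pd.DataFrame(
    {"Female": [1282, 291, 89], "Male": [534, 122, 76]},
    index=pd.Index(
        ["Biological Sciences", "Biochemistry", "Mathematics"],
        name="no_concentration",
    ),
)

# Move the major back into a column, then melt the sex columns
# into (sex, value) pairs: one row per major/sex combination.
long_df = gender_counts.reset_index().melt(
    id_vars="no_concentration", var_name="sex", value_name="value"
)
print(long_df)
```

A frame in this shape should be usable as the data argument of sns.catplot, with x='sex', y='value' and col='no_concentration'.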

Related

pandas: draw plot using dict and labels on top of each bar

I am trying to plot a graph from a dict, which works fine but I also have a similar dict with values that I intend to write on top of each bar.
This works fine for plotting the graph:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['axes.formatter.useoffset'] = False
df = pd.DataFrame([population_dct])
df.sum().sort_values(ascending=False).plot.bar(color='b')
plt.savefig("temp_fig.png")
Where the population_dct is:
{'pak': 210, 'afg': 182, 'ban': 94, 'ind': 32, 'aus': 14, 'usa': 345, 'nz': 571, 'col': 47, 'iran': 2}
Now I have another dict, called counter_dct:
{'pak': 1.12134, 'afg': 32.4522, 'ban': 3.44, 'ind': 1.123, 'aus': 4.22, 'usa': 9.44343, 'nz': 57.12121, 'col': 2.447, 'iran': 27.5}
I need the second dict items to be shown on top of each bar from the previous graph.
What I tried:
df = pd.DataFrame([population_dct])
df.sum().sort_values(ascending=False).plot.bar(color='g')
for i, v in enumerate(counter_dct.values()):
    plt.text(v, i, " " + str(v), color='blue', va='center', fontweight='bold')
This has two issues:
counter_dct.values() messes up the sequence of values
The values are shown at the bottom of each graph with poor alignment
Perhaps there's a better way to achieve this?
Since you are drawing the bars in descending order, you first need to sort population_dct in descending order by value:
temp_dct = dict(sorted(population_dct.items(), key=lambda x: x[1], reverse=True))
Start with the temp_dct and then get the value from the counter_dct
counter = 0  # x position of the current bar, starting at the first one
for key, val in temp_dct.items():
    top_val = counter_dct[key]
    plt.text(x=counter, y=val + 2, s=f"{top_val}", fontdict=dict(fontsize=11))
    counter += 1
plt.xticks(rotation=45, ha='right')
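Putting the pieces together, here is a minimal end-to-end sketch of the same idea. It sorts with a pandas Series instead of a temporary dict, and it assumes a headless "Agg" backend and two-decimal label formatting; both are choices made for this example, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

population_dct = {'pak': 210, 'afg': 182, 'ban': 94, 'ind': 32, 'aus': 14,
                  'usa': 345, 'nz': 571, 'col': 47, 'iran': 2}
counter_dct = {'pak': 1.12134, 'afg': 32.4522, 'ban': 3.44, 'ind': 1.123,
               'aus': 4.22, 'usa': 9.44343, 'nz': 57.12121, 'col': 2.447,
               'iran': 27.5}

# Bar heights in descending order; the index keeps the country keys.
heights = pd.Series(population_dct).sort_values(ascending=False)
ax = heights.plot.bar(color='g')

# Bar i sits at x = i, so look up each label by key, not by dict order.
for x, (key, height) in enumerate(heights.items()):
    ax.text(x, height, f"{counter_dct[key]:.2f}", ha='center', va='bottom')

plt.xticks(rotation=45, ha='right')
plt.savefig("temp_fig.png")
```

Centering with ha='center' and anchoring with va='bottom' places each label directly above its bar without a hand-tuned offset.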

Networkx not showing all nodes in dataframe

I'm building a social network graph based on email exchanges. The dataset is a CSV that can be downloaded from my Google Drive and consists of integers (individuals, column source) connecting to other individuals (integers, column target): https://drive.google.com/file/d/183fIXkGUqDC7YGGdxy50jAPrekaI1273/view?usp=sharing
The point is, my dataframe has 400 rows, but only 21 nodes show up:
Here is the sample code:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv('/home/......./social.csv', sep=',',header=None)
df=df.iloc[0:400,:]
df.columns=['source','target']
nodes=np.arange(0,400)
G=nx.from_pandas_edgelist(df, "source", "target")
G.add_nodes_from(nodes)
pos = nx.spectral_layout(G)
coordinates=np.concatenate(list(pos.values())).reshape(-1,2)
nx.draw_networkx_edges(G, pos, edgelist=[e for e in G.edges],alpha=0.9)
nx.draw_networkx_nodes(G, pos, nodelist=nodes)
plt.show()
Column source has 160 different individuals and target has 260 different individuals.
The whole algorithm is running right, this is the only issue:
I'm wondering what I'm doing wrong. Any insights are welcome.
Your nodes are being drawn, but nx.spectral_layout positions them on top of each other.
If you print the positions:
pos = nx.spectral_layout(G)
print(pos)
You get:
{0: array([0.00927318, 0.01464153]), 1: array([0.00927318, 0.01464153]), 2: array([0.00927318, 0.01464153]), 3: array([0.00927318, 0.01464153]), 4: array([0.00927318, 0.01464153]), 5: array([-1. , -0.86684471]), 6: array([-1. , -0.86684471]), ...
And you can already see the overlap by comparing the positions.
You could instead use nx.circular_layout if you want to see all the nodes:
fig=plt.figure(figsize=(16,12))
pos = nx.circular_layout(G)
nx.draw(G, pos, nodelist=nodes,node_size=40)
With this layout every node gets its own position on a circle, so all of them are visible.
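To check for overlap without reading the printed dictionary by eye, you can count how many distinct coordinates a layout produces. This is a rough sketch: distinct_positions is a hypothetical helper written for this example (not a networkx function), and the small edge list stands in for the CSV.

```python
import networkx as nx
import numpy as np

# Hypothetical stand-in for the CSV: a few edges plus isolated nodes.
edges = [(0, 5), (1, 5), (2, 6), (3, 6)]
G = nx.Graph(edges)
G.add_nodes_from(range(10))

def distinct_positions(pos, decimals=6):
    """Count the unique (x, y) coordinates in a layout dict {node: array}."""
    coords = np.round(np.array(list(pos.values())), decimals)
    return len(np.unique(coords, axis=0))

n_nodes = G.number_of_nodes()
n_circular = distinct_positions(nx.circular_layout(G))
print(n_nodes, n_circular)
```

If the count for a layout is much smaller than G.number_of_nodes(), as the printed spectral positions above suggest, many nodes are stacked on the same point.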

Plot specific cells from a dataframe in Plotly

I have a dataframe of XY coordinates which I'm plotting as Markers in a Scatter plot. I'd like to add_trace lines between specific XY pairs, not between every pair. For example, I'd like a line between Index 0 and Index 3 and another between Index 1 and Index 2. This means that just using a line plot won't work as I don't want to show all the connections. Is it possible to do it with a version of iloc or do I need to make my DataFrame in 'Wide-format' and have each XY pair as separate column pairs?
I've read through this but I'm not sure it helps in my case.
Adding specific lines to a Plotly Scatter3d() plot
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
fig.show()
UPDATE:
Adding the accepted answer below to what I had already, I now get the following finished plot.
The approach taken here is to tag the DataFrame rows that form each pair of coordinates with a group label,
then add one line trace per group to the figure using a list comprehension.
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
# mark of pairs that will be lines
df.loc[[0, 3], "group"] = 1
df.loc[[1, 2], "group"] = 2
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[df["group"].eq(g), "MeanE"],
            y=df.loc[df["group"].eq(g), "MeanN"],
            mode="lines",
        )
        for g in df["group"].unique()
    ]
)
fig.show()
An alternative solution for the extended requirement raised in the comments:
# mark of pairs that will be lines
lines = [[0, 3], [1, 2], [0,2],[1,3]]
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[pair, "MeanE"],
            y=df.loc[pair, "MeanN"],
            mode="lines",
        )
        for pair in lines
    ]
)
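If one trace per pair produces too many legend entries, another option is to put all segments into a single trace and separate them with None, since Plotly leaves a gap at missing values by default. This is a sketch of building those coordinate lists only; the data values are rounded from the sample above, and xs and ys are names chosen for this example.

```python
import pandas as pd

# Sample coordinates, rounded from the question's data for readability.
df = pd.DataFrame({
    "MeanE": [22.4, 34.8, 25.9, 51.7],
    "MeanN": [110.7, 107.7, 140.6, 134.6],
})

# Index pairs that should be joined by a line.
lines = [[0, 3], [1, 2]]

# Build flat coordinate lists: start, end, None (gap), start, end, None, ...
xs, ys = [], []
for a, b in lines:
    xs += [df.loc[a, "MeanE"], df.loc[b, "MeanE"], None]
    ys += [df.loc[a, "MeanN"], df.loc[b, "MeanN"], None]

print(xs)
print(ys)
```

These lists can then be passed to a single go.Scatter(x=xs, y=ys, mode="lines") trace and added with fig.add_trace.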

Pandas Plot Bar Fixed Range Missing Values

I'm plotting a bar chart with data that I have in a pandas.DataFrame. My code is as follows
import pandas as pd
import matplotlib.pyplot as plot
from datetime import datetime
start_year = 2000
date_range = [ i + start_year for i in range(datetime.today().year - start_year)]
data = pd.DataFrame([
    [2015, 100], [2016, 110], [2017, 105], [2018, 109], [2019, 110], [2020, 116], [2021, 113]
], columns=["year", "value"])
chart = data.plot.bar(
    x="year",
    y="value",
    # xticks=date_range # ,
    xlim=[date_range[0], date_range[-1]]
)
plot.show()
The resulting plot is:
I have to plot several of these, for which data may start from 2000 and finish in 2010, then another dataframe that has data that starts in 2010 and ends in the current year.
In order to make these plots visually comparable, I would like for all to start at the same year, 2000 in this example, and finish the current year. If no value is present for a given year, then 0 can be used. In this case, as example, I've used the year 2000, but it could also start from the year 2005, 2006 or 2010.
How can I achieve what I'm looking for? I've tried setting xticks and xlim, but with xticks, the data gets skewed all towards one side, as if there were thousands of values in between. It is strange since I'm using int values.
Thanks
You can prepare your dataframe so that it contains every year you want: do a right merge() against a dataframe that lists all the required years. Years without data end up as NaN and are drawn as empty slots.
data = pd.DataFrame([
    [2015, 100], [2016, 110], [2017, 105], [2018, 109], [2019, 110], [2020, 116], [2021, 113]
], columns=["year", "value"])
# NB: range() excludes its end value, hence end year + 1
data.merge(pd.DataFrame({"year":range(2010,2021+1)}), on="year", how="right").plot(kind="bar", x="year", y="value")
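The question asks for a fixed range from a chosen start year up to the current year, with 0 for missing years. A minimal sketch of that variant using reindex rather than merge (a swapped-in technique, not the answer's method); start_year, end_year and full are names assumed for this example:

```python
from datetime import datetime

import pandas as pd

data = pd.DataFrame([
    [2015, 100], [2016, 110], [2017, 105], [2018, 109],
    [2019, 110], [2020, 116], [2021, 113],
], columns=["year", "value"])

start_year = 2000                  # could also be 2005, 2006, 2010, ...
end_year = datetime.today().year   # range() excludes its end, hence + 1 below

# Reindex on the full year range; years without data get the value 0.
full = (
    data.set_index("year")
        .reindex(range(start_year, end_year + 1), fill_value=0)
        .rename_axis("year")
        .reset_index()
)
print(full.head())
```

Calling full.plot.bar(x="year", y="value") on each prepared dataframe should then give every chart the same set of x positions, which is what makes them visually comparable.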

Plot a histogram through all the columns

I have a dataset with several columns.
Country    Weight    # of food/day    ....
---------------------------------------------
USA        180       4
China      190       12
USA        150       2
Canada     300       10
I want to create (separate) histogram for each of the columns such that histogram_1 shows the distribution of 'Country', histogram_2 shows the distribution of 'Weight', etc.
I'm currently using pandas to load and manipulate the data.
Is the easiest way to do this something like the following?
for column in df:
    plt.hist(column)
plt.show()
Please forgive me if my idea sounds so stupid.
Any help would be highly appreciated, thanks!
A histogram is not unambiguously defined for non-numeric or discrete values. Often the underlying question is "how many items of each unique kind are there?", which .value_counts answers. Since you say "# of histograms == # of columns (features)", we can create one subplot per column.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"Countries" : ["USA", "Mexico", "Canada", "USA", "Mexico"],
"Weight" : [180, 120, 100, 120, 130],
"Food" : [2,2,2,4,2]})
fig, axes = plt.subplots(ncols=len(df.columns), figsize=(10,5))
for col, ax in zip(df, axes):
    df[col].value_counts().sort_index().plot.bar(ax=ax, title=col)
plt.tight_layout()
plt.show()
You can use this instead of a for loop; it generates histograms for all numeric columns:
df.hist(bins=10, figsize=(25, 20))
If you want the plots in separate windows, you can do it this way:
df.set_index('Country', inplace=True)
for col in df.columns:
    df[col].plot.bar()
    plt.show()
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"Countries" : ["USA", "Mexico", "Canada", "USA", "Mexico"],
"Weight" : [200, 150, 190, 60, 40],
"Food" : [2,6,4,4,6]})
for col in df.columns:
    plt.hist(df[col])
    plt.xlabel(col)
    plt.show()
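The answers above treat numeric and non-numeric columns differently, so one way to combine them is to pick the plot type per column. A rough sketch, assuming a headless "Agg" backend and an arbitrary bin count; the kinds dict exists only to record which plot type each column got:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({"Country": ["USA", "Mexico", "Canada", "USA", "Mexico"],
                   "Weight": [180, 120, 100, 120, 130],
                   "Food": [2, 2, 2, 4, 2]})

fig, axes = plt.subplots(ncols=len(df.columns), figsize=(10, 4))
kinds = {}
for col, ax in zip(df.columns, axes):
    if is_numeric_dtype(df[col]):
        # Numeric column: a real histogram with binned values.
        ax.hist(df[col], bins=5)
        kinds[col] = "hist"
    else:
        # Non-numeric column: count occurrences of each category.
        df[col].value_counts().plot.bar(ax=ax)
        kinds[col] = "bar"
    ax.set_title(col)

fig.tight_layout()
fig.savefig("columns.png")
```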
