I'm plotting a bar chart with data that I have in a pandas.DataFrame. My code is as follows:
import pandas as pd
import matplotlib.pyplot as plot
from datetime import datetime
start_year = 2000
date_range = [ i + start_year for i in range(datetime.today().year - start_year)]
data = pd.DataFrame([
    [2015, 100], [2016, 110], [2017, 105], [2018, 109], [2019, 110], [2020, 116], [2021, 113]
], columns=["year", "value"])
chart = data.plot.bar(
    x="year",
    y="value",
    # xticks=date_range # ,
    xlim=[date_range[0], date_range[-1]]
)
plot.show()
The resulting plot is:
I have to plot several of these. One DataFrame may have data from 2000 to 2010, while another has data that starts in 2010 and ends in the current year.
To make these plots visually comparable, I would like all of them to start at the same year (2000 in this example) and finish at the current year. If no value is present for a given year, 0 can be used. I've used 2000 as the example start year, but it could also be 2005, 2006 or 2010.
How can I achieve this? I've tried setting xticks and xlim, but with xticks the data gets squashed towards one side, as if there were thousands of values in between. This is strange, since I'm using int values.
Thanks
You can prepare your DataFrame so that it contains every year you want: do a right merge() against a DataFrame that lists all required years. Years without data get NaN, which is drawn as an empty bar.
data = pd.DataFrame([
    [2015, 100], [2016, 110], [2017, 105], [2018, 109], [2019, 110], [2020, 116], [2021, 113]
], columns=["year", "value"])
# NB range() excludes its end value, hence end year + 1
all_years = pd.DataFrame({"year": range(2010, 2021 + 1)})
data.merge(all_years, on="year", how="right").plot(kind="bar", x="year", y="value")
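If an explicit 0 is preferred over NaN for the missing years, as the question suggests, the merge can be wrapped in a small helper. This is only a sketch: `pad_years` is a hypothetical name, and the year bounds below are illustrative, not taken from the question.

```python
import pandas as pd

def pad_years(data, start_year, end_year):
    """Return one row per year in [start_year, end_year]; missing years get value 0."""
    # range() excludes its end value, hence end_year + 1
    all_years = pd.DataFrame({"year": range(start_year, end_year + 1)})
    # A right merge keeps every year from all_years, in that order
    padded = data.merge(all_years, on="year", how="right")
    padded["value"] = padded["value"].fillna(0)
    return padded

data = pd.DataFrame(
    [[2015, 100], [2016, 110], [2017, 105]], columns=["year", "value"]
)
padded = pad_years(data, 2012, 2018)
print(padded)
```

Calling `padded.plot.bar(x="year", y="value")` on each padded DataFrame should then give every chart the same set of bars on the x-axis.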
I have data from a University where each entry is a student with the columns (first name, last name, major, sex, etc.)
I have created an aggregation of the counts of males and females in each major:
gender_counts = (only_science.groupby(['no_concentration', 'sex'], as_index=False)
                 .size()
                 .unstack(fill_value=0)
                 .sort_values('Female', ascending=False)
                 )
Output:
DataFrame
Here is the plot that I created:
ax3 = gender_counts.plot(kind='bar', title='Gender Breakdown by Major')
ax3.set_xlabel("CoS Majors")
ax3.set_ylabel("Number of Applicants")
plt.show()
Output: Majors Plot by Gender
Goal: Create individual graphs of each major using the aggregated data so that the scale can be more meaningful and not be skewed by Biological Sciences.
I have tried to use sns.FacetGrid() and FacetGrid.map(), and also tried sns.catplot(), but I'm not sure what to use for the parameters and I get a plethora of errors.
If I can create a bar chart for one of the majors then I can just create a for loop to iterate over gender_counts and make all of the bar charts.
Thank you for your help and I apologize if there are elements missing from this question. This is my first stack overflow question.
You can use sns.catplot with sharey=False:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({
    'no_concentration': {0: 'Biological Sciences', 1: 'Pre-Medicine', 2: 'Biochemistry', 3: 'Pre-Dentistry',
                         4: 'Chemistry', 5: 'Mathematics', 6: 'Physics', 7: 'Microbiology', 8: 'Geology',
                         9: 'Biological Sciences', 10: 'Pre-Medicine', 11: 'Biochemistry', 12: 'Pre-Dentistry',
                         13: 'Chemistry', 14: 'Mathematics', 15: 'Physics', 16: 'Microbiology', 17: 'Geology'},
    'Sex': {0: 'Female', 1: 'Female', 2: 'Female', 3: 'Female', 4: 'Female', 5: 'Female', 6: 'Female',
            7: 'Female', 8: 'Female', 9: 'Male', 10: 'Male', 11: 'Male', 12: 'Male', 13: 'Male',
            14: 'Male', 15: 'Male', 16: 'Male', 17: 'Male'},
    'value': {0: 1282, 1: 1267, 2: 291, 3: 187, 4: 175, 5: 89, 6: 75, 7: 57, 8: 18,
              9: 534, 10: 445, 11: 122, 12: 76, 13: 80, 14: 76, 15: 118, 16: 29, 17: 31}})
# Set the style before plotting. I use dark mode in Jupyter notebook, so I need this line, but you can omit it.
plt.style.use('dark_background')
sns.set_context('paper', font_scale=1.4)
# catplot creates its own figure, so no separate plt.figure() call is needed
sns.catplot(data=df, x='Sex', y='value', col='no_concentration', kind='bar',
            col_wrap=3, palette=sns.color_palette("icefire"), sharey=False)
plt.show()
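The DataFrame in this answer is in long format (one row per major and sex), whereas the question's gender_counts is a wide table with one column per sex. A minimal sketch of the conversion follows, assuming gender_counts has the major as its index and 'Female' and 'Male' columns; the small table here is a hypothetical stand-in built from three of the majors above.

```python
import pandas as pd

# Hypothetical wide table shaped like the question's gender_counts
gender_counts = pd.DataFrame(
    {'Female': [1282, 1267, 291], 'Male': [534, 445, 122]},
    index=pd.Index(['Biological Sciences', 'Pre-Medicine', 'Biochemistry'],
                   name='no_concentration'),
)

# Move the index into a column, then stack the sex columns into rows
long_df = (gender_counts
           .reset_index()
           .melt(id_vars='no_concentration', var_name='Sex', value_name='value'))
print(long_df)
```

The resulting long_df has the same three columns that the catplot call expects.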
I'm creating a stacked bar chart using the count of a categorical field in a DataFrame column.
chart = alt.Chart(df2).mark_bar().encode(
    x="take__take:O",
    y=alt.Y('count(name)', stack="normalize", axis=alt.Axis(title="Percent", format="%")),
    color=alt.Color('name', sort=alt.EncodingSortField('value', order='descending')),
    order=alt.Order(
        'value',
        sort="ascending"
    ),
    tooltip=[
        alt.Tooltip('count(name)', title="Total Students")
    ]
)
How would I go about getting the normalized count in the tooltip?
Up until now your chart uses encoding shorthands to compute various aggregates; for more complicated operations (like displaying normalized values in tooltips) you will need to use transforms directly.
Here is an example of displaying per-group percentages in a tooltip, using a chart similar to what you showed above:
import altair as alt
import numpy as np
import pandas as pd
np.random.seed(0)
df2 = pd.DataFrame({
    'name': np.random.choice(['A', 'B', 'C', 'D'], size=100),
    'value': np.random.randint(0, 20, 100),
    'take__take': np.random.randint(0, 5, 100)
})
alt.Chart(df2).transform_aggregate(
    count='count()',
    groupby=['name', 'take__take']
).transform_joinaggregate(
    total='sum(count)',
    groupby=['take__take']
).transform_calculate(
    frac=alt.datum.count / alt.datum.total
).mark_bar().encode(
    x="take__take:O",
    y=alt.Y('count:Q', stack="normalize", axis=alt.Axis(title="Percent", format="%")),
    color='name:N',
    tooltip=[
        alt.Tooltip('count:Q', title="Total Students"),
        alt.Tooltip('frac:Q', title="Percentage of Students", format='.0%')
    ]
)
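To see what the three transforms compute, the same numbers can be reproduced in pandas. This is only an illustrative equivalent (Altair performs these steps itself when rendering the chart); the column names count, total and frac mirror the ones used in the transforms above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df2 = pd.DataFrame({
    'name': np.random.choice(['A', 'B', 'C', 'D'], size=100),
    'take__take': np.random.randint(0, 5, 100),
})

# transform_aggregate: one row per (name, take__take) with its row count
counts = df2.groupby(['name', 'take__take']).size().reset_index(name='count')
# transform_joinaggregate: attach the per-take__take total to every row
counts['total'] = counts.groupby('take__take')['count'].transform('sum')
# transform_calculate: fraction of each group within its take__take bar
counts['frac'] = counts['count'] / counts['total']
print(counts.head())
```

Within each take__take value the frac column sums to 1, which is what the normalized stack displays.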
I'm trying to annotate a chart so that each label includes the plotted x-axis value as well as additional information from the DataFrame. I can annotate the x-axis values, but I'm not sure how to add the additional information. In the example below I annotate the x-axis values, which come from the Completion column, but I also want to add the Completed and Participants values from the DataFrame.
For example, the Running Completion is 20%, but I want the annotation to also show the Completed and Participants values, in the format 20% (2/10). Below is sample code that reproduces my scenario, along with the current and desired results. Any help is appreciated.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydict = {
    'Event': ['Running', 'Swimming', 'Biking', 'Hiking'],
    'Completed': [2, 4, 3, 7],
    'Participants': [10, 20, 35, 10]}
df = pd.DataFrame(mydict).set_index('Event')
df = df.assign(Completion=(df.Completed/df.Participants) * 100)
print(df)
plt.subplots(figsize=(5, 3))
ax = sns.barplot(x=df.Completion, y=df.index, color="cyan", orient='h')
for i in ax.patches:
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%', fontsize=10)
plt.tight_layout()
plt.show()
DataFrame:
          Completed  Participants  Completion
Event
Running           2            10   20.000000
Swimming          4            20   20.000000
Biking            3            35    8.571429
Hiking            7            10   70.000000
Current Output:
Desired Output:
Loop through the columns Completed and Participants as well when you annotate:
for (c, p), i in zip(df[["Completed", "Participants"]].values, ax.patches):
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%' + f" ({c}/{p})", fontsize=10)
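As an alternative sketch, the label strings can be built from the DataFrame first and then attached to the bars in a separate step. The rounding to whole percentages here is an assumption made to match the "20% (2/10)" format from the question; it differs from the two-decimal rounding in the loop above.

```python
import pandas as pd

df = pd.DataFrame({
    'Event': ['Running', 'Swimming', 'Biking', 'Hiking'],
    'Completed': [2, 4, 3, 7],
    'Participants': [10, 20, 35, 10],
}).set_index('Event')
df = df.assign(Completion=df.Completed / df.Participants * 100)

# One label per row, in the same order as the bars are drawn
labels = [
    f"{row.Completion:.0f}% ({row.Completed}/{row.Participants})"
    for row in df.itertuples()
]
print(labels)
```

These strings can be passed to ax.text in the loop above, or, on matplotlib 3.4 or newer, to `ax.bar_label(ax.containers[0], labels=labels)`.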
I am trying to show migration between locations in a Sankey diagram in HoloViews, but I can't find a way to add a dropdown-type filter. Sankey does not let me list more key dimensions than the two it plots. I expected that to work, because with other HoloViews elements I get a dropdown menu: the data is automatically grouped by all the key dimensions I did not assign to the element.
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
df = pd.DataFrame({'from': ["a", "b", "c", "a", "b", "c"],
                   'to': ["d", "d", "e", "e", "e", "d"],
                   'number': [10, 2, 1, 8, 2, 2],
                   'year': [2018, 2018, 2018, 2017, 2017, 2017]})
df
  from to  number  year
0    a  d      10  2018
1    b  d       2  2018
2    c  e       1  2018
3    a  e       8  2017
4    b  e       2  2017
5    c  d       2  2017
Now on to HoloViews. I add the year column to kdims because I want the dropdown to filter by year:
kdims = ["from", "to", "year"]
vdims = ["number"]
sankey = hv.Sankey(df, kdims=kdims, vdims=vdims)
sankey.opts(label_position='left', edge_color='to', node_padding=30, node_color='number', cmap='tab20')
returning:
ValueError: kdims: list length must be between 2 and 2 (inclusive)
Without the third key dimension the Sankey diagram works as expected, but then there is no interactive filter:
Here are two ways of solving your problem:
1) Turn your DataFrame into a HoloViews Dataset and turn that into a Sankey plot:
Since 'year' is the third key dimension in the code below, it is used as the dimension for the slider. The first two key dimensions ('from' and 'to') are used as the key dims of the Sankey plot.
hv_ds = hv.Dataset(
    data=df,
    kdims=['from', 'to', 'year'],
    vdims=['number'],
)
hv_ds.to(hv.Sankey)
2) Or create a dictionary of Sankey plots per year and put those into a HoloMap:
sankey_dict = {
    year: hv.Sankey(df[df.year == year])
    for year in df.year.unique()
}
holo = hv.HoloMap(sankey_dict, kdims='year')
Both solutions create a HoloMap:
http://holoviews.org/reference/containers/bokeh/HoloMap.html
Resulting plot + slider:
I've tested this on:
hvplot 0.5.2
holoviews 1.12.5 and holoviews 1.13
jupyterlab 1.2.4
I have a dataset with several columns.
Country    Weight    # of food/day    ....
---------------------------------------------
USA        180       4
China      190       12
USA        150       2
Canada     300       10
I want to create a separate histogram for each of the columns, so that histogram_1 shows the distribution of 'Country', histogram_2 shows the distribution of 'Weight', and so on.
I'm currently using pandas to load and manipulate the data.
Is the easiest way to do this something like the following?
for column in df:
    plt.hist(column)
plt.show()
Please forgive me if my idea sounds so stupid.
Any help would be highly appreciated, thanks!
Defining a histogram for non-numeric or discrete values is ambiguous. Often the real question is "how many items of each unique kind are there?". This can be answered with .value_counts(). Since you say "# of histograms == # of columns (features)", we can create one subplot per column.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"Countries" : ["USA", "Mexico", "Canada", "USA", "Mexico"],
                   "Weight" : [180, 120, 100, 120, 130],
                   "Food" : [2,2,2,4,2]})
fig, axes = plt.subplots(ncols=len(df.columns), figsize=(10,5))
for col, ax in zip(df, axes):
    df[col].value_counts().sort_index().plot.bar(ax=ax, title=col)
plt.tight_layout()
plt.show()
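To inspect what each subplot will show before drawing anything, the per-column counts can be printed on their own. This is a small sketch that reuses the example DataFrame above; the dictionary name counts is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Countries": ["USA", "Mexico", "Canada", "USA", "Mexico"],
                   "Weight": [180, 120, 100, 120, 130],
                   "Food": [2, 2, 2, 4, 2]})

# One Series of counts per column, sorted by the unique values
counts = {col: df[col].value_counts().sort_index() for col in df}
for col, series in counts.items():
    print(col, series.to_dict())
```

Each Series maps a unique value to the number of rows containing it, which is exactly the bar height plotted for that value.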
You can use this instead of a for loop; it generates histograms for all numeric columns:
df.hist(bins=10, figsize=(25, 20))
If you want the plots in different windows, you can do it this way:
df.set_index('Country', inplace=True)
for col in df.columns:
    df[col].plot.bar()
    plt.show()
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"Countries" : ["USA", "Mexico", "Canada", "USA", "Mexico"],
                   "Weight" : [200, 150, 190, 60, 40],
                   "Food" : [2,6,4,4,6]})
for col in df.columns:
    plt.hist(df[col])
    plt.xlabel(col)
    plt.show()