Pandas Crosstab Plot Top N Elements - python-3.x

I have created a Pandas Crosstab with some data I am currently working with.
contingencyTable = pd.crosstab(index=dfBoolean['ColA'],
columns=dfBoolean['Category'],
margins=True)
Some of the categories from both the primaryDistribution and the categoricalColumnName have a TON of categories--way too many to plot nicely on a single graph. Is it possible to select the top n (say, 5, for example) categories to plot in the cross tab? Thanks!
plt.rcParams.update({'font.size': 12})
bPlot = contingencyTable.plot.bar(rot=0, figsize=(9.2, 5.7))
Here is some sample data:
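One possible way to keep only the top n categories is to rank the categories by overall frequency and select those columns from the crosstab before plotting. The sketch below is only an illustration: the column names ColA and Category come from the question, while the values, n, top_categories and topTable are made up. It builds the crosstab without margins so that the 'All' totals are not drawn as an extra bar.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
dfBoolean = pd.DataFrame({
    "ColA": [True, False, True, True, False, True, False, True, True, False],
    "Category": list("aaaabbbccd"),
})

n = 2  # how many categories to keep (assumed value for the example)

# Rank categories by overall frequency and keep the n most common.
top_categories = dfBoolean["Category"].value_counts().nlargest(n).index

# Build the crosstab without margins, then keep only the top-n columns.
contingencyTable = pd.crosstab(index=dfBoolean["ColA"],
                               columns=dfBoolean["Category"])
topTable = contingencyTable.loc[:, top_categories]

print(topTable)
# topTable.plot.bar(rot=0, figsize=(9.2, 5.7)) would then draw only those columns.
```

If the crosstab was built with margins=True, the same ranking could probably be read from its 'All' row instead, after dropping the 'All' column.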

Related

Pandas Series boolean maps and plotting

I am just trying to up my understanding of plotting Pandas Series data using Booleans to mask out values I don't want. I am not sure that what I have is the correct or efficient way to do it.
Don't get me wrong, I do get the chart I am after but are my assumptions on the syntax correct?
All I want to do is plot the non zero values on my chart. I have not formatted the charts as I would normally as this was just a test of Booleans and masking data and not for creating report grade charts.
If I masked this as a Pandas DataFrame I would do the following if df1 were my DataFrame.
I understand this and it makes sense that the df1[mask] returns my values as required
# Plot our graph with only items that are non-zero
fig = px.bar(df1[mask], x = 'Animals', y = 'Count')
fig.show()
Doing it as a Pandas Series
This is the snippet that creates the graph I require
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
My initial test, adding the mask to sf directly, raised an error, so I deduced that I needed to apply the mask to the x and y parameters instead. I take it this is because a Series is just a single column with my "animals" as its index. So sf.index[sf_mask] returns the animals in the index and sf[sf_mask] returns the values. Failing to mask either one gives a "ValueError" stating that the arguments should have the same length.
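The length mismatch can be checked without plotting anything. This is a small sketch on a shortened, made-up version of the animals data; x_masked and y_masked are names introduced here for illustration.

```python
import pandas as pd

# Shortened stand-in for the animals Series used in the question.
sf = pd.Series({"rabbits": 1, "dogs": 3, "cats": 0, "horses": 8})
sf_mask = sf != 0  # True where the value is non-zero

x_masked = sf.index[sf_mask]  # labels of the non-zero entries
y_masked = sf[sf_mask]        # values of the non-zero entries

# The unmasked index is longer than the masked values, which is why
# mixing a masked argument with an unmasked one fails.
print(len(sf.index), len(x_masked), len(y_masked))
```

Masking both x and y keeps the two arguments the same length, which is what the plotting call needs.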
Here is what I did to test my workings
My initial imports and setting up Plotly as my plotting backend
import pandas as pd
import plotly.express as px
# Set our plotting backend to Plotly
pd.options.plotting.backend = "plotly"
I just created a test dataset from a dictionary
animals = {'rabbits' : 1,
'dogs' : 3,
'cats' : 0,
'ferrets' : 3,
'horses' : 8,
'goldfish' : 0,
'guinea_pigs' : 2,
'hamsters' : 6,
'mice' : 3,
'rats' : 0
}
Then converted it to a pandas Series
sf = pd.Series(animals)
I then create my boolean mask to mask out all the zero entries in the Pandas Series
sf_mask = sf != 0
And if I then view the mask I can see I only get non zero values which is exactly what I am looking for.
sf[sf_mask]
Which outputs my non-zero items in my series.
rabbits 1
dogs 3
ferrets 3
horses 8
guinea_pigs 2
hamsters 6
mice 3
dtype: int64
If I plot without my Boolean mask 'sf_mask' using the following syntax I get my complete Pandas Series charted
# Plot our Series showing all items
fig = px.bar(sf, x = sf.index, y = sf)
fig.show()
Which outputs the following chart
If I plot with my Boolean mask 'sf_mask' using the following syntax I get the chart I want which excludes the gaps with zero value items.
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
Which outputs the correct chart.
Your understanding of booleans and masking is correct.
You can simplify your syntax a little though: if you take a look at the plotly.express.bar documentation, you'll see that the arguments 'x' and 'y' are optional. You don't need to pass 'x' or 'y' because by default plotly.express will create the bars using the index of the Series as x and the values of the Series as y. You can also pass the masked series in place of the entire series.
For example, this will produce the same bar chart:
fig = px.bar(sf[sf>0])
fig.update_layout(showlegend=False)

How to group-by twice, preserve original columns, and plot

I have the following data sets (only sample is shown):
I want to find the most impactful exercise per area and then plot it via Seaborn barplot.
I use the following code to do so.
# Create Dataset Using Only Area, Exercise and Impact Level Categories
CA_data = Data[['area', 'exercise', 'impact level']]
# Compute Mean Impact Level per Exercise per Area
mean_il_CA = CA_data.groupby(['area', 'exercise'])['impact level'].mean().reset_index()
mean_il_CA_hello = mean_il_CA.groupby('area')['impact level'].max().reset_index()
# Plot
cx = sns.barplot(x="impact level", y="area", data=mean_il_CA_hello)
plt.title('Most Impactful Exercises Considering Area')
plt.show()
The resulting dataset is:
This means that when I plot, on the y axis only the label relative to the area appears, NOT 'area label' + 'exercise label' like I would like.
How do I reinsert the 'exercise' column into my final dataset?
How do I get both the name of the area and the exercise on the y plot?
The problem of losing the values of 'exercise' when grouping by the maximum of 'area' can be solved by keeping the MultiIndex (i.e. not using reset_index) and using .transform to create a boolean mask that selects the full rows of mean_il_CA containing the maximum 'impact level' value per 'area'. This solution is based on the code provided in this answer by unutbu. The full labels for the bar chart can be created by concatenating the labels of 'area' and 'exercise'.
Here is an example using the titanic dataset from the seaborn package. The variables 'class', 'embark_town', and 'fare' are used in place of 'area', 'exercise', and 'impact level'. Each categorical variable contains three unique values: 'First', 'Second', 'Third' for 'class', and 'Cherbourg', 'Queenstown', 'Southampton' for 'embark_town'.
import pandas as pd # v 1.2.5
import seaborn as sns # v 0.11.1
df = sns.load_dataset('titanic')
data = df[['class', 'embark_town', 'fare']]
data.head()
data_mean = data.groupby(['class', 'embark_town'])['fare'].mean()
data_mean
# Select max values in each class and create concatenated labels
mask_max = data_mean.groupby(level=0).transform(lambda x: x == x.max())
data_mean_max = data_mean[mask_max].reset_index()
data_mean_max['class, embark_town'] = data_mean_max['class'].astype(str) + ', ' \
+ data_mean_max['embark_town']
data_mean_max
# Draw seaborn bar chart
sns.barplot(data=data_mean_max,
x=data_mean_max['fare'],
y=data_mean_max['class, embark_town'])
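As a variation on the answer above, and not the method it describes, the same selection can be sketched with idxmax on the question's own column names. The data values below are invented, and best is a name introduced for illustration.

```python
import pandas as pd

# Invented data; only the column names come from the question.
Data = pd.DataFrame({
    "area": ["legs", "legs", "legs", "arms", "arms", "arms"],
    "exercise": ["squat", "lunge", "squat", "curl", "dip", "dip"],
    "impact level": [3.0, 2.0, 5.0, 1.0, 4.0, 2.0],
})

# Mean impact level per exercise per area.
mean_il_CA = (Data.groupby(["area", "exercise"])["impact level"]
                  .mean()
                  .reset_index())

# idxmax returns the row label of the maximum within each area, so .loc
# pulls back the full rows, including the 'exercise' column.
best = mean_il_CA.loc[mean_il_CA.groupby("area")["impact level"].idxmax()].copy()

# Combined label for the y axis of the bar chart.
best["area, exercise"] = best["area"] + ", " + best["exercise"]
print(best)
```

One difference to keep in mind: idxmax keeps only the first row when several exercises tie for the maximum in an area, whereas the boolean mask built with transform keeps every tied row.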

How to add traces in plotly.express

I am very new to python and plotly.express, and I find it very confusing...
I am trying to use the principle of adding different traces to my figure, using example code shown here https://plotly.com/python/line-charts/, Line Plot Modes, #Create traces.
BUT I get my data from a .CSV file.
import plotly.express as px
import plotly as plotly
import plotly.graph_objs as go
import pandas as pd
data = pd.read_csv(r"C:\Users\x.csv")
fig = px.scatter(data, x="Time", y="OD", color="C-source", size="C:A 1 ratio")
fig = px.line(data, x="Time", y="OD", color="C-source")
fig.show()
The above lines produce scatter/line plots with the correct data, but the data is mixed together. I have data from 2 different sources marked by a column named "Strain" in my .csv file that I would like the chart to reflect.
Is the traces option a possible way to do it, or is there another way?
You can add traces using an Express plot by using .select_traces(). Something like:
fig.add_traces(
list(px.line(...).select_traces())
)
Note the need to convert to list, since .select_traces() returns a generator.
It looks like you probably want the lines with the scatter dots as well on a single plot?
You're setting fig to equal px.scatter() and then setting (changing) it to equal px.line(). When set to line, the scatter plot is overwritten.
You're already importing graph objects so you can use add_trace with go, something like this:
fig.add_trace(go.Scatter(x=data["Time"], y=data["OD"], mode='markers', marker=dict(color=data["C-source"], size=data["C:A 1 ratio"])))
Depending on how your data is set up, you may need to add each C-source separately doing something like:
x=data[data["C-source"] == "Term"]["Time"], ... , name='Term'
Here are a few references with examples and options you can use to set up your scatter:
Scatter plot examples  
Marker styles  
Scatter arguments and attributes
You can use the approach described in "Plotly: How to combine scatter and line plots using Plotly Express?":
fig3 = go.Figure(data=fig1.data + fig2.data)
fig1.data and fig2.data are ordinary tuples that hold all the info needed for a plot, and the + just concatenates them.
Or, a more convenient and scalable approach:
# this will hold all figures until they are combined
all_figures = []
# data_collection: dictionary with Pandas dataframes
for df_label in data_collection:
    df = data_collection[df_label]
    fig = px.line(df, x='Date', y=['Value'])
    all_figures.append(fig)
import operator
import functools
# now you can concatenate all the data tuples
# by using the programmatic add operator
fig3 = go.Figure(data=functools.reduce(operator.add, [_.data for _ in all_figures]))
fig3.show()
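The reduce step is ordinary tuple concatenation, so it can be tried without plotly. In this sketch the strings are hypothetical stand-ins for the trace objects that each figure's .data tuple would hold.

```python
import functools
import operator

# Each inner tuple stands in for one figure's .data tuple (assumed contents).
trace_tuples = [("line A",), ("line B", "markers B"), ("line C",)]

# operator.add on tuples concatenates them, so reduce chains them all together.
combined = functools.reduce(operator.add, trace_tuples)

# sum() with an empty-tuple start value gives the same result.
same = sum(trace_tuples, ())

print(combined)
```

The resulting flat tuple has the same shape as what is passed to go.Figure(data=...) in the answer above.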
Thanks for taking the time to help me out. I ended up with two solutions that worked, of which using "facet_col" to divide the plot into two subplots (one for each strain) was the simplest.
https://plotly.com/python/axes/
Thanks. This worked for me also, where Fig_Set_B is a list of scatter plots:
# create a tuple of the first line plots in the first 6 plots from plot set Fig_Set_B
fig_combined = go.Figure(data= tuple(Fig_Set_B[x].data[0] for x in range(6)) )
fig_combined.show()

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1) Setting different color for each series in scatter plot on matplotlib
2) https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4
xmax = 3e-3
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x = "{}".format(header_list[i]), y = "{}".format(header_list[i + 1]), ax=ax)
        dfplot.set_xlim(xmin,xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing
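One possible way to get a distinct color per series without relying on a built-in colormap is to space hues evenly around the color wheel. This is a minimal sketch under stated assumptions: distinct_hex_colors, its saturation and value defaults, and the example header names are all hypothetical, and only the standard library is used.

```python
import colorsys

def distinct_hex_colors(n, saturation=0.75, value=0.9):
    """Return n hex color strings with hues spaced evenly around the wheel."""
    colors = []
    for k in range(n):
        r, g, b = colorsys.hsv_to_rgb(k / n, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

# Hypothetical headers laid out as alternating x/y columns, as in the question.
header_list = ["x1", "y1", "x2", "y2", "x3", "y3"]

# Pair every x column with the y column that follows it.
pairs = list(zip(header_list[0::2], header_list[1::2]))

# One color per x-y pair.
colors = distinct_hex_colors(len(pairs))

for (x_col, y_col), color in zip(pairs, colors):
    print(x_col, y_col, color)
```

Inside the plotting loop, each hex string could then be passed as color= to data.plot.scatter along with the matching x and y column names. With 100+ series, neighbouring hues will still look similar, so varying the marker shape as well may help.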

Distributing plots across a grid of variable axes length in python

I have written a few lines of Python 3 code to assist me in the automated analysis of data generated using a technique called calorimetry (for radiation dosimetry). In the enclosed example, the analysis of the input file returned eight 'heating regions' (top panel), and in each region a pair of linear regressions (black segment, red segment) was made on portions of the data to calculate the magnitude of the 'step', relative to the average value of my quantity of interest (the varying resistance of a thermistor), which is plotted in the bottom panel of the same figure.
automatic identification of 8 heating regions (top panel) and computed relative step magnitude (bottom panel)
Results of this type of analysis are summarized in a data frame (an ndarray from numpy at present) but, ideally, I would also like to produce a graphical representation with some annotations in each subplot, including information from the corresponding line in the results dataframe:
Step analysis via a pair of linear regressions and further computation
The general output would look something like this last figure, with each subplot including the same essential information from the previous individual plot.
The output is, in this specific case, a (2,4) grid because there were exactly 8 regions to analyse.
This was created by hand, without any iteration, using this portion of code in a Jupyter notebook:
%matplotlib inline
results_fig = pyplt.figure(figsize=(20,10))
results_grid = matplotlib.gridspec.GridSpec(2, 4, hspace=0.2, wspace=0.3)
results_fig.suptitle("Faceted presentation of calorimetric runs", fontsize=15)
ax1 = results_fig.add_subplot(results_grid[0,0])
ax1.scatter(time,resistance, marker ='o', s=20, c='blue')
ax1.plot(time[x1[0]:xmid[0]], line_pre[0], color='black', linewidth=3.0)
ax1.plot(time[xmid[0]:x4[0]], line_post[0], color='red', linewidth=3.0)
ax1.set_xlim(xlim1[0],xlim2[0])
ax1.set_ylabel("resistance [Ohm]")
# [... continues for each subplot in the grid ... ]
Given that the number of 'heating regions' may vary considerably from file to file, i.e. I cannot determine it before analyzing each experimental output datafile, here is my pair of questions:
How can I produce a grid of subplots without prior knowledge of how many subplots it will show? One of the dimensions of the grid could be four, as in the example provided here, but the other is unknown. I could iterate over the length of one of the axes of the numpy results array, but then I would need to span over two axes in my plot grid.
Without re-inventing the wheel, is there a python module that can assist in this direction?
Thanks
Here is how you create a grid of n x 4 subplots and iterate over them:
import numpy as np
import matplotlib.pyplot as plt
numplots = 10 # number of plots to create
m = 4 # number of columns
n = int(np.ceil(numplots / m)) # number of rows
fig, axes = plt.subplots(nrows=n, ncols=m)
fig.subplots_adjust(hspace=0.2, wspace=0.3)
for data, ax in zip(alldata, axes.flatten()):
    ax.plot(data[0], data[1], color='black')
    # further plotting, label setting etc.
# optionally, remove empty plots from grid
if n*m > numplots:
    for ax in axes.flatten()[numplots:]:
        ax.remove()
        ## or
        # ax.set_visible(False)
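The row calculation can be isolated and checked on its own. In this sketch, grid_shape and unused_cells are hypothetical helper names, and the fixed column count of four is taken from the question.

```python
import math

def grid_shape(numplots, ncols=4):
    """Return (nrows, ncols) for a grid large enough to hold numplots axes."""
    # Round up so a partially filled last row still gets created;
    # keep at least one row so the grid is never empty.
    nrows = max(1, math.ceil(numplots / ncols))
    return nrows, ncols

def unused_cells(numplots, ncols=4):
    """Return how many trailing axes would be left empty in the grid."""
    nrows, ncols = grid_shape(numplots, ncols)
    return nrows * ncols - numplots

for count in (8, 10, 3):
    print(count, grid_shape(count), unused_cells(count))
```

The number returned by unused_cells matches how many trailing axes the removal loop above would strip from the flattened grid.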
