Issues with seaborn.catplot - python-3.x

I am a beginner in Python (using Python 3.7 in Spyder 3.3.2 and Anaconda Navigator 1.9.6). I have no problem creating seaborn violin plots, but the moment I try to Facetgrid them I run into issues. I tried using catplot.
Here is my violin plot code (it works):
# Libraries
import seaborn as sns
import pandas as pd
import os # Imports `os`
from matplotlib import pyplot as plt
os.chdir(r"XXXXXX") # Changes directory
os.listdir('.') # Lists all files and directories in current directory
## Data set
File = 'test_eventcountratios.xlsx' # Assigns Excel filename to File
df = pd.read_excel(File)
ax = sns.violinplot(x = df["Timepoint"], y = df["Macrophage Frequency"], palette = "Blues")
ax.set_xticklabels(ax.get_xticklabels(),rotation=30)
My data is long form, so all timepoints are in the first column and "Macrophage Frequency" data are in the second column. All remaining columns represent other cell types. Here is a screenshot of my data spreadsheet
Here is my catplot code (it doesn't work):
g=sns.catplot(data=df, x="Timepoint", y=df["B cell Frequency","Neutrophil Frequency","NK cell Frequency","Macrophage Frequency"],
palette = "Blues",
kind = "violin", split=True)
I get "Key Error: ('B cell Frequency', 'Neutrophil Frequency', 'NK cell Frequency', 'Macrophage Frequency')"
I don't even want to call on each column individually. I would like the code to run through each column (cell type) to gather data and put each column's data into its own plot.
I stripped the catplot code to basics to see if that worked:
g=sns.catplot(x = df["Timepoint"], y = df["Macrophage Frequency"], palette = "Blues", data=df, kind="violin")
It works and produces a violin plot, but with this error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
So...
I want to make a grid of multiple violin plots (Timepoint on X axis, Cell type frequency on Y axis), where each plot takes data from each column. Why am I only successful when I limit my "y" to a single column from my dataframe?
I've Googled all of my errors, but I can't seem to make the right changes to my code. If I change one thing, then I get a new error (like "TypeError: object of type 'NoneType' has no len()", "ValueError: num must be 1 <= num <= 0, not 1", etc.).

Use this:
g = sns.catplot(x = "Timepoint", y = "Macrophage Frequency", palette = "Blues", data=df, kind="violin")
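x and y should simply be column names in df, passed as strings. catplot looks them up in the DataFrame given as data, so there is no need to pass df["..."] yourself.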
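To get one violin plot per cell type, a common approach is to reshape the table first so that all frequencies share a single column, and then let catplot facet on the cell type. The following is a minimal sketch, assuming column names like those in the question; the toy values and the names `long_df`, "Cell type" and "Frequency" are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line when working in Spyder
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for the spreadsheet: one column per cell type (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Timepoint": ["Day 0"] * 5 + ["Day 7"] * 5,
    "Macrophage Frequency": rng.random(10),
    "B cell Frequency": rng.random(10),
})

# Stack every cell-type column into a single "Frequency" column
long_df = df.melt(id_vars="Timepoint", var_name="Cell type", value_name="Frequency")

# One violin panel per cell type, with Timepoint on the x axis
g = sns.catplot(data=long_df, x="Timepoint", y="Frequency",
                col="Cell type", col_wrap=2, kind="violin", sharey=False)
```

Because `melt` picks up every column that is not listed in `id_vars`, new cell-type columns in the spreadsheet would get their own panel without being named in the code.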

Related

Pandas Series boolean maps and plotting

I am just trying to up my understanding of plotting Pandas Series data using Booleans to mask out values I don't want. I am not sure that what I have is the correct or efficient way to do it.
Don't get me wrong, I do get the chart I am after but are my assumptions on the syntax correct?
All I want to do is plot the non zero values on my chart. I have not formatted the charts as I would normally as this was just a test of Booleans and masking data and not for creating report grade charts.
If I masked this as a Pandas DataFrame I would do the following if df1 were my DataFrame.
I understand this and it makes sense that the df1[mask] returns my values as required
# Plot our graph with only items that are non-zero
fig = px.bar(df1[mask], x = 'Animals', y = 'Count')
fig.show()
Doing it as a Pandas Series
This is the snippet that creates the graph I require
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
After my initial test of adding my mask to sf gave an error, I deduced that I needed to apply the mask to both the x and y parameters. I take it this is because a Series is just a single column and the index is set as my "animals". Therefore sf.index[sf_mask] returns the animals in the index and sf[sf_mask] returns the values. Failing to mask either one gives a "ValueError" stating that the arguments should have the same length.
Here is what I did to test my workings
My initial imports and setting up Plotly as my plotting backend
import pandas as pd
import plotly.express as px
# Set our plotting backend to Plotly
pd.options.plotting.backend = "plotly"
I just created a test dataset from a dictionary
animals = {'rabbits' : 1,
'dogs' : 3,
'cats' : 0,
'ferrets' : 3,
'horses' : 8,
'goldfish' : 0,
'guinea_pigs' : 2,
'hamsters' : 6,
'mice' : 3,
'rats' : 0
}
Then converted it to a pandas Series
sf = pd.Series(animals)
I then create my boolean mask, which is True for the non-zero entries of the Pandas Series and so masks out the zeros
sf_mask = sf != 0
If I then apply the mask to the Series, I can see that I only get non-zero values, which is exactly what I am looking for.
sf[sf_mask]
Which outputs my non-zero items in my series.
rabbits 1
dogs 3
ferrets 3
horses 8
guinea_pigs 2
hamsters 6
mice 3
dtype: int64
If I plot without my Boolean mask 'sf_mask' using the following syntax I get my complete Pandas Series charted
# Plot our Series showing all items
fig = px.bar(sf, x = sf.index, y = sf)
fig.show()
Which outputs the following chart
If I plot with my Boolean mask 'sf_mask' using the following syntax I get the chart I want which excludes the gaps with zero value items.
# Plot our graph with only items that are non-zero
fig = px.bar(sf, x = sf.index[sf_mask], y = sf[sf_mask])
fig.show()
Which outputs the correct chart.
Your understanding of booleans and masking is correct.
You can simplify your syntax a little though: if you take a look at the plotly.express.bar documentation, you'll see that the arguments 'x' and 'y' are optional. You don't need to pass 'x' or 'y' because by default plotly.express will create the bars using the index of the Series as x and the values of the Series as y. You can also pass the masked series in place of the entire series.
For example, this will produce the same bar chart:
fig = px.bar(sf[sf>0])
fig.update_layout(showlegend=False)
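As a quick check of why the mask has to be applied to both sides when x and y are passed separately, here is a small sketch using a shortened version of the animals data from the question. It only inspects labels and lengths and does not draw anything.

```python
import pandas as pd

# Shortened version of the animals Series from the question
sf = pd.Series({"rabbits": 1, "dogs": 3, "cats": 0, "horses": 8, "rats": 0})
sf_mask = sf != 0

# Masking the Series filters the index and the values together
filtered = sf[sf_mask]
print(filtered.index.tolist())  # ['rabbits', 'dogs', 'horses']
print(filtered.tolist())        # [1, 3, 8]

# Masking only one side leaves x and y with different lengths,
# which is the mismatch behind the ValueError described in the question
print(len(sf.index), len(filtered))  # 5 3
```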

How to group-by twice, preserve original columns, and plot

I have the following data sets (only sample is shown):
I want to find the most impactful exercise per area and then plot it via Seaborn barplot.
I use the following code to do so.
# Create Dataset Using Only Area, Exercise and Impact Level Categories
CA_data = Data[['area', 'exercise', 'impact level']]
# Compute Mean Impact Level per Exercise per Area
mean_il_CA = CA_data.groupby(['area', 'exercise'])['impact level'].mean().reset_index()
mean_il_CA_hello = mean_il_CA.groupby('area')['impact level'].max().reset_index()
# Plot
cx = sns.barplot(x="impact level", y="area", data=mean_il_CA_hello)
plt.title('Most Impactful Exercises Considering Area')
plt.show()
The resulting dataset is:
This means that when I plot, on the y axis only the label relative to the area appears, NOT 'area label' + 'exercise label' like I would like.
How do I reinsert the 'exercise' column into my final dataset?
How do I get both the name of the area and the exercise on the y axis of the plot?
The problem of losing the values of 'exercise' when grouping by the maximum of 'area' can be solved by keeping the MultiIndex (i.e. not using reset_index) and using .transform to create a boolean mask to select the appropriate full rows of mean_il_CA that contain the maximum 'impact_level' values per 'area'. This solution is based on the code provided in this answer by unutbu. The full labels for the bar chart can be created by concatenating the labels of 'area' and 'exercise'.
Here is an example using the titanic dataset from the seaborn package. The variables 'class', 'embark_town', and 'fare' are used in place of 'area', 'exercise', and 'impact_level'. The categorical variables both contain three unique values: 'First', 'Second', 'Third', and 'Cherbourg', 'Queenstown', 'Southampton'.
import pandas as pd # v 1.2.5
import seaborn as sns # v 0.11.1
df = sns.load_dataset('titanic')
data = df[['class', 'embark_town', 'fare']]
data.head()
data_mean = data.groupby(['class', 'embark_town'])['fare'].mean()
data_mean
# Select max values in each class and create concatenated labels
mask_max = data_mean.groupby(level=0).transform(lambda x: x == x.max())
data_mean_max = data_mean[mask_max].reset_index()
data_mean_max['class, embark_town'] = data_mean_max['class'].astype(str) + ', ' \
+ data_mean_max['embark_town']
data_mean_max
# Draw seaborn bar chart
sns.barplot(data=data_mean_max,
x='fare',
y='class, embark_town')
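Applied to column names like those in the question, the same idea might look as follows. This is a sketch on made-up rows, and it swaps the transform mask for `idxmax`, which returns the index label of the largest value in each group. Note that `idxmax` keeps only one row per area when there are ties, whereas the boolean mask keeps all tied rows.

```python
import pandas as pd

# Hypothetical sample rows; the real data set is not shown in the question
Data = pd.DataFrame({
    "area": ["arms", "arms", "arms", "legs", "legs", "legs"],
    "exercise": ["curl", "curl", "dip", "squat", "lunge", "lunge"],
    "impact level": [2.0, 4.0, 5.0, 6.0, 7.0, 9.0],
})

# Mean impact level per exercise per area
mean_il = Data.groupby(["area", "exercise"])["impact level"].mean().reset_index()

# Row label of the highest mean within each area, then select those full rows
best = mean_il.loc[mean_il.groupby("area")["impact level"].idxmax()].copy()

# Combined label for the y axis of the bar plot
best["area, exercise"] = best["area"] + ", " + best["exercise"]
print(best)
```

The resulting `best` frame still contains the 'exercise' column, so it could be passed to `sns.barplot(data=best, x="impact level", y="area, exercise")`.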

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1)Setting different color for each series in scatter plot on matplotlib
2)https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4;
xmax = 3e-3;
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x = "{}".format(header_list[i]), y = "{}".format(header_list[i + 1]), ax=ax)
        dfplot.set_xlim(xmin,xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing
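One possible way to colour each series differently is to sample one colour per x-y pair from a colormap before the loop and pass it to each scatter call. The sketch below is not taken from the question: it uses made-up data in place of the CSV, calls matplotlib's `ax.scatter` directly instead of `data.plot.scatter`, and assumes the qualitative "tab20" colormap, which only offers 20 distinct colours. For 100+ series a continuous colormap would be needed, at the cost of neighbouring series looking similar.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: alternating x / y columns
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "x1": rng.uniform(1e-4, 3e-3, 20), "y1": rng.random(20),
    "x2": rng.uniform(1e-4, 3e-3, 20), "y2": rng.random(20),
    "x3": rng.uniform(1e-4, 3e-3, 20), "y3": rng.random(20),
})
header_list = list(data)
n_series = len(header_list) // 2

# Sample one colour per series, evenly spaced across the colormap
cmap = plt.get_cmap("tab20")  # assumed choice; any registered colormap name works
colors = [cmap(k / max(n_series - 1, 1)) for k in range(n_series)]

fig, ax = plt.subplots()
for k in range(n_series):
    x_col, y_col = header_list[2 * k], header_list[2 * k + 1]
    ax.scatter(data[x_col], data[y_col], color=colors[k], label=y_col)
ax.set_xlim(1e-4, 3e-3)  # same X-axis limits as in the question
ax.legend()
```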

Carrying out multiple piecewise regressions with variables from same dataframe (but varying columnpair lengths)

I'm trying to analyse and plot piecewise regressions for daily temperature and gas use. I have six columns (two corresponding to each year) within a csv which I am pulling in using pandas, then defining each column as a separate variable.
I found one of the answers on How to apply piecewise linear fit in Python? extremely helpful and was able to use the following code to run a breakpoint analysis and also plot a graph:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pwlf
# Importing the csv and defining columns as variables
df = pd.read_csv(PATH)
Y_A = df.Column1
X_A = df.Column2
Y_B = df.Column3
X_B = df.Column4
# Analysing breakpoints
my_pwlf_a = pwlf.PiecewiseLinFit(X_A, Y_A)
breaks_a = my_pwlf_a.fit(2)
print(breaks_a)
# Graphing
x_hat = np.linspace(X_A.min(), X_A.max(), 100)
y_hat = my_pwlf_a.predict(x_hat)
plt.figure()
plt.plot(X_A, Y_A, 'o')
plt.plot(x_hat, y_hat, '-')
plt.xlabel('X'); plt.ylabel('Y');
plt.show()
This runs with no problems and gives the desired results.
When I try to repurpose the code using my next pair of variables (Y_B and X_B) I run into problems:
my_pwlf_b = pwlf.PiecewiseLinFit(X_B, Y_B)
breaks_b = my_pwlf_b.fit(2)
print(breaks_b)
The error returned is:
ValueError: bounds should be a sequence containing real valued (min, max) pairs for each value in x
All variables are float64 and each column contains 366 rows. Thanks for any help in spotting what I'm missing!
Thanks to Zionsof for the nudge back towards the data!
Further testing shows that unequal lengths of the column pairings were the problem (e.g. Columns 1 & 2 contained 366 values while Columns 3 & 4 contained 365). I had foolishly thought that separating the columns into separate variables might fix this, but I was incorrect. Here is what I used to fix it (numpy.isfinite):
# Remove any blanks by ensuring the values are finite
Y_A = df.Column1[np.isfinite(df['Column1'])]
X_A = df.Column2[np.isfinite(df['Column2'])]
Y_B = df.Column3[np.isfinite(df['Column3'])]
X_B = df.Column4[np.isfinite(df['Column4'])]
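One caveat with filtering each column on its own is that X and Y can end up misaligned if the blanks sit in different rows. A possible alternative, sketched here on made-up values with the placeholder column names from the question, is to drop incomplete rows per column pair so that each X/Y pair keeps the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second pair is one row shorter than the first
df = pd.DataFrame({
    "Column1": [10.0, 12.0, 15.0, 11.0],
    "Column2": [1.0, 2.0, 3.0, 4.0],
    "Column3": [9.0, 13.0, 14.0, np.nan],
    "Column4": [1.5, 2.5, 3.5, np.nan],
})

# Keep only rows where both members of a pair are present
pair_a = df[["Column1", "Column2"]].dropna()
pair_b = df[["Column3", "Column4"]].dropna()

Y_A, X_A = pair_a["Column1"], pair_a["Column2"]
Y_B, X_B = pair_b["Column3"], pair_b["Column4"]
print(len(X_A), len(X_B))  # 4 3
```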

Plot the distance between every two points in 2 D

I have a table with three columns: the first column holds the name of each point, the second column holds numerical data (the mean), and the last column holds the second column plus a fixed number. The following is an example of what the data looks like:
I want to plot this table so that I get the following figure:
If it is possible, how can I plot it using Microsoft Excel, Python, or R (Bokeh)?
Alright, I only know how to do it in ggplot2, I will answer regarding R here.
This method only works if the data frame is in the format you provided above.
I renamed your columns to Name.of.Method, Mean, Mean.2.2
Preparation
Loading csv data into R
df <- read.csv('yourdata.csv', sep = ',')
Change the column names (do this if you don't want to change the code below; otherwise you will need to go through each parameter to match your column names).
names(df) <- c("Name.of.Method", "Mean", "Mean.2.2")
Method 1 - Using geom_segment()
library(ggplot2)
ggplot() +
geom_segment(data=df,aes(x = Mean,
y = Name.of.Method,
xend = Mean.2.2,
yend = Name.of.Method))
So as you can see, geom_segment allows us to specify the end position of the line (Hence, xend and yend)
However, it does not look similar to the image you have above.
The line shape seems to represent error bar. Therefore, ggplot provides us with an error bar function.
Method 2 - Using geom_errorbarh()
ggplot(df, aes(y = Name.of.Method, x = Mean)) +
geom_errorbarh(aes(xmin = Mean, xmax = Mean.2.2), linetype = 1, height = .2)
Usually we don't use this method just to draw a line. However, its functionality fits your requirement. You can see that we use xmin and xmax to specify the head and the tail of the line.
The height input is to adjust the height of the bar at the end of the line in both ends.
I would use hbar for this:
from bokeh.io import show, output_file
from bokeh.plotting import figure
output_file("intervals.html")
names = ["SMB", "DB", "SB", "TB"]
p = figure(y_range=names, plot_height=350)
p.hbar(y=names, left=[4,3,2,1], right=[6.2, 5.2, 4.2, 3.2], height=0.3)
show(p)
However, Whisker would also be an option if you really want whiskers instead of interval bars.
