If I have a table with three columns where the first column represents the name of each point, the second column represent numerical data (mean) and the last column represent (second column + fixed number). The following an example how is the data looks like:
I want to plot this table so I have the following figure
If it is possible how I can plot it using either Microsoft Excel or python or R (Bokeh).
Alright, I only know how to do it in ggplot2, I will answer regarding R here.
These method only works if the data-frame is in the format you provided above.
I rename your column to Name.of.Method, Mean, Mean.2.2
Preparation
Loading csv data into R
df <- read.csv('yourdata.csv', sep = ',')
Change column name (Do this if you don't want to change the code below or else you will need to go through each parameter to match your column names.
names(df) <- c("Name.of.Method", "Mean", "Mean.2.2")
Method 1 - Using geom_segment()
ggplot() +
geom_segment(data=df,aes(x = Mean,
y = Name.of.Method,
xend = Mean.2.2,
yend = Name.of.Method))
So as you can see, geom_segment allows us to specify the end position of the line (Hence, xend and yend)
However, it does not look similar to the image you have above.
The line shape seems to represent error bar. Therefore, ggplot provides us with an error bar function.
Method 2 - Using geom_errorbarh()
ggplot(df, aes(y = Name.of.Method, x = Mean)) +
geom_errorbarh(aes(xmin = Mean, xmax = Mean.2.2), linetype = 1, height = .2)
Usually we don't use this method just to draw a line. However, its functionality fits your requirement. You can see that we use xmin and ymin to specify the head and the tail of the line.
The height input is to adjust the height of the bar at the end of the line in both ends.
I would use hbar for this:
from bokeh.io import show, output_file
from bokeh.plotting import figure
output_file("intervals.html")
names = ["SMB", "DB", "SB", "TB"]
p = figure(y_range=names, plot_height=350)
p.hbar(y=names, left=[4,3,2,1], right=[6.2, 5.2, 4.2, 3.2], height=0.3)
show(p)
However Whisker would also be an option if you really want whiskers instead of interval bars.
Related
I am trying to construct a grouped vertical bar chart in Bokeh from a pandas dataframe. I'm struggling with understanding the use of factor_cmap and how the color mapping works with this function. There's an example in the documentation (https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#pandas) that was helpful to follow, here:
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "#mpg_mean"), ("Cyl, Mfr", "#cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
This yields the following (again, a screen shot from the documentation):
Grouped Vbar output
I understand how factor_cmap is working here, I think. The index for the dataframe has multiple factors and we're only taking the first by slicing (as seen with the end = 1). But when I try to instead set coloring based on the second index level, mfr, (setting start = 1 , end = 2) , the index mapping breaks and I get this. I based this change on my assumption that the factors were hierarchical and I needed to slice them to get the second level.
I think I must be thinking about the indexing with these categorical factors wrong, but I'm not sure what I'm doing wrong. How do I get a categorical mapper to color by the second level of the factor? I assumed the format of the factors was ('cyl', 'mfr') but maybe that assumption is wrong?
Here's the documentation for factor_cmap, although it wasn't very helpful: https://docs.bokeh.org/en/latest/docs/reference/transform.html#bokeh.transform.factor_cmap .
If you mean you are trying this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.cyl.unique()),
start=1, end=2)
Then there are at least two issues:
2 is out of bounds for the length of the list of sub-factors ('cyl', 'mfr'). You would just want start=1 and leave end with its default value of None (which means to the end of the list, as usual for any Python slice).
In this specific case, with start=1 that means "colormap based on mfr sub-factors of the values", but you are still configuring the cololormapper with the cylinders as the factors for the map:
factors=sorted(df.cyl.unique())
When the colormapper goes to look up a value with mfr="mazda" in the mapping, it does not find anything (because you only put cylinder values in the mapping) so it gets shaded the default color grey (as expected).
So you could do something like this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.mfr.unique()),
start=1)
Which "works" modulo the fact that there are way more manufacturer values than there are colors in the Spectral5 palette:
In the real situation you'll need to make sure you use a palette as least as big as the number of (sub-)factors that you configure.
I have a data set I filtered to the following (sample data):
Name Time l
1 1.129 1G-d
1 0.113 1G-a
1 3.374 1B-b
1 3.367 1B-c
1 3.374 1B-d
2 3.355 1B-e
2 3.361 1B-a
3 1.129 1G-a
I got this data after filtering the data frame and converting it to CSV file:
# Assigns the new data frame to "df" with the data from only three columns
header = ['Names','Time','l']
df = pd.DataFrame(df_2, columns = header)
# Sorts the data frame by column "Names" as integers
df.Names = df.Names.astype(int)
df = df.sort_values(by=['Names'])
# Changes the data to match format after converting it to int
df.Time=df.Time.astype(int)
df.Time = df.Time/1000
csv_file = df.to_csv(index=False, columns=header, sep=" " )
Now, I am trying to graph lines for each label column data/items with markers.
I want the column l as my line names (labels) - each as a new line, Time as my Y-axis values and Names as my X-axis values.
So, in this case, I would have 7 different lines in the graph with these labels: 1G-d, 1G-a, 1B-b, 1B-c, 1B-d, 1B-e, 1B-a.
I have done the following so far which is the additional settings, but I am not sure how to graph the lines.
plt.xlim(0, 60)
plt.ylim(0, 18)
plt.legend(loc='best')
plt.show()
I used sns.lineplot which comes with hue and I do not want to have name for the label box. Also, in that case, I cannot have the markers without adding new column for style.
I also tried ply.plot but in that case, I am not sure how to have more lines. I can only give x and y values which create only one line.
If there's any other source, please let me know below.
Thanks
The final graph I want to have is like the following but with markers:
You can apply a few tweaks to seaborn's lineplot. Using some created data since your sample isn't really long enough to demonstrate:
# Create data
np.random.seed(2019)
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = pd.DataFrame({'Name':np.repeat(range(1,11), 10),
'Time':np.random.randn(100).cumsum(),
'l':np.random.choice(categories, 100)
})
# Plot
sns.lineplot(data=df, x='Name', y='Time', hue='l', style='l', dashes=False,
markers=True, ci=None, err_style=None)
# Temporarily removing limits based on sample data
#plt.xlim(0, 60)
#plt.ylim(0, 18)
# Remove seaborn legend title & set new title (if desired)
ax = plt.gca()
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles[1:], labels=labels[1:], title='New Title', loc='best')
plt.show()
To apply markers, you have to specify a style variable. This can be the same as hue.
You likely want to remove dashes, ci, and err_style
To remove the seaborn legend title, you can get the handles and labels, then re-add the legend without the first handle and label. You can also specify the location here and set a new title if desired (or just remove title=... for no title).
Edits per comments:
Filtering your data to only a subset of level categories can be done fairly easily via:
categories = ['1G-d', '1G-a', '1B-b', '1B-c', '1B-d', '1B-e', '1B-a']
df = df.loc[df['l'].isin(categories)]
markers=True will fail if there are too many levels. If you are only interested in marking points for aesthetic purposes, you can simply multiply a single marker by the number of categories you are interested in (which you have already created to filter your data to categories of interest): markers='o'*len(categories).
Alternatively, you can specify a custom dictionary to pass to the markers argument:
points = ['o', '*', 'v', '^']
mult = len(categories) // len(points) + (len(categories) % len(points) > 0)
markers = {key:value for (key, value)
in zip(categories, points * mult)}
This will return a dictionary of category-point combinations, cycling over the marker points specified until each item in categories has a point style.
I am a beginner in Python (using Python 3.7 in Spyder 3.3.2 and Anaconda Navigator 1.9.6). I have no problem creating seaborn violin plots, but the moment I try to Facetgrid them I run into issues. I tried using catplot.
Here is my violin plot code (it works):
# Libraries
import seaborn as sns
import pandas as pd
import os # Imports `os`
from matplotlib import pyplot as plt
os.chdir(r"XXXXXX") # Changes directory
os.listdir('.') # Lists all files and directories in current directory
## Data set
File = 'test_eventcountratios.xlsx' # Assigns Excel filename to File
df = pd.read_excel(File)
ax = sns.violinplot(x = df["Timepoint"], y = df["Macrophage Frequency"], palette = "Blues")
ax.set_xticklabels(ax.get_xticklabels(),rotation=30)
My data is long form, so all timepoints are in the first column and "Macrophage Frequency" data are in the second column. All remaining columns represent other cell types. Here is a screenshot of my data spreadsheet
Here is my catplot code (it doesn't work):
g=sns.catplot(data=df, x="Timepoint", y=df["B cell Frequency","Neutrophil Frequency","NK cell Frequency","Macrophage Frequency"],
palette = "Blues",
kind = "violin", split=True)
I get "Key Error: ('B cell Frequency', 'Neutrophil Frequency', 'NK cell Frequency', 'Macrophage Frequency')"
I don't even want to call on each column individually. I would like the code to run through each column (cell type) to gather data and put each column's data into it's own plot.
I stripped the catplot code to basics to see if that worked:
g=sns.catplot(x = df["Timepoint"], y = df["Macrophage Frequency"], palette = "Blues", data=df, kind="violin")
It works and produces a violin plot, but with this error: "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
So...
I want to make a grid of multiple violin plots (Timepoint on X axis, Cell type frequency on Y axis), where each plot takes data from each column. Why am I only successful when I limit my "y" to a single column from my dataframe?
I've Googled all of my errors, but I can't seem to make the right changes to my code. If I change one thing, then I get a new error (like "TypeError: object of type 'NonType' has no len()", "ValueError: num must be 1 <= num <= 0, not 1", etc)
Use this:
g = sns.catplot(x = "Timepoint", y = "Macrophage Frequency", palette = "Blues", data=df, kind="violin")
x and y is simply the column name in df.
Background:
I'm working on a program to show a 2d cross section of 3d data. The data is stored in a simple text csv file in the format x, y, z1, z2, z3, etc. I take a start and end point and flick through the dataset (~110,000 lines) to create a line of points between these two locations, and dump them into an array. This works fine, and fairly quickly (takes about 0.3 seconds). To then display this line, I've been creating a matplotlib stacked bar chart. However, the total run time of the program is about 5.5 seconds. I've narrowed the bulk of it (3 seconds worth) down to the code below.
'values' is an array with the x, y and z values plus a leading identifier, which isn't used in this part of the code. The first plt.bar is plotting the bar sections, and the second is used to create an arbitrary floor of -2000. In order to generate a continuous looking section, I'm using an interval between each bar of zero.
import matplotlib.pyplot as plt
for values in crossSection:
prevNum = None
layerColour = None
if values != None:
for i in range(3, len(values)):
if values[i] != 'n':
num = float(values[i].strip())
if prevNum != None:
plt.bar(spacing, prevNum-num, width=interval, \
bottom=num, color=layerColour, \
edgecolor=None, linewidth=0)
prevNum = num
layerColour = layerParams[i].strip()
if prevNum != None:
plt.bar(spacing, prevNum+2000, width=interval, bottom=-2000, \
color=layerColour, linewidth=0)
spacing += interval
I'm sure there's a more efficient way to do this, but I'm new to Matplotlib and still unfamilar with its capabilities. The other main use of time in the code is:
plt.savefig('output.png')
which takes about a second, but I figure this is to be expected to save the file and I can't do anything about it.
Question:
Is there a faster way of generating the same output (a stacked bar chart or something that looks like one) by using plt.bar() better, or a different Matplotlib function?
EDIT:
I forgot to mention in the original post that I'm using Python 3.2.3 and Matplotlib 1.2.0
Leaving this here in case someone runs into the same problem...
While not exactly the same as using bar(), with a sufficiently large dataset (large enough that using bar() takes a few seconds) the results are indistinguishable from stackplot(). If I sort the data into layers using the method given by tcaswell and feed it into stackplot() the chart is created in 0.2 seconds, rather than 3 seconds.
EDIT
Code provided by tcaswell to turn the data into layers:
accum_values = []
for values in crosssection:
accum_values.append([float(v.strip()) for v iv values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = numpy.zeros(accum_values[0].shape)
It looks like you are drawing each bar, you can pass sequences to bar (see this example)
I think something like:
accum_values = []
for values in crosssection:
accum_values.append([float(v.strip()) for v iv values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = numpy.zeros(accum_values[0].shape)
ax = plt.gca()
spacing = interval*numpy.arange(len(accum_values[0]))
for data,color is zip(accum_values,layer_params):
ax.bar(spacing,data,bottom=bottom,color=color,linewidth=0,width=interval)
bottom += data
will be faster (because each call to bar creates one BarContainer and I suspect the source of your issues is you were creating one for each bar, instead of one for each layer).
I don't really understand what you are doing with the bars that have tops below their bottoms, so I didn't try to implement that, so you will have to adapt this a bit.
I want to plot data using fit function : function f(x) = a+b*x**2. After ploting i have this result:
correlation matrix of the fit parameters:
m n
m 1.000
n -0.935 1.000
My question is : how can i found a correlation coefficient on gnuplot ?
You can use the stats command in gnuplot, which has syntax similar to the plot command:
stats "file.dat" using 2:(f($2)) name "A"
The correlation coefficient will be stored in the A_correlation variable. (With no name specification, it would be STATS_correlation.) You can use it subsequently to plot your data or just print on the screen using the set label command:
set label 1 sprintf("r = %4.2f",A_correlation) at graph 0.1, graph 0.85
You can find more about the stats command in gnuplot documentation.
Although there is no direct solution to this problem, a workaround is possible. I'll illustrate it using python/numpy. First, the part of the gnuplot script that generates the fit and connects with a python script:
file = "my_data.tsv"
f(x)=a+b*(x)
fit f(x) file using 2:3 via a,b
r = system(sprintf("python correlation.py %s",file))
ti = sprintf("y = %.2f + %.2fx (r = %s)", a, b, r)
plot \
file using 2:3 notitle,\
f(x) title ti
This runs correlation.py to retrieve the correlation 'r' in string format. It uses 'r' to generate a title for the fit line. Then, correlation.py:
from numpy import genfromtxt
from numpy import corrcoef
import sys
data = genfromtxt(sys.argv[1], delimiter='\t')
r = corrcoef(data[1:,1],data[1:,2])[0,1]
print("%.3f" % r).lstrip('0')
Here, the first row is assumed to be a header row. Furthermore, the columns to calculate the correlation for are now hardcoded to nr. 1 and 2. Of course, both settings can be changed and turned into arguments as well.
The resulting title of the fit line is (for a personal example):
y = 2.15 + 1.58x (r = .592)
Since you are probably using fit function you can first refer to this link to arrive at R2 values.
The link uses certain existing variables like FIT_WSSR, FIT_NDF to calculate R2 value.
The code for R2 is stated as:
SST = FIT_WSSR/(FIT_NDF+1)
SSE=FIT_WSSR/(FIT_NDF)
SSR=SST-SSE
R2=SSR/SST
The next step would be to show the R^2 values on the graph. Which can be achieved using the code :
set label 1 sprintf("r = %f",R2) at graph 0.7, graph 0.7
If you're looking for a way to calculate the correlation coefficient as defined on this page, you are out of luck using gnuplot as explained in this Google Groups thread.
There are lots of other tools for calculating correlation coefficients, e.g. numpy.