Need help speeding up seaborn heatmap to png in a for loop - python-3.x

I have some code that will load data, enrich it, perform certain statistical analysis, and eventually generate heatmaps. The data on which these heatmaps are based are small pivot tables (i.e. not too many rows and columns), but I'm sure the seaborn heatmap generation is what makes my code so slow.
I've tried different methods for clearing one or several plots and updating my version of seaborn, but nothing seems to help speed this up.
This is my first effort at generating heatmaps. Is seaborn heatmap generation always slow, or is there anything I can optimize in my code? (Keep in mind that this code is in a for loop that's a few hundred iterations long, but the dataframe in each iteration is small.)
Thanks for any help you can provide! I saw some similar questions but nothing that's worked so far so any help is appreciated
ax = sns.heatmap(
    heatmap_data,
    annot=True,
    annot_kws={"size": 30},
    fmt="g",
)
fig.savefig(graph_path)
matplotlib.pyplot.clf()
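One thing that may be worth trying (a sketch under assumptions, not a confirmed fix for the code above): create a single figure once, redraw into it on every iteration, and close it when the loop ends, so that figures cannot pile up in memory. The non-interactive Agg backend is selected because only PNG files are needed. The helper name save_heatmaps, the frames mapping, and the output directory are hypothetical; the snippet in the question does not show how fig is created.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to files
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def save_heatmaps(frames, out_dir):
    """Save one annotated heatmap PNG per DataFrame, reusing a single figure.

    `frames` maps a (hypothetical) name to a small pivot table.
    Returns the list of written file paths.
    """
    fig = plt.figure(figsize=(8, 6))  # created once, outside the loop
    paths = []
    for name, heatmap_data in frames.items():
        fig.clf()  # wipe the previous heatmap and its colorbar
        ax = fig.add_subplot(111)
        sns.heatmap(heatmap_data, annot=True, annot_kws={"size": 30},
                    fmt="g", ax=ax)
        graph_path = os.path.join(out_dir, f"{name}.png")
        fig.savefig(graph_path)
        paths.append(graph_path)
    plt.close(fig)  # release the figure once the loop is done
    return paths


# Example with made-up data standing in for the real pivot tables
rng = np.random.default_rng(0)
frames = {
    f"pivot_{i}": pd.DataFrame(rng.integers(0, 10, size=(3, 4)))
    for i in range(3)
}
out_dir = tempfile.mkdtemp()
written = save_heatmaps(frames, out_dir)
```

Whether this helps depends on where the time actually goes, so timing a few iterations before and after the change is a sensible first check.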

Related

Overlapping/crowded labels on y-axis python [duplicate]

I'm kind of in a rush to finish this for tomorrow's presentation to the project owner. We are a small group of economics students in Germany trying to figure out machine learning with Python. We set up a Random Forest Classifier and are desperate to show the estimator's important features in a neat plot. After some Google searching we came up with the following solution, which kind of does the trick but leaves us unsatisfied because the labels on the y-axis overlap. The code we used looks like this:
feature_importances = clf.best_estimator_.feature_importances_
feature_importances = 100 * (feature_importances / feature_importances.max())
sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])
plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx])
plt.show()
For privacy reasons I can't show the plot itself, but the feature names on the y-axis are overlapping (there are about 30 of them). I looked into the matplotlib documentation to work out how to fix this myself, but unfortunately I couldn't find anything helpful. Seems like training and testing models is easier than understanding matplotlib and creating plots :D
Thank you so much for helping out and taking the time, I appreciate it.
I see your solution, and I want to just add this link here to explain why: How to change spacing between ticks in matplotlib?
The spacing between ticklabels is exclusively determined by the space between ticks on the axes. Therefore the only way to obtain more space between given ticklabels is to make the axes larger.
The question I linked shows that by making the graph large enough, your axis labels would naturally be spaced better.
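As a rough sketch of that idea, the figure height can be scaled with the number of bars. The feature names and importances below are made up, and 0.3 inches per label is an assumed value to tune by eye.

```python
import matplotlib
matplotlib.use("Agg")  # assumed: draw off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for the real importances and column names
rng = np.random.default_rng(0)
feature_importances = rng.random(30)
feature_names = np.array([f"feature_{i}" for i in range(30)])

sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])

# Taller axes give each tick label more vertical room
inches_per_label = 0.3  # assumed value, adjust to taste
fig, ax = plt.subplots(figsize=(8, inches_per_label * len(pos) + 1))
ax.barh(pos, feature_importances[sorted_idx], align="center", height=0.8)
ax.set_yticks(pos)
ax.set_yticklabels(feature_names[sorted_idx])
fig.tight_layout()
```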
You are using np.argsort, which returns a NumPy array of many indices, and you are using that array to select the labels for your y-axis, so the labels overlap.
My suggestion would be to index into sorted_idx, like this:
plt.yticks(pos, df_year_four.columns[sorted_idx[0]])
This will plot only one label.
Got it guys!
'Geistesblitz', as we say in Germany! (a flash of inspiration, literally "spirit lightning")
See the variable feature_importances in the third row from the top? Add feature_importances[:-15]
to view only the top half of the features and loosen up the y-axis. Yes! This works well because there are far fewer important features.
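A hedged sketch of this kind of truncation: np.argsort sorts in ascending order, so the most important features sit at the end of sorted_idx, and one way to keep only the top k is to slice the index array rather than the raw importances. The values, names, and k below are made up.

```python
import numpy as np

# Made-up importances and names standing in for the real model output
feature_importances = np.array([5.0, 40.0, 100.0, 12.0, 70.0, 1.0])
feature_names = np.array(["a", "b", "c", "d", "e", "f"])

k = 3  # assumed number of features to keep
sorted_idx = np.argsort(feature_importances)  # ascending order
top_idx = sorted_idx[-k:]                     # indices of the k largest values

top_names = feature_names[top_idx]            # labels for plt.yticks
top_values = feature_importances[top_idx]     # bar lengths for plt.barh
pos = np.arange(k)                            # bar positions
```

Here top_names and top_values take the place of df_year_four.columns[sorted_idx] and feature_importances[sorted_idx] in the plotting calls.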

Is ColumnDataSource() the only way to get plots updated in a bokeh web app?

My data is in a large multi-indexed pandas DataFrame. I re-index to flatten the DataFrame and then feed it through ColumnDataSource, but I need to group my data row wise in order to plot it correctly (think bunch of torque curves corresponding to a bunch of gears for a car). If I just plot the dictionary output of ColumnDataSource, it's a mess.
I've tried converting the ColumnDataSource output back to DataFrame, but then I lose the update functionality, the callback won't touch the DataFrame, and the plots won't change. Anyone have any ideas?
The short answer to the question in the title is "Yes". The ColumnDataSource is the special, central data structure of Bokeh. It provides the data for all the glyphs in a plot, or the content in data tables, and automatically keeps that data synchronized on the Python and JavaScript sides, so that you don't have to, e.g., write a bunch of low-level websocket code yourself. To update things like glyphs in a plot, you update the CDS that drives them.
It's possible there are improvements that could be made in your approach to updating the CDS, but it is impossible to speculate without seeing actual code for what you have tried.
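Without the original code this can only be a guess, but one possible shape for the data is a dict with one entry per curve, which is what a multi-line glyph expects. The sketch below uses only pandas; the helper name curves_to_cds_dict and the column names gear, rpm, and torque are hypothetical, and the Bokeh calls appear only as comments.

```python
import pandas as pd


def curves_to_cds_dict(df, group_col, x_col, y_col):
    """Collapse a flat table into one row per group (one row per curve)."""
    data = {group_col: [], "xs": [], "ys": []}
    for key, group in df.groupby(group_col, sort=True):
        data[group_col].append(key)
        data["xs"].append(group[x_col].tolist())
        data["ys"].append(group[y_col].tolist())
    return data


# Made-up torque curves for two gears
df = pd.DataFrame({
    "gear": [1, 1, 1, 2, 2, 2],
    "rpm": [1000, 2000, 3000, 1000, 2000, 3000],
    "torque": [100, 140, 120, 80, 110, 95],
})
new_data = curves_to_cds_dict(df, "gear", "rpm", "torque")

# With Bokeh (not imported here), the same dict could drive the plot:
#   source = ColumnDataSource(data=new_data)
#   plot.multi_line(xs="xs", ys="ys", source=source)
# and inside a callback the whole dict would be replaced in one step:
#   source.data = curves_to_cds_dict(filtered_df, "gear", "rpm", "torque")
```

Keeping the DataFrame as the working copy and rebuilding the dict inside the callback means the grouping logic stays in pandas while the CDS remains the only thing the plot reads from.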

Stripplot and boxplot outliers do not overlap

I have been combining boxplots and stripplots with seaborn, and I noticed that the boxplot outliers often have larger values than the largest values displayed by the stripplot. How can this be? The boxplot outliers as well as the stripplot points are supposed to be real data points, right?
This is the code I used to generate the graph:
data_long = pd.melt(data, id_vars=['var'])
sns.boxplot(x='value', y='var', data=data_long, hue='variable', orient='h',
            order=sorted(values), palette='Set3')
sns.stripplot(x='value', y='var', data=data_long, hue='variable', orient='h', dodge=True, palette='Set3',
              edgecolor='black', linewidth=1, jitter=True)
plt.semilogx(basex=2)
Here is the example:
Does anybody have any idea what is going on?
Highest regards.
As I'm making this question nicer, trying to get rid of that -1, I noticed I have order=(values) only in the boxplot, this makes the data differ between the box and the stripplot. Adding the order parameter also to the stripplot solves the problem.
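A minimal sketch of that fix, with made-up data: collecting the arguments that both layers must agree on (including order) in one dict makes it harder for the two calls to drift apart. The name shared and the sample values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumed: draw off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up long-format data in the same layout as data_long
rng = np.random.default_rng(0)
data_long = pd.DataFrame({
    "var": np.repeat(["c", "a", "b"], 20),
    "variable": np.tile(["x", "y"], 30),
    "value": rng.lognormal(size=60),
})

# Arguments that must match so both layers put each category on the same row
shared = dict(
    x="value", y="var", hue="variable", data=data_long,
    order=sorted(data_long["var"].unique()), orient="h", palette="Set3",
)

fig, ax = plt.subplots()
sns.boxplot(ax=ax, **shared)
sns.stripplot(ax=ax, dodge=True, edgecolor="black", linewidth=1,
              jitter=True, **shared)
```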

How to visualize error surface in keras?

We see pretty pictures of an error surface with a global minimum and the convergence of a neural network in many books. How can I visualize something similar in Keras, i.e., a plot containing the error surface and showing how my model converges to the global minimum error? Below is an example image of such illustrations. And this link has an animated illustration of different optimizers. I explored the TensorBoard log callback for this purpose but could not find any such thing. A little guidance will be appreciated.
The pictures and animations are made for didactic purposes, but the error surface is completely unknown (or too complex to be understood or visualized). That's the whole idea behind using gradient descent.
We only know, at a single point, the direction in which the function increases, by computing the current gradient.
You could try to plot the way (line) you're following by getting the weights values at each iteration and the error, but then you'd face another problem: it's a massively multidimensional function. It's not actually a surface. The number of variables is the number of weights you have in the model (often thousands or even millions). This is absolutely impossible to visualize or even conceive as a visual thing.
To plot such a surface, you'd have to manually change all thousands of weights to get the error for each arrangement. Besides the "impossible to visualize" problem, this would be excessively time consuming.
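For a toy model with only two weights, the idea can still be illustrated. The sketch below uses plain NumPy rather than Keras, with made-up data and an assumed learning rate: it fits y = w0 + w1 * x by gradient descent, records the path of the weights and the error at each step, and evaluates the error on a grid, which is feasible only because there are two weights.

```python
import numpy as np

# Made-up, noise-free data generated from y = 2 + 3x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 + 3.0 * x


def mse(w0, w1):
    """Mean squared error of the line w0 + w1 * x on the toy data."""
    return float(np.mean((w0 + w1 * x - y) ** 2))


# Gradient descent, recording the weights and the error at every step
w = np.array([0.0, 0.0])
lr = 0.1  # assumed learning rate
path = [w.copy()]
losses = [mse(*w)]
for _ in range(200):
    residual = w[0] + w[1] * x - y
    grad = np.array([2 * residual.mean(), 2 * (residual * x).mean()])
    w = w - lr * grad
    path.append(w.copy())
    losses.append(mse(*w))
path = np.array(path)

# Error evaluated on a grid of (w0, w1) values: the "surface"
w0_grid, w1_grid = np.meshgrid(np.linspace(-1, 5, 61), np.linspace(0, 6, 61))
surface = np.array([
    [mse(a, b) for a, b in zip(row0, row1)]
    for row0, row1 in zip(w0_grid, w1_grid)
])

# To draw it: plt.contour(w0_grid, w1_grid, surface) for the surface,
# then plt.plot(path[:, 0], path[:, 1]) for the descent path on top.
```

With thousands of weights the grid step has no direct equivalent, which is the point made above; only the recorded loss per iteration carries over unchanged.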

Is overwriting happening in the following code, and how to avoid it?

I wrote the following piece of code (given at the end of my question), which is error-free, but I think it has an overwriting problem while running. During the program, there are two cases where I want to draw graphs: first, the graphs of the curves drawn with ezplot, and second, the regression plots, where I want to draw regression lines.
When I skip the call plotregression(C_i, D_i), it has no problem displaying the graphs of all five logistic functions (actually, one user here showed me the hold on/hold off commands to help with that), but when I include plotregression(C_i, D_i), two things happen:
1. It shows me all the regression lines, but instead of putting them all in the same figure, it keeps replacing the regression line as the regression coefficients vary. You can actually see this happening if you run the code.
2. The effect of plotregression(C_i, D_i) is gone; it no longer plots the graphs of the five logistic functions.
I have two questions:
1. If I want to get two figures, one showing all five logistic curves and the other showing all five regression curves, how can I modify the program minimally to get the job done?
2. How can I stop overwriting the regression curves? I used hold on/hold off to avoid this for the logistic curves, but is it not working for the regression curves?
Here's the code:
syms t;
X=[]; Y=[]; Z=[]; G=[];  % assumed initialization; not shown in the original snippet
hold on;
for i=1:5;
    P_i=0.009;
    r_i=abs(sin(i.^-1));
    y_i(t)= P_i*exp(r_i*t)/(1+P_i*(exp(r_i*t)-1));
    t_1= 1+rand; t_2= 16+rand; t_3=31+rand;
    time_points=[1, t_1; 1, t_2; 1, t_3];
    biomarker_values= double([y_i(t_1);y_i(t_2);y_i(t_3)]);
    X=vertcat(X,time_points);
    Z=blkdiag(Z,time_points);
    Y=vertcat(Y,biomarker_values);
    G=vertcat(G,[i,i,i]');
    ezplot(y_i,[-50,100]);
    C_i=time_points(:,2)
    D_i=biomarker_values
    plotregression(C_i, D_i)
end;
hold off;
