When I append all 10K points and use gcurve, it doesn't plot all of them; only 1,000 of the 10K points are displayed.
My Python version is 2.7.6.
I don't know how to fix this.
In the future I am going to build a data set of more than 30K points and use gcurve.
How can I plot more than 1K points with gcurve?
Can you please help me with this?
Thanks.
I have some code that loads data, enriches it, performs some statistical analysis, and eventually generates heatmaps. The data behind these heatmaps are small pivot tables (i.e. not too many rows and columns), but I'm sure the seaborn heatmap generation is what makes my code so slow.
I've tried different methods for clearing one or several plots and updating my version of seaborn; nothing seems to help speed this up.
This is my first attempt at generating heatmaps. Is seaborn heatmap generation always slow, or is there anything I can optimize in my code? (Keep in mind this code is in a for loop that runs a few hundred times, but the dataframe in each loop is small.)
Thanks for any help you can provide! I saw some similar questions, but nothing has worked so far, so any help is appreciated.
import seaborn as sns
import matplotlib.pyplot as plt

# heatmap_data is the small pivot table built earlier in the loop
ax = sns.heatmap(
    heatmap_data,
    annot=True,
    annot_kws={"size": 30},
    fmt="g",
)
fig = ax.get_figure()
fig.savefig(graph_path)
plt.clf()
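One pattern that often helps with loops like this (a sketch, under the assumption that per-iteration figure setup is part of the cost): create the figure once and reuse it across iterations, and skip the colorbar if it isn't needed. jobs below is a hypothetical stand-in for the loop's per-iteration data and output path.

import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; avoids GUI overhead when only saving files
import matplotlib.pyplot as plt

fig, ax = plt.subplots()  # build the figure once, outside the loop
for heatmap_data, graph_path in jobs:  # 'jobs' is a hypothetical (data, path) iterable
    ax.clear()  # wipe the previous heatmap instead of creating a new figure
    # cbar=False keeps seaborn from stacking a fresh colorbar axes on every pass
    sns.heatmap(heatmap_data, annot=True, annot_kws={"size": 30}, fmt="g", cbar=False, ax=ax)
    fig.savefig(graph_path)
plt.close(fig)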
I'm kind of in a rush to finish this for tomorrow's presentation to the project owner. We are a small group of economics students in Germany trying to figure out machine learning with Python. We set up a random forest classifier and want to show the estimator's important features in a neat plot. After some Google searching we came up with the following solution, which kind of does the trick but leaves us unsatisfied because the labels on the y-axis overlap. The code we used looks like this:
import numpy as np
import matplotlib.pyplot as plt

# normalize importances to a 0-100 scale and sort them ascending
feature_importances = clf.best_estimator_.feature_importances_
feature_importances = 100 * (feature_importances / feature_importances.max())
sorted_idx = np.argsort(feature_importances)
pos = np.arange(sorted_idx.shape[0])
plt.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[sorted_idx])
plt.show()
Due to privacy concerns, let me just say this: the feature names on the y-axis overlap (there are about 30 of them). I looked into the matplotlib documentation to understand how to do this myself, but unfortunately I couldn't find anything helpful. It seems training and testing models is easier than understanding matplotlib and creating plots :D
Thank you so much for helping out and taking the time, I appreciate it.
I see your solution, and I just want to add this link to explain why: How to change spacing between ticks in matplotlib?
The spacing between ticklabels is exclusively determined by the space between ticks on the axes. Therefore the only way to obtain more space between given ticklabels is to make the axes larger.
The question I linked shows that by making the graph large enough, your axis labels would naturally be spaced better.
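As a sketch of that idea applied to the code above (the figsize values are illustrative, not prescriptive):

import matplotlib.pyplot as plt

# a taller figure gives ~30 horizontal bars enough vertical room for their labels
fig, ax = plt.subplots(figsize=(8, 12))
ax.barh(pos, feature_importances[sorted_idx], align='center', height=0.8)
ax.set_yticks(pos)
ax.set_yticklabels(df_year_four.columns[sorted_idx])
fig.tight_layout()  # keeps long feature names from being clipped at the figure edge
plt.show()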
You are using np.argsort, which returns a NumPy array of many indices, and you are using that whole array to label your y-axis, hence the overlapping labels.
My suggestion would be to use a single index into sorted_idx, like:
plt.yticks(pos, df_year_four.columns[sorted_idx[0]])
This will plot only one label.
Got it, guys!
'Geistesblitz', as we say in Germany! (a flash of inspiration)
See the variable feature_importances in the third line from the top? Slice it as feature_importances[:-15]
to view only the top half of the features and loosen up the y-axis. Yes!!! This works well because the dropped features are way less important.
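A sketch of the slicing idea, for anyone reading along; note that np.argsort sorts ascending, so the most important features sit at the end of sorted_idx, and slicing with [-15:] is what keeps the top 15 (whereas [:-15] keeps the bottom of the ranking):

import numpy as np
import matplotlib.pyplot as plt

# keep only the 15 most important features to declutter the y-axis
top_idx = sorted_idx[-15:]  # argsort is ascending, so the largest importances come last
pos = np.arange(len(top_idx))
plt.barh(pos, feature_importances[top_idx], align='center', height=0.8)
plt.yticks(pos, df_year_four.columns[top_idx])
plt.show()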
I have been combining boxplots and stripplots with seaborn, and I noticed that the boxplot outliers often have larger values than the largest values displayed by the stripplot. How can this be? The boxplot outliers, as well as the stripplot points, are supposed to be real data points, right?
This is the code I used to generate the graph:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data_long = pd.melt(data, id_vars=['var'])
sns.boxplot(x='value', y='var', data=data_long, hue='variable', orient='h',
            order=sorted(values), palette='Set3')
sns.stripplot(x='value', y='var', data=data_long, hue='variable', orient='h', dodge=True, palette='Set3',
              edgecolor='black', linewidth=1, jitter=True)
plt.semilogx(basex=2)
Here is an example of the resulting plot: [image omitted]
Does anybody have any idea what is going on?
Best regards.
While making this question nicer, trying to get rid of that -1, I noticed I had order=sorted(values) only in the boxplot; this makes the data differ between the boxplot and the stripplot. Adding the order parameter to the stripplot as well solves the problem.
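For reference, a sketch of the fixed version, with the same order passed to both calls:

order = sorted(values)
sns.boxplot(x='value', y='var', data=data_long, hue='variable', orient='h',
            order=order, palette='Set3')
sns.stripplot(x='value', y='var', data=data_long, hue='variable', orient='h',
              order=order, dodge=True, palette='Set3',
              edgecolor='black', linewidth=1, jitter=True)
plt.semilogx(basex=2)  # note: matplotlib 3.3+ renamed basex to base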
I am new to this area; I have a background in gait and posture analysis.
I have a series of motion files of timestamped coordinates (containing X, Y, and Z in mm) for a number of joints (30).
What would be the simplest way to extract the following from the motion observations? 1) The number of active features (i.e. active joints). 2) The average speed of motion.
Each file has the format NxP, where P is the number of joints and N is the number of frame observations.
What I am looking for is some pointers to possible areas to explore.
Regards,
Dan
A couple of possibilities you might like to explore, both using completely free (and open source) software:
Python + NumPy/SciPy can easily read in your coordinate values and calculate the data you require; it is also possible to plot in 3D using matplotlib (see the sketch below).
You could use your positional data to animate a stick figure in Blender; some of the test blends would provide a good starting point for this.
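A rough sketch of the NumPy route, assuming the coordinates have been reshaped to an (N, P, 3) array of joint positions in mm and the frame rate is known (both are assumptions about your file format, and the threshold is an arbitrary placeholder):

import numpy as np

def summarize_motion(positions, fps, active_threshold_mm=1.0):
    # positions: (N frames, P joints, 3 coordinates in mm); fps: capture rate
    # per-frame displacement of each joint, shape (N-1, P)
    step = np.linalg.norm(np.diff(positions, axis=0), axis=2)
    # call a joint "active" if its mean per-frame movement exceeds the threshold
    # (the 1.0 mm default is a placeholder, not a domain standard)
    active_joints = np.flatnonzero(step.mean(axis=0) > active_threshold_mm)
    # average speed over all joints and frames, converted from mm/frame to mm/s
    avg_speed_mm_s = step.mean() * fps
    return active_joints, avg_speed_mm_s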
I've been using scipy.stats.gaussian_kde but have a few questions about its output. I've plotted the normalised histogram and the gaussian_kde plot on the same graph. Why are the y-values so vastly different? My understanding is that the gaussian_kde plot should roughly touch the tips of the histogram bars. Using the scipy.integrate.quad function, I determined the area under the graph to be 0.7 rather than the 1.0 I expected.
Actually, what I really want is for the gaussian_kde to represent the non-normalised histogram; does anyone know how I can do that?
Your expectations are a little off. The area under each of the KDE's peaks should roughly equal the area in their corresponding bars. That appears to hold, to my eye. Nonadaptive KDEs with a global bandwidth estimate (like scipy.stats.gaussian_kde) tend to broaden multimodal distributions with sharp peaks.
As for the underestimate of the total area under the KDE, I cannot say without the data and the code that you used to do the integration.
In order to make a KDE approximate an unnormalized histogram, you need to multiply it by bin_width * N, where N is the total number of data points.
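A minimal sketch of that scaling, using toy data in place of yours:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

data = np.random.normal(size=500)  # toy sample standing in for the real data

counts, bin_edges, _ = plt.hist(data, bins=30)
bin_width = bin_edges[1] - bin_edges[0]

kde = gaussian_kde(data)
xs = np.linspace(bin_edges[0], bin_edges[-1], 200)
# scale the density up to raw counts: density * N * bin_width
plt.plot(xs, kde(xs) * len(data) * bin_width)
plt.show()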