I'm following this linear regression tutorial. Here's my code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
dataframe = pd.read_fwf('brain_body.txt')
x_values = dataframe[['Brain']]
y_values = dataframe[['Body']]
body_reg = linear_model.LinearRegression()
body_reg.fit(x_values, y_values)
plt.scatter(x_values, y_values)
plt.plot(x_values, body_reg.predict(x_values))
plt.show()
When I run the script, I get no errors, but the graph doesn't seem to account for the y-values. I reduced the data points to three so it's easier to see:
I tried to manually change the y-axis with plt.ylim([-1000,7000]) but no luck.
Thanks for any suggestions!
There's nothing wrong with the code; you just have a few very extreme values relative to the rest of your data. Matplotlib expands the axes to show those extreme values, which bunches all the other points together. Widening your ylim will only increase the effect. Try a much smaller ylim and xlim instead:
plt.ylim([0, 20])
plt.xlim([0, 2])
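As a rough illustration of the zooming idea, here is a minimal sketch. The numbers are made up (they are not the tutorial's brain_body.txt data), np.polyfit stands in for the scikit-learn fit, and the 75th-percentile cutoff is an arbitrary assumption. It picks the axis limits from the data instead of hard-coding them:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: mostly small values plus two extreme points (assumption)
x = np.array([0.1, 0.4, 0.5, 1.0, 1.4, 1.6, 2.5, 3.3, 4000.0, 6600.0])
y = np.array([0.2, 1.0, 1.9, 6.6, 12.5, 11.4, 12.1, 25.6, 5700.0, 2500.0])

# Straight-line fit, used here as a stand-in for LinearRegression
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, slope * x + intercept)

# Zoom in on the bulk of the data; the extreme points fall outside the view
x_max = np.percentile(x, 75)
y_max = np.percentile(y, 75)
ax.set_xlim(0, x_max)
ax.set_ylim(0, y_max)
```

Call plt.show() afterwards to display the figure. If you would rather keep every point visible, log-scaled axes (ax.set_xscale('log'), ax.set_yscale('log')) are another option for data with positive values.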
So following a tutorial, I tried to create a graph using the following code:
time_values = [i for i in range(1,100)]
execution_time = [random.randint(0,100) for i in range(1,100)]
fig = plt.figure()
ax1 = plt.subplot()
threshold=[.8 for i in range(len(execution_time))]
ax1.plot(time_values, execution_time)
ax1.margins(x=-.49, y=0)
ax1.fill_between(time_values,execution_time, 1,where=(execution_time>1), color='r', alpha=.3)
This did not work as I got an error saying I could not compare a list and an int.
However, I then tried:
ax1.fill_between(time_values,execution_time, 1)
And that gave me a graph with all area in between the execution time and the y=1 line, filled in. Since I want the area above the y=1 line filled in, with the area below left un-shaded, I created a list called threshold, and populated it with 1 so that I could recreate the comparison. However,
ax1.fill_between(time_values,execution_time, 1,where=(execution_time>threshold))
and
ax1.fill_between(time_values,execution_time, 1)
create the exact same graph, even though the execution time values do go beyond 1.
I am confused for two reasons:
Firstly, in the tutorial I was watching, the teacher was able to successfully compare a list and an integer within the fill_between function. Why was I not able to do this?
Secondly, why is the where parameter not identifying the regions I want to fill? I.e., why is the graph shading in the areas between the y=1 line and the value of the execution time?
The problem is mainly due to the use of Python lists instead of NumPy arrays. You can use lists, but then you need to use them consistently throughout the code and build the where mask element by element.
import numpy as np
import matplotlib.pyplot as plt
time_values = list(range(1,100))
execution_time = [np.random.randint(0,100) for _ in range(len(time_values))]
threshold = 50
fig, ax = plt.subplots()
ax.plot(time_values, execution_time)
ax.fill_between(time_values, execution_time, threshold,
                where=[e > threshold for e in execution_time],
                color='r', alpha=.3)
ax.set_ylim(0,None)
plt.show()
It is better to use NumPy arrays throughout. That is not only faster, but also easier to code and understand.
import numpy as np
import matplotlib.pyplot as plt
time_values = np.arange(1,100)
execution_time = np.random.randint(0,100, size=len(time_values))
threshold = 50
fig, ax = plt.subplots()
ax.plot(time_values, execution_time)
ax.fill_between(time_values, execution_time, threshold,
                where=(execution_time > threshold), color='r', alpha=.3)
ax.set_ylim(0,None)
plt.show()
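To make the difference between lists and arrays concrete, here is a small sketch with made-up values (the three numbers and the threshold of 50 are only for illustration). It shows what each kind of comparison actually returns in Python 3:

```python
import numpy as np

execution_time = [5, 80, 30]
threshold = [0.8, 0.8, 0.8]

# list > list compares lexicographically and yields ONE bool, not a mask;
# here the first pair (5 vs 0.8) already decides the result
list_mask = execution_time > threshold

# list > int is not defined in Python 3 and raises TypeError
try:
    execution_time > 1
    list_vs_int = "no error"
except TypeError:
    list_vs_int = "TypeError"

# A NumPy array compares element by element and yields a boolean mask
array_mask = np.asarray(execution_time) > 50

print(list_mask, list_vs_int, array_mask)
```

The single bool from the list comparison is probably why the where argument seemed to have no effect in the question. The tutorial most likely compared a NumPy array, not a list, with the integer. Depending on the Matplotlib version, a scalar where may also be rejected outright, so treat this as an explanation of the symptom, not a guarantee of the exact behavior.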
Guys, I'm a chemist and I've finished an experiment that gave me the energies of a metal's d orbitals.
It is relatively easy to get the correct proportion of energies in Excel 1 and use a drawing program like Inkscape to draw the diagram for molecular orbitals (like I did with this one below 2) but I’d love to use python to get a beautiful diagram that considers the energies of my orbitals like we see in the books.
My first attempt using seaborn and swarmplot is obviously too far from the correct approach and maybe (probably!) is not the correct way to get there. I'd be more than happy to achieve something like the right side here in 3.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Energies = [-0.40008, -0.39583, -0.38466, -0.23478, -0.21239]
orbitals = ["dz2", "dxy", "dyz", "dx2y2", "dxz"]
df = pd.DataFrame(Energies)
df["Orbitals"] = pd.DataFrame(orbitals)
sns.swarmplot(y=df[0], size=16)
Thanks for any help.
1 The excel one
2 Drawn by hand using the excel version as the model
3 Extracted from literature
You can draw anything you like from the basic shapes and functions in matplotlib. The energy levels can be simple markers, and the text can be produced with annotate.
import numpy as np
import matplotlib.pyplot as plt
Energies = [-0.40008, -0.39583, -0.38466, -0.23478, -0.21239]
orbitals = ["$d_{z^2}$", "$d_{xy}$", "$d_{yz}$", "$d_{x^2 - y^2}$", "$d_{xz}$"]
x = np.arange(len(Energies))
fig, ax = plt.subplots()
ax.scatter(x, Energies, s=1444, marker="_", linewidth=3, zorder=3)
ax.grid(axis='y')
for xi,yi,tx in zip(x,Energies,orbitals):
    ax.annotate(tx, xy=(xi,yi), xytext=(0,-4), size=18,
                ha="center", va="top", textcoords="offset points")
ax.margins(0.2)
plt.show()
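As a variation on the same idea, here is a sketch that draws each level as a short horizontal line with ax.hlines, so the bar length is set in data units and not by marker size. The half_width value and the "Energy" axis label are assumptions for illustration; the energies and labels are the ones from the question.

```python
import matplotlib.pyplot as plt

energies = [-0.40008, -0.39583, -0.38466, -0.23478, -0.21239]
labels = ["$d_{z^2}$", "$d_{xy}$", "$d_{yz}$", "$d_{x^2 - y^2}$", "$d_{xz}$"]

fig, ax = plt.subplots()
half_width = 0.3  # half the bar length in x data units (arbitrary choice)
positions = range(len(energies))

# One horizontal segment per orbital, centered on its x position
levels = ax.hlines(energies,
                   [p - half_width for p in positions],
                   [p + half_width for p in positions],
                   linewidth=3)

# Put each label just below its level
for p, e, label in zip(positions, energies, labels):
    ax.annotate(label, xy=(p, e), xytext=(0, -4), size=14,
                ha="center", va="top", textcoords="offset points")

ax.set_xticks([])        # the x positions carry no physical meaning
ax.set_ylabel("Energy")  # assumed label; add units as appropriate
ax.margins(0.2)
```

Call plt.show() to display it. Degenerate or near-degenerate levels could share a row by giving them the same energy and neighboring x positions.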
I'm struggling to draw a power law graph for Facebook Data that I found online. I'm using Networkx and I've found how to draw a Degree Histogram and a degree rank. The problem that I'm having is I want the y axis to be a probability so I'm assuming I need to sum up each y value and divide by the total number of nodes? Can anyone please help me do this? Once I've got this I'd like to draw a log-log graph to see if I can obtain a straight line. I'd really appreciate it if anyone could help! Here's my code:
import collections
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community
import math
import pylab as plt
g = nx.read_edgelist("/Users/Michael/Desktop/anaconda3/facebook_combined.txt","r")
nx.info(g)
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)
degreeCount = collections.Counter(degree_sequence)
deg, cnt = zip(*degreeCount.items())
fig, ax = plt.subplots()
plt.bar(deg, cnt, width=0.80, color='b')
plt.title("Degree Histogram for Facebook Data")
plt.ylabel("Count")
plt.xlabel("Degree")
ax.set_xticks([d + 0.4 for d in deg])
ax.set_xticklabels(deg)
plt.show()
plt.loglog(degree_sequence, 'b-', marker='o')
plt.title("Degree rank plot")
plt.ylabel("Degree")
plt.xlabel("Rank")
plt.show()
You seem to be on the right track, but some simplifications will likely help you. The code below uses only 2 libraries.
Without access to your graph, we can use some graph generators instead. I've chosen 2 qualitatively different types here, and deliberately chosen different sizes so that the normalization of the histogram is needed.
import networkx as nx
import matplotlib.pyplot as plt
g1 = nx.scale_free_graph(1000)
g2 = nx.watts_strogatz_graph(2000, 6, p=0.8)
# we don't need to sort the values since the histogram will handle it for us
# (in NetworkX 2+, degree() yields (node, degree) pairs)
deg_g1 = [d for _, d in g1.degree()]
deg_g2 = [d for _, d in g2.degree()]
# there are smarter ways to choose bin locations, but since
# degrees must be discrete, we can be lazy...
max_degree = max(deg_g1 + deg_g2)
# one bin per integer degree, including the maximum
bins = range(0, max_degree + 2)
# plot different styles to see both
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(deg_g1, bins=bins, density=True, histtype='bar', rwidth=0.8, label='g1')
ax.hist(deg_g2, bins=bins, density=True, histtype='step', lw=3, label='g2')
# setup the axes to be log/log scaled
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('degree')
ax.set_ylabel('relative density')
ax.legend()
plt.show()
This produces an output plot like this (both g1,g2 are randomised so won't be identical):
Here we can see that g1 has an approximately straight line decay in the degree distribution -- as expected for scale-free distributions on log-log axes. Conversely, g2 does not have a scale-free degree distribution.
To say anything more formal, you could look at the toolboxes from Aaron Clauset: http://tuvalu.santafe.edu/~aaronc/powerlaws/ which implement model fitting and statistical testing of power-law distributions.
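To address the "divide by the total number of nodes" part of the question directly, here is a minimal sketch that turns a list of degrees into a probability for each degree value. The function name degree_probabilities and the example degrees are made up for illustration.

```python
from collections import Counter

def degree_probabilities(degrees):
    """Return the sorted distinct degrees and the fraction of nodes with each degree."""
    counts = Counter(degrees)
    total = len(degrees)
    ks = sorted(counts)
    return ks, [counts[k] / total for k in ks]

# Made-up degree sequence for an 8-node graph
example_degrees = [1, 1, 1, 2, 2, 3, 5, 8]
ks, probs = degree_probabilities(example_degrees)
print(list(zip(ks, probs)))
```

With the graph from the question, the input would be something like [d for n, d in g.degree()], and plt.loglog(ks, probs, 'o') would then put the probabilities on log-log axes.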
I am currently going through the Kaggle Titanic Machine Learning thing and using http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb to figure it out as I am a relative beginner to Python. I thought I understood what the first few steps were doing and I am trying to recreate an earlier step by making a figure with multiple plots on it. I can't seem to get the plots to actually show up.
Here is my code:
`
import pandas as pd
import numpy as np
import pylab as plt
train=pd.read_csv("train.csv")
#Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))
#Size of matplotlib figures that contain subplots
figsize_with_subplots = (10, 10)
# Size of matplotlib histogram bins
bin_size = 10
females_df = train[train['Sex']== 'female']
print("females_df", females_df)
females_xt = pd.crosstab(females_df['Pclass'],train['Survived'])
females_xt_pct = females_xt.div(females_xt.sum(1).astype(float), axis = 0)
males = train[train['Sex'] == 'male']
males_xt = pd.crosstab(males['Pclass'], train['Survived'])
males_xt_pct= males_xt.div(males_xt.sum(1).astype(float), axis = 0)
plt.figure(5)
plt.subplot(221)
females_xt_pct.plot(kind='bar', title='Female Survival Rate by Pclass')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.subplot(222)
males_xt_pct.plot(kind='bar', title= 'Male Survival Rate by Pclass')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
`
And this is displaying two blank plots separately (one in the 221 location, and then next plot on a new figure in the 222 location) and then another plot with males that actually works at the end. What am I doing wrong here?
To draw a pandas plot on a previously created subplot, you can use the ax argument of the pandas plotting function.
ax=plt.subplot(..)
df.plot(..., ax=ax)
So in this case the code may look like
plt.figure(5)
ax=plt.subplot(221)
females_xt_pct.plot(kind='bar', title='Female Survival Rate by Pclass',ax=ax)
ax2=plt.subplot(222)
males_xt_pct.plot(kind='bar', title= 'Male Survival Rate by Pclass',ax=ax2)
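The same pattern can also be written with plt.subplots, which creates the figure and both axes in one call. The sketch below uses made-up survival fractions (not the real Titanic numbers) so that it runs without train.csv; the variable names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up survival fractions indexed by passenger class (assumption)
females_pct = pd.DataFrame({0: [0.1, 0.2, 0.5], 1: [0.9, 0.8, 0.5]}, index=[1, 2, 3])
males_pct = pd.DataFrame({0: [0.6, 0.8, 0.9], 1: [0.4, 0.2, 0.1]}, index=[1, 2, 3])

# One figure, two axes side by side
fig, (ax_f, ax_m) = plt.subplots(1, 2, figsize=(10, 5), sharey=True)

# Passing ax= tells pandas to draw on the existing axes and not open a new figure
females_pct.plot(kind="bar", title="Female Survival Rate by Pclass", ax=ax_f)
males_pct.plot(kind="bar", title="Male Survival Rate by Pclass", ax=ax_m)

# Label through the axes objects so each label lands on the intended subplot
for ax in (ax_f, ax_m):
    ax.set_xlabel("Passenger Class")
    ax.set_ylabel("Survival Rate")
fig.tight_layout()
```

Call plt.show() to display the figure.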
I have the results of a (H,ranges) = numpy.histogram2d() computation and I'm trying to plot it.
Given H I can easily put it into plt.imshow(H) to get the corresponding image. (see http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow )
My problem is that the axes of the produced image show the cell indices of H and are completely unrelated to the values of ranges.
I know I can use the keyword extent (as pointed out in: Change values on matplotlib imshow() graph axis). But this solution does not work for me: the values in ranges do not grow linearly (they actually grow exponentially).
My question is: how can I use the values of ranges in plt.imshow()? Or, at least, can I manually set the tick label values of the object that plt.imshow returns?
Editing the extent is not a good solution.
You can just change the tick labels to something more appropriate for your data.
For example, here we'll set every 5th pixel to an exponential function:
import numpy as np
import matplotlib.pyplot as plt
im = np.random.rand(21,21)
fig,(ax1,ax2) = plt.subplots(1,2)
ax1.imshow(im)
ax2.imshow(im)
# Where we want the ticks, in pixel locations
ticks = np.linspace(0,20,5)
# What those pixel locations correspond to in data coordinates.
# Also set the float format here
ticklabels = ["{:6.2f}".format(i) for i in np.exp(ticks/5)]
ax2.set_xticks(ticks)
ax2.set_xticklabels(ticklabels)
ax2.set_yticks(ticks)
ax2.set_yticklabels(ticklabels)
plt.show()
Expanding a bit on @thomas's answer:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mi
im = np.random.rand(20, 20)
ticks = np.exp(np.linspace(0, 10, 20))
fig, ax = plt.subplots()
ax.pcolor(ticks, ticks, im, cmap='viridis')
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlim([1, np.exp(10)])
ax.set_ylim([1, np.exp(10)])
By letting mpl take care of the non-linear mapping you can now accurately over-plot other artists. There is a performance hit for this (as pcolor is more expensive to draw than AxesImage), but getting accurate ticks is worth it.
imshow is for displaying images, so it does not support x and y bins.
You could either use pcolor instead,
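# x and y stand for your own sample arrays
H,xedges,yedges = np.histogram2d(x, y)
# histogram2d puts x along the first axis of H, while pcolor expects
# rows to run along y, hence the transpose
plt.pcolor(xedges,yedges,H.T)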
or use plt.hist2d which directly plots your histogram.
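For the plt.hist2d route, here is a minimal sketch with exponentially spaced bin edges. The lognormal samples, the np.logspace range, and the number of bins are all assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up samples spanning several orders of magnitude
rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=1.0, size=5000)
y = rng.lognormal(mean=2.0, sigma=1.0, size=5000)

# 21 exponentially spaced edges give 20 bins per axis, from 0.1 to 1000
edges = np.logspace(-1, 3, 21)

fig, ax = plt.subplots()
# hist2d bins the data and draws it on the real bin edges in one call
H, xedges, yedges, mesh = ax.hist2d(x, y, bins=[edges, edges])

# Log axes make the exponentially growing bins appear evenly sized
ax.set_xscale("log")
ax.set_yscale("log")
fig.colorbar(mesh, ax=ax, label="count")
```

Because the axes carry the real bin edges, the tick values match the data and other artists can be over-plotted in the same coordinates. Call plt.show() to display the figure.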