Issue when trying to plot after applying PCA on a dataset - python-3.x

I am trying to plot the results of PCA of the dataset pima-indians-diabetes.csv. My code shows a problem only in the plotting piece:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how to fix this?

As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
(768, 1)
Which leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
as the first array is already multidimensional, therefore the error is indicating that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of the shape (R, 1)and (R,) this question on StackOverflow provides a nice starting point.

Related

Unable to read data from kdeplot

I have a pandas dataframe with two columns, A and B, named df in the following bits of code.
And I try to plot a kde for each value of B like so:
import seaborn as sbn, numpy as np, pandas as pd
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
fig.savefig("test.png")
I read the following propositions but only those where I compute the kde from scratch using statsmodel or some other module get me somewhere:
Seaborn/Matplotlib: how to access line values in FacetGrid?
Get data points from Seaborn distplot
For curiosity's sake, I would like to know why I am unable to get something from the following code:
kde = sns.kdeplot(data=df, x="A", hue="B", fill=True)
line = kde.lines[0]
x, y = line.get_data()
print(x, y)
The error I get is IndexError: list index out of range. kde.lines has a length of 0.
Accessing the lines through fig.axes[0].lines[0] also raises an IndexError.
All in all, I think I tried everything proposed in the previous threads (I tried switching to displot instead of using kdeplot but this is the same story, only that I have to access axes differently, note displot and not distplot because it is deprecated), but every time I get to .get_lines(), ax.lines, ... what is returned is an empty list. So I can't get any values out of it.
EDIT : Reproducible example
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sbn
# 1. Generate random data
df = pd.DataFrame(columns=["A", "B"])
for i in [1, 2, 3, 5, 7, 8, 10, 12, 15, 17, 20, 40, 50]:
for _ in range(10):
df = df.append({"A": np.random.random() * i, "B": i}, ignore_index=True)
# 2. Plot data
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
# 3. Read data (error)
ax = fig.axes[0]
x, y = ax.lines[0].get_data()
print(x, y)
This happens because using fill=True changes the object that matplotlib draws.
When no fill is used, lines are plotted:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B")
print(ax.lines)
# [<matplotlib.lines.Line2D object at 0x000001F365EF7848>, etc.]
when you use fill, it changes them to PolyCollection objects
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B", fill=True)
print(ax.collections)
# [<matplotlib.collections.PolyCollection object at 0x0000016EE13F39C8>, etc.]
You could draw the kdeplot a second time, but with fill=False so that you have access to the line objects

Pyplot: subsequent plots with a gradient of colours [duplicate]

I am plotting multiple lines on a single plot and I want them to run through the spectrum of a colormap, not just the same 6 or 7 colors. The code is akin to this:
for i in range(20):
for k in range(100):
y[k] = i*x[i]
plt.plot(x,y)
plt.show()
Both with colormap "jet" and another that I imported from seaborn, I get the same 7 colors repeated in the same order. I would like to be able to plot up to ~60 different lines, all with different colors.
The Matplotlib colormaps accept an argument (0..1, scalar or array) which you use to get colors from a colormap. For example:
col = pl.cm.jet([0.25,0.75])
Gives you an array with (two) RGBA colors:
array([[ 0. , 0.50392157, 1. , 1. ],
[ 1. , 0.58169935, 0. , 1. ]])
You can use that to create N different colors:
import numpy as np
import matplotlib.pylab as pl
x = np.linspace(0, 2*np.pi, 64)
y = np.cos(x)
pl.figure()
pl.plot(x,y)
n = 20
colors = pl.cm.jet(np.linspace(0,1,n))
for i in range(n):
pl.plot(x, i*y, color=colors[i])
Bart's solution is nice and simple but has two shortcomings.
plt.colorbar() won't work in a nice way because the line plots aren't mappable (compared to, e.g., an image)
It can be slow for large numbers of lines due to the for loop (though this is maybe not a problem for most applications?)
These issues can be addressed by using LineCollection. However, this isn't too user-friendly in my (humble) opinion. There is an open suggestion on GitHub for adding a multicolor line plot function, similar to the plt.scatter(...) function.
Here is a working example I was able to hack together
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
def multiline(xs, ys, c, ax=None, **kwargs):
"""Plot lines with different colorings
Parameters
----------
xs : iterable container of x coordinates
ys : iterable container of y coordinates
c : iterable container of numbers mapped to colormap
ax (optional): Axes to plot on.
kwargs (optional): passed to LineCollection
Notes:
len(xs) == len(ys) == len(c) is the number of line segments
len(xs[i]) == len(ys[i]) is the number of points for each line (indexed by i)
Returns
-------
lc : LineCollection instance.
"""
# find axes
ax = plt.gca() if ax is None else ax
# create LineCollection
segments = [np.column_stack([x, y]) for x, y in zip(xs, ys)]
lc = LineCollection(segments, **kwargs)
# set coloring of line segments
# Note: I get an error if I pass c as a list here... not sure why.
lc.set_array(np.asarray(c))
# add lines to axes and rescale
# Note: adding a collection doesn't autoscalee xlim/ylim
ax.add_collection(lc)
ax.autoscale()
return lc
Here is a very simple example:
xs = [[0, 1],
[0, 1, 2]]
ys = [[0, 0],
[1, 2, 1]]
c = [0, 1]
lc = multiline(xs, ys, c, cmap='bwr', lw=2)
Produces:
And something a little more sophisticated:
n_lines = 30
x = np.arange(100)
yint = np.arange(0, n_lines*10, 10)
ys = np.array([x + b for b in yint])
xs = np.array([x for i in range(n_lines)]) # could also use np.tile
colors = np.arange(n_lines)
fig, ax = plt.subplots()
lc = multiline(xs, ys, yint, cmap='bwr', lw=2)
axcb = fig.colorbar(lc)
axcb.set_label('Y-intercept')
ax.set_title('Line Collection with mapped colors')
Produces:
Hope this helps!
An anternative to Bart's answer, in which you do not specify the color in each call to plt.plot is to define a new color cycle with set_prop_cycle. His example can be translated into the following code (I've also changed the import of matplotlib to the recommended style):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 2*np.pi, 64)
y = np.cos(x)
n = 20
ax = plt.axes()
ax.set_prop_cycle('color',[plt.cm.jet(i) for i in np.linspace(0, 1, n)])
for i in range(n):
plt.plot(x, i*y)
If you are using continuous color pallets like brg, hsv, jet or the default one then you can do like this:
color = plt.cm.hsv(r) # r is 0 to 1 inclusive
Now you can pass this color value to any API you want like this:
line = matplotlib.lines.Line2D(xdata, ydata, color=color)
This approach seems to me like the most concise, user-friendly and does not require a loop to be used. It does not rely on user-made functions either.
import numpy as np
import matplotlib.pyplot as plt
# make 5 lines
n_lines = 5
x = np.arange(0, 2).reshape(-1, 1)
A = np.linspace(0, 2, n_lines).reshape(1, -1)
Y = x # A
# create colormap
cm = plt.cm.bwr(np.linspace(0, 1, n_lines))
# plot
ax = plt.subplot(111)
ax.set_prop_cycle('color', list(cm))
ax.plot(x, Y)
plt.show()
Resulting figure here

How to deform/scale a 3 dimensional numpy array in one dimension?

I would like to deform/scale a three dimensional numpy array in one dimension. I will visualize my problem in 2D:
I have the original image, which is a 2D numpy array:
Then I want to deform/scale it for some factor in dimension 0, or horizontal dimension:
For PIL images, there are a lot of solutions, for example in pytorch, but what if I have a numpy array of shapes (w, h, d) = (288, 288, 468)? I would like to upsample the width with a factor of 1.04, for example, to (299, 288, 468). Each cell contains a normalized number between 0 and 1.
I am not sure, if I am simply not looking for the correct vocabulary, if I try to search online. So also correcting my question would help. Or tell me the mathematical background of this problem, then I can write the code on my own.
Thank you!
You can repeat the array along the specific axis a number of times equal to ceil(factor) where factor > 1 and then evenly space indices on the stretched dimension to select int(factor * old_length) elements. This does not perform any kind of interpolation but just repeats some of the elements:
import math
import cv2
import numpy as np
from scipy.ndimage import imread
img = imread('/tmp/example.png')
print(img.shape) # (512, 512)
axis = 1
factor = 1.25
stretched = np.repeat(img, math.ceil(factor), axis=axis)
print(stretched.shape) # (512, 1024)
indices = np.linspace(0, stretched.shape[axis] - 1, int(img.shape[axis] * factor))
indices = np.rint(indices).astype(int)
result = np.take(stretched, indices, axis=axis)
print(result.shape) # (512, 640)
cv2.imwrite('/tmp/stretched.png', result)
This is the result (left is original example.png and right is stretched.png):
Looks like it is as easy as using the torch.nn.functional.interpolate functional from pytorch and choosing 'trilinear' as interpolation mode:
import torch
PET = torch.tensor(data)
print("Old shape = {}".format(PET.shape))
scale_factor_x = 1.4
# Scaling.
PET = torch.nn.functional.interpolate(PET.unsqueeze(0).unsqueeze(0),\
scale_factor=(scale_factor_x, 1, 1), mode='trilinear').squeeze().squeeze()
print("New shape = {}".format(PET.shape))
output:
>>> Old shape = torch.Size([288, 288, 468])
>>> New shape = torch.Size([403, 288, 468])
I verified the results by looking at the data, but I can't show them here due to data privacy. Sorry!
This is an example for linear up-sampling a 3D Image with scipy.interpolate, hope it helps.
(I worked quite a lot with np.meshgrid here, if you not familiar with it i recently explained it here)
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy.interpolate import RegularGridInterpolator
# should be 1.3.0
print(scipy.__version__)
# =============================================================================
# producing a test image "image3D"
# =============================================================================
def some_function(x,y,z):
# output is a 3D Gaussian with some periodic modification
# its only for testing so this part is not impotent
out = np.sin(2*np.pi*x)*np.cos(np.pi*y)*np.cos(4*np.pi*z)*np.exp(-(x**2+y**2+z**2))
return out
# define a grid to evaluate the function on.
# the dimension of the 3D-Image will be (20,20,20)
N = 20
x = np.linspace(-1,1,N)
y = np.linspace(-1,1,N)
z = np.linspace(-1,1,N)
xx, yy, zz = np.meshgrid(x,y,z,indexing ='ij')
image3D = some_function(xx,yy,zz)
# =============================================================================
# plot the testimage "image3D"
# you will see 5 images that corresponds to the slicing of the
# z-axis similar to your example picture_
# https://sites.google.com/site/linhvtlam2/fl7_ctslices.jpg
# =============================================================================
def plot_slices(image_3d):
f, loax = plt.subplots(1,5,figsize=(15,5))
loax = loax.flatten()
for ii,i in enumerate([8,9,10,11,12]):
loax[ii].imshow(image_3d[:,:,i],vmin=image_3d.min(),vmax=image_3d.max())
plt.show()
plot_slices(image3D)
# =============================================================================
# interpolate the image
# =============================================================================
interpolation_function = RegularGridInterpolator((x, y, z), image3D, method = 'linear')
# =============================================================================
# evaluate at new grid
# =============================================================================
# create the new grid that you want
x_new = np.linspace(-1,1,30)
y_new = np.linspace(-1,1,40)
z_new = np.linspace(-1,1,N)
xx_new, yy_new, zz_new = np.meshgrid(x_new,y_new,z_new,indexing ='ij')
# change the order of the points to match the input shape of the interpolation
# function. That's a bit messy but i couldn't figure out a way around that
evaluation_points = np.rollaxis(np.array([xx_new,yy_new,zz_new]),0,4)
interpolated = interpolation_function(evaluation_points)
plot_slices(interpolated)
The original (20,20,20) dimensional 3D Image:
And the upsampeled (30,40,20) dimensional 3D Image:

Python Shape function for k-means clustering

I have one geotiff grey scale image which gave me the (4377, 6172) 2D array. In the first part, I am considering (:1024, :1024) values(Total values are -> 1024 * 1024 = 1048576) for my compression algorithm. Through this algorithm, I am getting total 4 values in finalmatrix list var through the algorithm. After this, I am applying K-means algorithm on that values. A program is below :
import numpy as np
from osgeo import gdal
from sklearn import cluster
import matplotlib.pyplot as plt
dataset =gdal.Open("1.tif")
band = dataset.GetRasterBand(1)
img = band.ReadAsArray()
finalmat = [255, 0, 2, 2]
#Converting list to array for dimensional change
ay = np.asarray(finalmat).reshape(-1,1)
fig = plt.figure()
k_means = cluster.KMeans(n_clusters=2)
k_means.fit(ay)
cluster_means = k_means.cluster_centers_.squeeze()
a_clustered = k_means.labels_
print('# of observation :',ay.shape)
print('Cluster Means : ', cluster_means)
a_clustered.shape= img.shape
fig=plt.figure(figsize=(125,125))
ax = plt.subplot(2,4,8)
plt.axis('off')
xlabel = str(1) , ' clusters'
ax.set_title(xlabel)
plt.imshow(a_clustered)
plt.show()
fig.savefig('kmeans-1 clust ndvi08jan2010_guj 12 .png')
In the above Program I am getting error in the line a_clustered.shape= img.shape. The error which I am getting is below:
Error line:
a_clustered.shape= img.shape
ValueError: cannot reshape array of size 4 into shape (4377,6172)
<matplotlib.figure.Figure at 0x7fb7c63975c0>
Actually, I want to visualize the clustering on Original image through compressed value which I am getting. Can you please give suggestion what to do
It does not make a lot of sense to use KMeans on 1 dimensional data.
And it makes even less sense to use it on a 4 x 1 array!
Your site then comes from the fact that you can't just resize a 4 x 1 integer array into a large picture.
Just print the array a_clustered you are trying to plot. It probably contains [0, 1, 1, 1].

How to plot accuracy bars for each feature of an array

I have a data set "x" and its label vector "y". I want to plot the accuracy for each attribute (for each column of "x") after applying NaiveBayes and cross-validation. I want a bar graph.
So at the end I need to have 3 bars, because "x" has 3 columns. And the classification has to run 3 times. 3 different accuracies for each feature.
Whenever I execute my code it shows:
ValueError: Found arrays with inconsistent numbers of samples: [1 3]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
What am I doing wrong?
import matplotlib.pyplot as plt
import numpy as np
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
x = np.array([[0, 0.51, 0.00101], [3, 0.54, 0.00105], [6, 0.57, 0.00108], [9, 0.60, 0.00111], [1, 0.73, 0.00114], [5, 0.76, 0.00117], [8, 0.89, 120]])
y = np.array([1, 0, 0, 1, 1, 1, 0])
scores = list()
scores_std = list()
for i in range(x.shape[1]):
xA=x[:, i]
scoresKF2 = cross_validation.cross_val_score(clf, xA, y, cv=2)
scores.append(np.mean(scoresKF2))
scores_std.append(np.std(scoresKF2))
plt.bar(x[:,i], scores)
plt.show()
Checking the shape of your input data, xA, shows us that it is 1-dimensional -- specifically, it is (7,) shape. As the warning tells us, you are not allowed to pass in a 1d array here. The key to solving this in the warning that was returned Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. Therefore, since it is just a single feature, do this xA = x[:,i].reshape(-1, 1) instead of xA = x[:,i].
I think there is another issue with the plotting. I'm not completely sure what you are expecting to see but you should probably replace plt.bar(x[:,i], scores) with plt.bar(i, np.mean(scoresKF2)).

Resources