Anaconda/Jupyter Notebook/Python 3 KeyError: "Columns not found" - python-3.x

I am trying to make a bar graph with Anaconda/Python 3 with the following code but I keep on getting an error. I first read a dataset that has been formatted to be an excel/csv file and that works fine because I am able to make correlation plots as well as some other plots successfully. I tried the same code with both an excel file and csv file that contain the same data and I get the same error. When the program attempts to execute the line responsible for making a bar graph, I get the following error:
KeyError Traceback (most recent call last)
<ipython-input-4-b72668216a77> in <module>
64
65 # Compare mean and standard deviation between attributes
---> 66 compare = dataset.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve']].mean().reset_index()
67 # compare = dataset.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve']].mean().reset_index()
68
~\Anaconda3\lib\site-packages\pandas\core\base.py in __getitem__(self, key)
263 bad_keys = list(set(key).difference(self.obj.columns))
264 raise KeyError("Columns not found: {missing}"
--> 265 .format(missing=str(bad_keys)[1:-1]))
266 return self._gotitem(list(key), ndim=2)
267
KeyError: "Columns not found: 'mean_dmsnr_curve', 'std_profile', 'std_dmsnr_curve', 'skewness_dmsnr_curve', 'mean_profile', 'kurtosis_profile', 'skewness_profile', 'kurtosis_dmsnr_curve'"
Other plots work fine but the comparison bar graph plot does not. Here is the code that I am trying to execute:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import itertools
warnings.filterwarnings("ignore")
#matplotlib inline
print("This is a test right before excel file is read")
# Utilize the pandas library to read the data
dataset = pd.read_csv(r'C:\userpath\Machine Learning Project\pulsar_stars_test.csv')
#dataset = pd.read_excel(r'C:\userpath\Machine Learning Project\pulsar_stars_test.xlsx')
# Print the number of rows and columns that the data has to the user
print("This is the number of rows: ", dataset.shape[0])
print("This is the number of columns: ", dataset.shape[1])
# Use pandas to print out the information about the data
print("This is the data information: ", dataset.info())
# Use pandas to display information about missing data
print("This is the missing data: ", dataset.isnull().sum())
# Make a figure appear to display a dataset summary to the user
plt.figure(figsize = (12, 8))
sns.heatmap(dataset.describe()[1:].transpose(), annot = True, linecolor = "w", linewidth = 2, cmap = sns.color_palette("Set2"))
plt.title("Data Summary")
plt.show()
# Instantiate another figure to display some correlation data to the user
correlation = dataset.corr()
plt.figure(figsize = (10, 8))
sns.heatmap(correlation, annot = True, cmap = sns.color_palette("magma"), linewidth = 2, edgecolor = "k")
plt.title("CORRELATION BETWEEN VARIABLES")
plt.show()
# Compute the proportion of each target variabibble in the dataset
plt.figure(figsize = (12, 6))
plt.subplot(121)
ax = sns.countplot(y = dataset["target_class"], palette = ["r", "g"], linewidth = 1, edgecolor = "k"*2)
for i, j in enumerate(dataset["target_class"].value_counts().values):
ax.text(.7, i, j, weight = "bold", fontsize = 27)
plt.title("Count for target variable in dataset")
plt.subplot(122)
plt.pie(dataset["target_class"].value_counts().values, labels = ["not pulsar stars", "pulsar stars"], autopct = "%1.0f%%", wedgeprops = {"linewidth":2, "edgecolor":"white"})
#plt.pie(data["target_class"].value_counts().values, labels = ["not pulsar stars", "pulsar stars"], autopct = "%1.0f%%", wedgeprops = {"linewidth":2, "edgecolor":"white"})
my_circ = plt.Circle((0,0), .7, color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variabibble in dataset")
plt.show()
# Compare mean and standard deviation between attributes
compare = dataset.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve']].mean().reset_index()
# compare = dataset.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve', 'skewness_dmsnr_curve']].mean().reset_index()
compare = compare.drop("target_class", axis = 1)
compare.plot(kind = "bar", width = 0.6, figsize = (13,6), colormap = "Set2")
plt.grid(True, alpha = 0.3)
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")
# Second comparison plot
compare1.dataset.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'mean_dmsnr_curve', 'std_dmsnr_curve','kurtosis_dmsnr_curve','skewness_dmsnr_curve']].mean().reset_index()
compare1 = compare1.drop("target_class", axis = 1)
compare1.plot(kind = "bar", width = 0.6, figsize = (13, 6), colormap = "Set2")
plt.grid(True, alpha = 0.3)
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.show()
The program fails when it hits the line (line 66) at which the groupby function is called for the first time in the program. Does anybody know how to fix this?

Plz update your version of pandas ie package whatever u r using....plz update that...then and then key error will be solve

Related

ValueError: x and y must have same first dimension, but have shapes (2140699,) and (4281398,)

I use miniconda jupyter notebook python and I'm trying to implement a machine (Audio filtering). I got this error and I really don't know how to fix it.
Here I imported libraries that I need with the path of the file:
import wave as we
import numpy as np
import matplotlib.pyplot as plt
dir = r'/home/pc/Downloads/Bubble audios'
Here the fuction that should plot the graph:
def read_wav(wavfile, plots=True, normal=False):
f = wavfile
params = f.getparams()
# print(params)
nchannels, sampwidth, framerate, nframes = params[:4]
strData = f.readframes(nframes) # , string format
waveData = np.frombuffer(strData, dtype=np.int16) # Convert a string to an int
# wave amplitude normalization
if normal == True:
waveData = waveData*1.0/(max(abs(waveData)))
#
if plots == True:
time = np.arange(0, nframes ,dtype=np.int16) *(1.0 / framerate)
plt.figure(dpi=100)
plt.plot(time, waveData)
plt.xlabel("Time")
plt.ylabel("Amplitude")
plt.title("Single channel wavedata")
plt.show()
return (Wave, time)
def fft_wav(waveData, plots=True):
f_array = np.fft.fft(waveData) # Fourier transform, the result is a complex array
f_abs = f_array
axis_f = np.linspace(0, 250, np.int(len(f_array)/2)) # map to 250
# axis_f = np.linspace(0, 250, np.int(len(f_array))) # map to 250
if plots == True:
plt.figure(dpi=100)
plt.plot(axis_f, np.abs(f_abs[0:len(axis_f)]))
# plt.plot(axis_f, np.abs(f_abs))
plt.xlabel("Frequency")
plt.ylabel("Amplitude spectrum")
plt.title("Tile map")
plt.show()
return f_abs
And here I call the function with the file that I want to be read and plotted.
f = we.open(dir+r'/Ars1_Aufnahme.wav', 'rb')
Wave, time = read_wav(f)
The error that I got:
ValueError: x and y must have same first dimension, but have shapes (2140699,) and (4281398,)
I tried to use np.reshape but it didn't work or I might have used it wrong. So, any advice?
it's seems that your time is 1/2 of the size of your wave. Maybe your nframe is too short. If you do nframses = 2*nframes what is the error ?

Coordinate conversion problem of a FITS file

I have loaded and plotted a FITS file in python.
With the help of a previous post, I have managed to get the conversion of the axis from pixels to celestial coordinates. But I can't manage to get them in milliarcseconds (mas) correctly.
The code is the following
import numpy as np
import matplotlib.pyplot as plt
import astropy.units as u
from astropy.wcs import WCS
from astropy.io import fits
from astropy.utils.data import get_pkg_data_filename
filename = get_pkg_data_filename('hallo.fits')
hdu = fits.open(filename)[0]
wcs = WCS(hdu.header).celestial
wcs.wcs.crval = [0,0]
plt.subplot(projection=wcs)
plt.imshow(hdu.data[0][0], origin='lower')
plt.xlim(200,800)
plt.ylim(200,800)
plt.xlabel('Relative R.A ()')
plt.ylabel('Relative Dec ()')
plt.colorbar()
The output looks like
The y-label is cut for some reason, I do not know.
As it was shown in another post, one could use
wcs.wcs.ctype = [ 'XOFFSET' , 'YOFFSET' ]
to switch it to milliarcsecond, and I get
but the scale is incorrect!.
For instance, 0deg00min00.02sec should be 20 mas and not 0.000002!
Did I miss something here?
Looks like a spectral index map. Nice!
I think the issue might be that FITS implicitly uses degrees for values like CDELT. And they should be converted to mas explicitly for the plot.
The most straightforward way is to multiply CDELT values by 3.6e6 to convert from degrees to mas.
However, there is a more general approach which could be useful if you want to convert to different units at some point:
import astropy.units as u
w.wcs.cdelt = (w.wcs.cdelt * u.deg).to(u.mas)
So it basically says first that the units of CDELT are degrees and then converts them to mas.
The whole workflow is like this:
def make_transform(f):
'''use already read-in FITS file object f to build pixel-to-mas transformation'''
print("Making a transformation out of a FITS header")
w = WCS(f[0].header)
w = w.celestial
w.wcs.crval = [0, 0]
w.wcs.ctype = [ 'XOFFSET' , 'YOFFSET' ]
w.wcs.cunit = ['mas' , 'mas']
w.wcs.cdelt = (w.wcs.cdelt * u.deg).to(u.mas)
print(w.world_axis_units)
return w
def read_fits(file):
'''read fits file into object'''
try:
res = fits.open(file)
return res
except:
return None
def start_plot(i,df=None, w=None, xlim = [None, None], ylim=[None, None]):
'''starts a plot and returns fig,ax .
xlim, ylim - axes limits in mas
'''
# make a transformation
# Using a dataframe
if df is not None:
w = make_transform_df(df)
# using a header
if w is not None:
pass
# not making but using one from the arg list
else:
w = make_transform(i)
# print('In start_plot using the following transformation:\n {}'.format(w))
fig = plt.figure()
if w.naxis == 4:
ax = plt.subplot(projection = w, slices = ('x', 'y', 0 ,0 ))
elif w.naxis == 2:
ax = plt.subplot(projection = w)
# convert xlim, ylim to coordinates of BLC and TRC, perform transformation, then return back to xlim, ylim in pixels
if any(xlim) and any(ylim):
xlim_pix, ylim_pix = limits_mas2pix(xlim, ylim, w)
ax.set_xlim(xlim_pix)
ax.set_ylim(ylim_pix)
fig.add_axes(ax) # note that the axes have to be explicitly added to the figure
return fig, ax
rm = read_fits(file)
wr = make_transform(rm)
fig, ax = start_plot(RM, w=wr, xlim = xlim, ylim = ylim)
Then just plot to the axes ax with imshow or contours or whatever.
Of course, this piece of code could be reduced to meet your particular needs.

How to draw vertical average lines for overlapping histograms in a loop

I'm trying to draw with matplotlib two average vertical line for every overlapping histograms using a loop. I have managed to draw the first one, but I don't know how to draw the second one. I'm using two variables from a dataset to draw the histograms. One variable (feat) is categorical (0 - 1), and the other one (objective) is numerical. The code is the following:
for chas in df[feat].unique():
plt.hist(df.loc[df[feat] == chas, objective], bins = 15, alpha = 0.5, density = True, label = chas)
plt.axvline(df[objective].mean(), linestyle = 'dashed', linewidth = 2)
plt.title(objective)
plt.legend(loc = 'upper right')
I also have to add to the legend the mean and standard deviation values for each histogram.
How can I do it? Thank you in advance.
I recommend you using axes to plot your figure. Pls see code below and the artist tutorial here.
import numpy as np
import matplotlib.pyplot as plt
# Fixing random state for reproducibility
np.random.seed(19680801)
mu1, sigma1 = 100, 8
mu2, sigma2 = 150, 15
x1 = mu1 + sigma1 * np.random.randn(10000)
x2 = mu2 + sigma2 * np.random.randn(10000)
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
# the histogram of the data
lbs = ['a', 'b']
colors = ['r', 'g']
for i, x in enumerate([x1, x2]):
n, bins, patches = ax.hist(x, 50, density=True, facecolor=colors[i], alpha=0.75, label=lbs[i])
ax.axvline(bins.mean())
ax.legend()

Draw 3D Plot for Gensim model

I have trained my model using Gensim. I draw a 2D plot using PCA but it is not clear too much. I wanna change it to 3D with capable of zooming .my result is so dense.
from sklearn.decomposition import PCA
from matplotlib import pyplot
X=model[model.wv.vocab]
pca=PCA(n_components=2)
result=pca.fit_transform(X)
pyplot.scatter(result[:,0],result[:,1])
word=list(model.wv.most_similar('eden_lake'))
for i, word in enumerate(words):
pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
And the result:
it possible to do that?
The following function uses t-SNE instead of PCA for dimension reduction, but will generate a plot in two, three or both two and three dimensions (using subplots). Furthermore, it will color the topics for you so it's easier to distinguish them. Adding %matplotlib notebook to the start of a Jupyter notebook environment from anaconda will allow a 3d plot to be rotated and a 2d plot to be zoomed (don't do both versions at the same time with %matplotlib notebook).
The function is very long, with most of the code being for plot formatting, but produces a professional output.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import seaborn as sns
from gensim.models import LdaModel
from gensim import corpora
from sklearn.manifold import TSNE
# %matplotlib notebook # if in Jupyter for rotating and zooming
def LDA_tSNE_topics_vis(dimension='both',
corpus=None,
num_topics=10,
remove_3d_outliers=False,
save_png=False):
"""
Returns the outputs of an LDA model plotted using t-SNE (t-distributed Stochastic Neighbor Embedding)
Note: t-SNE reduces the dimensionality of a space such that similar points will be closer and dissimilar points farther
Parameters
----------
dimension : str (default=both)
The dimension that t-SNE should reduce the data to for visualization
Options: 2d, 3d, and both (a plot with two subplots)
corpus : list, list of lists
The tokenized and cleaned text corpus over which analysis should be done
num_topics : int (default=10)
The number of categories for LDA based approaches
remove_3d_outliers : bool (default=False)
Whether to remove outliers from a 3d plot
save_png : bool (default=False)
Whether to save the figure as a png
Returns
-------
A t-SNE lower dimensional representation of an LDA model's topics and their constituent members
"""
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
dirichlet_model = LdaModel(corpus=bow_corpus,
id2word=dirichlet_dict,
num_topics=num_topics,
update_every=1,
chunksize=len(bow_corpus),
passes=10,
alpha='auto',
random_state=42) # set for testing
df_topic_coherences = pd.DataFrame(columns = ['topic_{}'.format(i) for i in range(num_topics)])
for i in range(len(bow_corpus)):
df_topic_coherences.loc[i] = [0] * num_topics
output = dirichlet_model.__getitem__(bow=bow_corpus[i], eps=0)
for j in range(len(output)):
topic_num = output[j][0]
coherence = output[j][1]
df_topic_coherences.iloc[i, topic_num] = coherence
for i in range(num_topics):
df_topic_coherences.iloc[:, i] = df_topic_coherences.iloc[:, i].astype('float64', copy=False)
df_topic_coherences['main_topic'] = df_topic_coherences.iloc[:, :num_topics].idxmax(axis=1)
if num_topics > 10:
# cubehelix better for more than 10 colors
colors = sns.color_palette("cubehelix", num_topics)
else:
# The default sns color palette
colors = sns.color_palette('deep', num_topics)
tsne_2 = None
tsne_3 = None
if dimension == 'both':
tsne_2 = TSNE(n_components=2, perplexity=40, n_iter=300)
tsne_3 = TSNE(n_components=3, perplexity=40, n_iter=300)
elif dimension == '2d':
tsne_2 = TSNE(n_components=2, perplexity=40, n_iter=300)
elif dimension == '3d':
tsne_3 = TSNE(n_components=3, perplexity=40, n_iter=300)
else:
ValueError("An invalid value has been passed to the 'dimension' argument - choose from 2d, 3d, or both.")
if tsne_2 is not None:
tsne_results_2 = tsne_2.fit_transform(df_topic_coherences.iloc[:, :num_topics])
df_tsne_2 = pd.DataFrame()
df_tsne_2['tsne-2d-d1'] = tsne_results_2[:,0]
df_tsne_2['tsne-2d-d2'] = tsne_results_2[:,1]
df_tsne_2['main_topic'] = df_topic_coherences.iloc[:, num_topics]
df_tsne_2['color'] = [colors[int(t.split('_')[1])] for t in df_tsne_2['main_topic']]
df_tsne_2['topic_num'] = [int(i.split('_')[1]) for i in df_tsne_2['main_topic']]
df_tsne_2 = df_tsne_2.sort_values(['topic_num'], ascending = True).drop('topic_num', axis=1)
if tsne_3 is not None:
colors = [c for c in sns.color_palette()]
tsne_results_3 = tsne_3.fit_transform(df_topic_coherences.iloc[:, :num_topics])
df_tsne_3 = pd.DataFrame()
df_tsne_3['tsne-3d-d1'] = tsne_results_3[:,0]
df_tsne_3['tsne-3d-d2'] = tsne_results_3[:,1]
df_tsne_3['tsne-3d-d3'] = tsne_results_3[:,2]
df_tsne_3['main_topic'] = df_topic_coherences.iloc[:, num_topics]
df_tsne_3['color'] = [colors[int(t.split('_')[1])] for t in df_tsne_3['main_topic']]
df_tsne_3['topic_num'] = [int(i.split('_')[1]) for i in df_tsne_3['main_topic']]
df_tsne_3 = df_tsne_3.sort_values(['topic_num'], ascending = True).drop('topic_num', axis=1)
if remove_3d_outliers:
# Remove those rows with values that are more than three standard deviations from the column mean
for col in ['tsne-3d-d1', 'tsne-3d-d2', 'tsne-3d-d3']:
df_tsne_3 = df_tsne_3[np.abs(df_tsne_3[col] - df_tsne_3[col].mean()) <= (3 * df_tsne_3[col].std())]
if tsne_2 is not None and tsne_3 is not None:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, # pylint: disable=unused-variable
figsize=(20,10))
ax1.axis('off')
else:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20,10))
if tsne_2 is not None and tsne_3 is not None:
# Plot tsne_2, with tsne_3 being added later
ax1 = sns.scatterplot(data=df_tsne_2, x="tsne-2d-d1", y="tsne-2d-d2",
hue=df_topic_coherences.iloc[:, num_topics], alpha=0.3)
light_grey_tup = (242/256, 242/256, 242/256)
ax1.set_facecolor(light_grey_tup)
ax1.axes.set_title('t-SNE 2-Dimensional Representation', fontsize=25)
ax1.set_xlabel('tsne-d1', fontsize=20)
ax1.set_ylabel('tsne-d2', fontsize=20)
handles, labels = ax1.get_legend_handles_labels()
legend_order = list(np.argsort([i.split('_')[1] for i in labels]))
ax1.legend([handles[i] for i in legend_order], [labels[i] for i in legend_order],
facecolor=light_grey_tup)
elif tsne_2 is not None:
# Plot just tsne_2
ax = sns.scatterplot(data=df_tsne_2, x="tsne-2d-d1", y="tsne-2d-d2",
hue=df_topic_coherences.iloc[:, num_topics], alpha=0.3)
ax.set_facecolor(light_grey_tup)
ax.axes.set_title('t-SNE 2-Dimensional Representation', fontsize=25)
ax.set_xlabel('tsne-d1', fontsize=20)
ax.set_ylabel('tsne-d2', fontsize=20)
handles, labels = ax.get_legend_handles_labels()
legend_order = list(np.argsort([i.split('_')[1] for i in labels]))
ax.legend([handles[i] for i in legend_order], [labels[i] for i in legend_order],
facecolor=light_grey_tup)
if tsne_2 is not None and tsne_3 is not None:
# tsne_2 has been plotted, so add tsne_3
ax2 = fig.add_subplot(121, projection='3d')
ax2.scatter(xs=df_tsne_3['tsne-3d-d1'],
ys=df_tsne_3['tsne-3d-d2'],
zs=df_tsne_3['tsne-3d-d3'],
c=df_tsne_3['color'],
alpha=0.3)
ax2.set_facecolor('white')
ax2.axes.set_title('t-SNE 3-Dimensional Representation', fontsize=25)
ax2.set_xlabel('tsne-d1', fontsize=20)
ax2.set_ylabel('tsne-d2', fontsize=20)
ax2.set_zlabel('tsne-d3', fontsize=20)
with plt.rc_context({"lines.markeredgewidth" : 0}):
# Add handles via blank lines and order their colors to match tsne_2
proxy_handles = [Line2D([0], [0], linestyle="none", marker='o', markersize=8,
markerfacecolor=colors[i]) for i in legend_order]
ax2.legend(proxy_handles, ['topic_{}'.format(i) for i in range(num_topics)],
loc='upper left', facecolor=(light_grey_tup))
elif tsne_3 is not None:
# Plot just tsne_3
ax.axis('off')
ax.set_facecolor('white')
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xs=df_tsne_3['tsne-3d-d1'],
ys=df_tsne_3['tsne-3d-d2'],
zs=df_tsne_3['tsne-3d-d3'],
c=df_tsne_3['color'],
alpha=0.3)
ax.set_facecolor('white')
ax.axes.set_title('t-SNE 3-Dimensional Representation', fontsize=25)
ax.set_xlabel('tsne-d1', fontsize=20)
ax.set_ylabel('tsne-d2', fontsize=20)
ax.set_zlabel('tsne-d3', fontsize=20)
with plt.rc_context({"lines.markeredgewidth" : 0}):
# Add handles via blank lines
proxy_handles = [Line2D([0], [0], linestyle="none", marker='o', markersize=8,
markerfacecolor=colors[i]) for i in range(len(colors))]
ax.legend(proxy_handles, ['topic_{}'.format(i) for i in range(num_topics)],
loc='upper left', facecolor=light_grey_tup)
if save_png:
plt.savefig('LDA_tSNE_{}.png'.format(time.strftime("%Y%m%d-%H%M%S")), bbox_inches='tight', dpi=500)
plt.show()
An example plot for both 2d and 3d (with outliers removed) representations of a 10 topic gensim LDA model on subplots would be:
Yes, in principle it is possible to do 3D visualization of LDA model results. Here is more information about using T-SNE for that.

Issue when trying to plot after applying PCA on a dataset

I am trying to plot the results of PCA of the dataset pima-indians-diabetes.csv. My code shows a problem only in the plotting piece:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
Kindly, how to fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
(768, 1)
Which leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
as the first array is already multidimensional, therefore the error is indicating that you used too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of the shape (R, 1)and (R,) this question on StackOverflow provides a nice starting point.

Resources