Don't understand error message (basic sklearn command) - scikit-learn

I'm new to Python and programming in general and I wanted to exercise a littlebit with linear regression in one variable.
Im currently following this tutorial in the link
https://www.youtube.com/watch?v=8jazNUpO3lQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=2
and I am exactly doing what he is doing.
I did however encounter an error when compiling as shown in the code below
(for simplicity, I put '--' to places which is the output. I used Jupyter Notebook)
At the end I encounterd a long list of errors when trying to compile 'reg.predict(3300)'.
I don't understand what went wrong.
Can someone help me out?
Cheers!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df = pd.read_csv("homeprices.csv")
df
--area price
0 2600 550000
1 3000 565000
2 3200 610000
3 3600 680000
4 4000 725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('price(US$)')
plt.scatter(df.area, df.price, color='red', marker = '+')
--<matplotlib.collections.PathCollection at 0x2e823ce66a0>
reg = linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
--LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
reg.predict(3300)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-ad5a8409ff75> in <module>
----> 1 reg.predict(3300)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
211 Returns predicted values.
212 """
--> 213 return self._decision_function(X)
214
215 _preprocess_data = staticmethod(_preprocess_data)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
194 check_is_fitted(self, "coef_")
195
--> 196 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
197 return safe_sparse_dot(X, self.coef_.T,
198 dense_output=True) + self.intercept_
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
543 "Reshape your data either using array.reshape(-1, 1) if "
544 "your data has a single feature or array.reshape(1, -1) "
--> 545 "if it contains a single sample.".format(array))
546 # If input is 1D raise error
547 if array.ndim == 1:
ValueError: Expected 2D array, got scalar array instead:
array=3300.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Try reg.predict([[3300]]). The api used to allow scalar value but now you need to give 2D array

reg.fit(df[['area']],df.price)
I think above we are using 2 variables, so using 2D array to fit [X]. we need to use 2D array in reg.predict for [X],too. Hence,
reg.predict([[3300]])

Expected 2D array,got scalar array instead: this is written in the error explained box so
kindly change it to :
just wrote it like this
reg.predict([[3300]])

Related

How do I correctly write the syntax for performing and plotting a for loop operation?

I am trying to create a for loop which uses a defined function (B_lambda) and takes in values of wavelength and temperature to produce values of intensity. i.e. I want the loop to take the function B_lambda and to run through every value within my listed wavelength range for each temperature in the temperature list. Then I want to plot the results. I am not very good with the syntax and have tried many ways but nothing is producing what I need and I am mostly getting errors. I have no idea how to use a for loop to plot and all online sources that I have checked out have not helped me with using a defined function in a for loop. I will put my latest code that seems to have the least errors down below with the error message:
import matplotlib.pylab as plt
import numpy as np
from astropy import units as u
import scipy.constants
%matplotlib inline
#Importing constants to use.
h = scipy.constants.h
c = scipy.constants.c
k = scipy.constants.k
wavelengths= np.arange(1000,30000)*1.e-10
temperature=[3000,4000,5000,6000]
for lam in wavelengths:
for T in temperature:
B_lambda = ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
plt.figure()
plt.plot(wavelengths,B_lambda)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-73b866241c49> in <module>
17 B_lambda = ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
18 plt.figure()
---> 19 plt.plot(wavelengths,B_lambda)
20
21
/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2787 return gca().plot(
2788 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2789 is not None else {}), **kwargs)
2790
2791
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1663 """
1664 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
-> 1665 lines = [*self._get_lines(*args, data=data, **kwargs)]
1666 for line in lines:
1667 self.add_line(line)
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in __call__(self, *args, **kwargs)
223 this += args[0],
224 args = args[1:]
--> 225 yield from self._plot_args(this, kwargs)
226
227 def get_next_color(self):
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in _plot_args(self, tup, kwargs)
389 x, y = index_of(tup[-1])
390
--> 391 x, y = self._xy_from_xy(x, y)
392
393 if self.command == 'plot':
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in _xy_from_xy(self, x, y)
268 if x.shape[0] != y.shape[0]:
269 raise ValueError("x and y must have same first dimension, but "
--> 270 "have shapes {} and {}".format(x.shape, y.shape))
271 if x.ndim > 2 or y.ndim > 2:
272 raise ValueError("x and y can be no greater than 2-D, but have "
ValueError: x and y must have same first dimension, but have shapes (29000,) and (1,)```
First thing to note (and this is minor) is that astropy is not required to run your code. So, you can simplify the import statements.
import matplotlib.pylab as plt
import numpy as np
import scipy.constants
%matplotlib inline
#Importing constants to use.
h = scipy.constants.h
c = scipy.constants.c
k = scipy.constants.k
wavelengths= np.arange(1000,30000,100)*1.e-10 # here, I chose steps of 100, because plotting 29000 datapoints takes a while
temperature=[3000,4000,5000,6000]
Secondly, to tidy up the loop a bit, you can write a helper function, that youn call from within you loop:
def f(lam, T):
return ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
now you can collect the output of your function, together with the input parameters, e.g. in a list of tuples:
outputs = []
for lam in wavelengths:
for T in temperature:
outputs.append((lam, T, f(lam, T)))
Since you vary both wavelength and temperature, a 3d plot makes sense:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.plot(*zip(*outputs))
An alternative would be to display the data as an image, using colour to indicate the function output.
I am also including an alternative method to generate the data in this one. Since the function f can take arrays as input, you can feed one temperature at a time, and with it, all the wavelengths simultaneously.
# initialise output as array with proper shape
outputs = np.zeros((len(wavelengths), len(temperature)))
for i, T in enumerate(temperature):
outputs[:,i] = f(wavelengths, T)
The output now is a large matrix, which you can visualise as an image:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(outputs, aspect=10e8, interpolation='none',
extent=[
np.min(temperature),
np.max(temperature),
np.max(wavelengths),
np.min(wavelengths)]
)

How to do clustering with k-means algorithm for an imported data set with proper scaling of both axis

I m new to data science and python, and jupyter notebook, I m currently studying how to do k means clustering on a data set. I came across ways in which can introduce data
Data = {'x': [25,34,22,27,33,33,31,22,35,34,67,54,57,43,50,57,59,52,65,47,49,48,35,33,44,45,38,43,51,46],
'y': [79,51,53,78,59,74,73,57,69,75,51,32,40,47,53,36,35,58,59,50,25,20,14,12,20,5,29,27,8,7]
}
df = DataFrame(Data,columns=['x','y'])
and use of blobs
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.6, random_state=50)
but I would like to know how to do a proper code with a csv file imported from my computer and do a k means with scaling, thank you in advance, I could not find relevant blogs to help me
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
data=pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
data
data["alcohol"]=data["alcohol"]/data["alcohol"].max()
data["quality"]=data["quality"]/data["quality"].max()
plt.scatter(data["alcohol"],data['quality'])
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
x=data.copy()
kmeans=KMeans(2)
kmeans.fit(x)
clusters=x.copy()
clusters['cluster_pred']=kmeans.fit_predict(x)
plt.scatter(clusters["alcohol"],clusters['quality'],c=clusters['cluster_pred'],cmap='rainbow')
plt.xlabel("alcohol")
plt.ylabel('quality')
plt.show()
from sklearn import preprocessing
x_scaled=preprocessing.scale(x)
#x_scaled
wcss=[]
for i in range(1,30):
kmeans=KMeans(i)
kmeans.fit(x_scaled)
wcss.append(kmeans.inertia_)
wcss
plt.plot(range(1,30),wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
This is what i tried
the error i got
ValueError Traceback (most recent call last)
<ipython-input-12-d4955ce8615e> in <module>
39
40
---> 41 plt.plot(range(1,30),wcss)
42 plt.xlabel('Number of clusters')
43 plt.ylabel('WCSS')
~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2787 return gca().plot(
2788 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2789 is not None else {}), **kwargs)
2790
2791
~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1664 """
1665 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
-> 1666 lines = [*self._get_lines(*args, data=data, **kwargs)]
1667 for line in lines:
1668 self.add_line(line)
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in __call__(self, *args, **kwargs)
223 this += args[0],
224 args = args[1:]
--> 225 yield from self._plot_args(this, kwargs)
226
227 def get_next_color(self):
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _plot_args(self, tup, kwargs)
389 x, y = index_of(tup[-1])
390
--> 391 x, y = self._xy_from_xy(x, y)
392
393 if self.command == 'plot':
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _xy_from_xy(self, x, y)
268 if x.shape[0] != y.shape[0]:
269 raise ValueError("x and y must have same first dimension, but "
--> 270 "have shapes {} and {}".format(x.shape, y.shape))
271 if x.ndim > 2 or y.ndim > 2:
272 raise ValueError("x and y can be no greater than 2-D, but have "
ValueError: x and y must have same first dimension, but have shapes (29,) and (1,)
You can easily do by using scikit-Learn
import pandas as pd
data=pd.read_csv('myfile.csv')
df=pd.DataFrame(data,index=None)
df.head()
Check if rows contain any null values
df.isnull().sum()
Drop all the rows with null values if any
df_numeric.dropna(inplace=True)
Normalize data
Normalize the data with MinMax scaling provided by sklearn
from sklearn import preprocessing
minmax_processed = preprocessing.MinMaxScaler().fit_transform(df.drop('title',axis=1))
df_numeric_scaled = pd.DataFrame(minmax_processed, index=df.index, columns=df.columns[:-1])
df_numeric_scaled.head()
from sklearn.cluster import KMeans
Apply K-Means Clustering
What k to choose?
Let's fit cluster size 1 to 20 on our data and take a look at the corresponding score value.
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(df_numeric_scaled).score(df_numeric_scaled) for i in range(len(kmeans))]
These score values signify how far our observations are from the cluster center. We want to keep this score value around 0. A large positive or a large negative value would indicate that the cluster center is far from the observations.
Based on these scores value, we plot an Elbow curve to decide which cluster size is optimal. Note that we are dealing with tradeoff between cluster size(hence the computation required) and the relative accuracy.
import matplotlib as pl
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
Fit K-Means for clustering with k=5
kmeans = KMeans(n_clusters=5)
kmeans.fit(df_numeric_scaled)
df['cluster'] = kmeans.labels_
df.head()

ValueError: Number of priors must match number of classes

I want to compile my python3 code on ubuntu, and also want to know about the problem, such that i can handle that in future.
It seems there is some problem with the imported library function.
## sample code
1 import numpy as np
2 x = np.array([[-1,-1],[-2,-1],[-3,-2],[1,1],[2,1],[3,2]])
3 y = np.array([1,1,1,2,2,2])
4 from sklearn.naive_bayes import GaussianNB
5 clf = GaussianNB(x, y)
6 clf = clf.fit(x,y) ###showing error on compiling
7 print(clf.predict([[-2,1]]))
## output shown
Traceback (most recent call last):
File "naive.py", line 7, in <module>
clf = clf.fit(x,y)
File "/home/abhihsek/.local/lib/python3.6/site-
packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/abhihsek/.local/lib/python3.6/site-
packages/sklearn/naive_bayes.py", line 371, in _partial_fit
raise ValueError('Number of priors must match number of'
ValueError: Number of priors must match number of classes.
## code of library function line 192
190 X, y = check_X_y(X, y)
191 return self._partial_fit(X, y, np.unique(y),
_refit=True,
192
sample_weight=sample_weight)
## code of library function line 371
369 # Check that the provide prior match the number of classes
370 if len(priors) != n_classes:
371 raise ValueError('Number of priors must
match
number of'
372 ' classes.')
373 # Check that the sum is 1
As #Suvan Pandey mentioned, then the code won't give any error when writing clf = GaussianNB() instead of clf = GaussianNB(x, y).
If we look at the GaussianNB class then the __init__() can take these parameters:
def __init__(self, priors=None, var_smoothing=1e-9): # <-- these have a default value
self.priors = priors
self.var_smoothing = var_smoothing
The documentation about the two parameters:
priors – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
var_smoothing – Portion of the largest variance of all features that is added to variances for calculation stability.
As your x and y variables both return an array object then they don't fit the parameters of the __init__(...).

issues storing and extracting arrays in numpy file

Trying to store an array in numpy file however, while trying to extract it, and use it, getting an error message as trying to apply array to a sequence.
These are the two arrays, unsure which one is causing the issue.
X = [[1,2,3],[4,5,6],[7,8,9]]
y = [0,1,2,3,4,5,6....]
while trying to retrieve it and use it getting the values as:
X: array(list[1,2,3],list[4,5,6],list[7,8,9])
y = array([0,1,2,3,4,5...])
Here is the code:
vectors = np.array(X)
labels = np.array(y)
While retrieving working on t-sne
visualisations = TSNE(n_components=2).fit_transform(X,y)
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-11-244f99341167> in <module>()
----> 1 visualisations = TSNE(n_components=2).fit_transform(X,y)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in fit_transform(self, X, y)
856 Embedding of the training data in low-dimensional space.
857 """
--> 858 embedding = self._fit(X)
859 self.embedding_ = embedding
860 return self.embedding_
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in _fit(self, X, skip_num_points)
658 else:
659 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
--> 660 dtype=[np.float32, np.float64])
661 if self.method == 'barnes_hut' and self.n_components > 3:
662 raise ValueError("'n_components' should be inferior to 4 for the "
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
Assuming I understand you correctly you need to package the first group in a list; something like this:
import numpy as np
#X = [[1,2,3],[4,5,6],[7,8,9]]
#y = [0,1,2,3,4,5,6, 7, 8, 9]
X = np.array([[1,2,3],[4,5,6],[7,8,9]])
y = np.array([0,1,2,3,4,5, 6, 7, 8, 9])
array(list[1,2,3],list[4,5,6],list[7,8,9])
is a 1d object dtype array. To get that from
[[1,2,3],[4,5,6],[7,8,9]]
requires more than np.array([[1,2,3],[4,5,6],[7,8,9]]); either the list elements have to vary in size, or you have to initialize an object array and copy the list values to it.
In any case fit_transform cannot handle that kind of array. It expects a 2d numeric dtype. Notice the parameters to the check_array function.
If all the list elements of X are the same size, then
X = np.stack(X)
should turn it into a 2d numeric array.
I suspect X was that 1d object array type before saving. By itself save/load should not turn a 2d numeric array into an object one.

Cosine similarity with word2vec

I load a word2vec-format file and I want to calculate the similarities between vectors, but I don't know what this issue means.
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import KeyedVectors
import numpy as np
model = KeyedVectors.load_word2vec_format('it-vectors.100.5.50.w2v')
similarities = cosine_similarity(model.vectors)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-54-1d4e62f55ebf> in <module>()
----> 1 similarities = cosine_similarity(model.vectors)
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
923 Y_normalized = normalize(Y, copy=True)
924
--> 925 K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
926
927 return K
/usr/local/lib/python3.5/dist-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
What it means?
Thank you!
MemoryError means there's not enough memory to complete the operation.
How many vectors are in your 'it-vectors.100.5.50.w2v' set?
Note that cosine_similarity() creates an (n x n) results matrix. So if you have 100,000 vectors in your set, you'll need a results array of size:
100,000^2 * 4 bytes/float = 40GB
Do you have that much addressable memory?

Resources