issues storing and extracting arrays in numpy file - python-3.x

I am trying to store arrays in a numpy file; however, when I extract them and try to use them, I get an error about setting an array element with a sequence.
These are the two arrays; I'm unsure which one is causing the issue.
X = [[1,2,3],[4,5,6],[7,8,9]]
y = [0,1,2,3,4,5,6....]
When I retrieve them and try to use them, the values come back as:
X: array(list[1,2,3],list[4,5,6],list[7,8,9])
y = array([0,1,2,3,4,5...])
Here is the code:
vectors = np.array(X)
labels = np.array(y)
After retrieving them, I run t-SNE:
visualisations = TSNE(n_components=2).fit_transform(X,y)
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-11-244f99341167> in <module>()
----> 1 visualisations = TSNE(n_components=2).fit_transform(X,y)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in fit_transform(self, X, y)
856 Embedding of the training data in low-dimensional space.
857 """
--> 858 embedding = self._fit(X)
859 self.embedding_ = embedding
860 return self.embedding_
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\manifold\t_sne.py in _fit(self, X, skip_num_points)
658 else:
659 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'],
--> 660 dtype=[np.float32, np.float64])
661 if self.method == 'barnes_hut' and self.n_components > 3:
662 raise ValueError("'n_components' should be inferior to 4 for the "
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.

Assuming I understand you correctly, you need to package the first group in a list; something like this:
import numpy as np
#X = [[1,2,3],[4,5,6],[7,8,9]]
#y = [0,1,2,3,4,5,6, 7, 8, 9]
X = np.array([[1,2,3],[4,5,6],[7,8,9]])
y = np.array([0,1,2,3,4,5, 6, 7, 8, 9])

array(list[1,2,3],list[4,5,6],list[7,8,9])
is a 1d object dtype array. To get that from
[[1,2,3],[4,5,6],[7,8,9]]
requires more than np.array([[1,2,3],[4,5,6],[7,8,9]]); either the list elements have to vary in size, or you have to initialize an object array and copy the list values to it.
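A small illustration of both routes (the numbers are just examples, not from your data):
import numpy as np

# A ragged list (elements of different lengths) becomes a 1d object array:
ragged = np.array([[1, 2], [3, 4, 5]], dtype=object)
print(ragged.shape, ragged.dtype)        # (2,) object

# Equal-length lists only end up as objects if you build the array explicitly:
obj = np.empty(3, dtype=object)
for i, row in enumerate([[1, 2, 3], [4, 5, 6], [7, 8, 9]]):
    obj[i] = row
print(obj.shape, obj.dtype)              # (3,) object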
In any case, fit_transform cannot handle that kind of array; it expects a 2d array with a numeric dtype. Notice the dtype parameter passed to check_array in the traceback.
If all the list elements of X are the same size, then
X = np.stack(X)
should turn it into a 2d numeric array.
I suspect X was that 1d object array type before saving. By itself save/load should not turn a 2d numeric array into an object one.
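For reference, a minimal save/load round trip (the file name is just a placeholder) shows that a 2d numeric array survives intact, and how np.stack repairs things if you do end up with an object array:
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # 2d numeric array
np.save('X.npy', X)

X2 = np.load('X.npy')
print(X2.shape, X2.dtype)      # (3, 3) and an integer dtype -- still 2d numeric

# If a loaded array does come back as a 1d object array of equal-length lists
# (object arrays additionally need allow_pickle=True to load), np.stack
# rebuilds the 2d numeric layout that fit_transform expects:
# X2 = np.stack(X2)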

Related

sklearn DictVectorizer() throwing error with a dictionary as input

I'm fairly new to sklearn's DictVectorizer, and am trying to create a function where DictVectorizer will output feature names from a list of bigrams that I have used to form a feature dictionary. The input to my function is a string, and the function should return a list of bigrams formed into dictionaries (something like this).
def features(str) -> List[Dict[Text, Union[Text, int]]]:
    # my feature dictionary should have "bigram" as the key, and the values
    # will be the bigrams themselves, in the form "w[i]-w[i+1]"
    # This is my bigram list (as structured above)
    bigrams: List[Dict[Text, Union[Text, int]]] = []
    # here is my code:
    bigrams = {'bigram': i for j in sentence
               for i in zip(j.split(" ")[:-1], j.split(" ")[1:])}
    return bigrams
vect = DictVectorizer(sparse=False)
text = str()
feature_catalog = features(text)
vect.fit(feature_catalog)
print(sorted(vectorizer.get_feature_names_out()))
Everything works fine until the code advances to the DictVectorizer blocks (hidden in the class itself). This is what I get:
AttributeError Traceback (most recent call last)
/var/folders/pl/k80fpf9s4f9_3rp8hnpw5x0m0000gq/T/ipykernel_3804/266218402.py in <module>
22 features = get_feature(text)
23
---> 24 vectorizer.fit(features)
25
26 print(sorted(vectorizer.get_feature_names()))
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/feature_extraction/_dict_vectorizer.py in fit(self, X, y)
159
160 for x in X:
--> 161 for f, v in x.items():
162 if isinstance(v, str):
163 feature_name = "%s%s%s" % (f, self.separator, v)
AttributeError: 'str' object has no attribute 'items'
Any ideas? This is ultimately going to be used as part of a larger processing effort on a corpus.
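The traceback points at the return value: DictVectorizer.fit expects an iterable of dicts, and iterating over the single dict built by the comprehension yields its keys (strings), hence 'str' object has no attribute 'items'. A sketch of one possible restructuring, assuming each bigram is meant to become its own {'bigram': ...} dict (the sample sentence is made up):
from typing import Dict, List, Text, Union

from sklearn.feature_extraction import DictVectorizer


def features(sentence: Text) -> List[Dict[Text, Union[Text, int]]]:
    # One dictionary per bigram, each keyed by "bigram", value "w[i]-w[i+1]"
    words = sentence.split(" ")
    return [{"bigram": "{}-{}".format(w1, w2)}
            for w1, w2 in zip(words[:-1], words[1:])]


vect = DictVectorizer(sparse=False)
vect.fit(features("the quick brown fox"))
print(sorted(vect.get_feature_names_out()))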

I keep getting "TypeError: only integer scalar arrays can be converted to a scalar index" while using custom-defined metric in KNeighborsClassifier

I am using a custom-defined metric in SKlearn's KNeighborsClassifier. Here's my code:
def chi_squared(x, y):
    return np.divide(np.square(np.subtract(x, y)), np.sum(x, y))
The above function is an implementation of the chi-squared distance. I have used NumPy functions because, according to the scikit-learn docs, the metric function takes two one-dimensional numpy arrays.
I have passed the chi_squared function as an argument to KNeighborsClassifier().
knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
However, I keep getting the following error:
TypeError Traceback (most recent call last)
<ipython-input-29-d2a365ebb538> in <module>
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
----> 6 knn.fit(X_train, Y_train)
7 predictions = knn.predict(X_test)
8 print(accuracy_score(Y_test, predictions))
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_classification.py in fit(self, X, y)
177 The fitted k-nearest neighbors classifier.
178 """
--> 179 return self._fit(X, y)
180
181 def predict(self, X):
~/.local/lib/python3.8/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
497
498 if self._fit_method == 'ball_tree':
--> 499 self._tree = BallTree(X, self.leaf_size,
500 metric=self.effective_metric_,
501 **self.effective_metric_params_)
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.__init__()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree._recursive_build()
sklearn/neighbors/_ball_tree.pyx in sklearn.neighbors._ball_tree.init_node()
sklearn/neighbors/_binary_tree.pxi in sklearn.neighbors._ball_tree.BinaryTree.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.DistanceMetric.rdist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance.dist()
sklearn/neighbors/_dist_metrics.pyx in sklearn.neighbors._dist_metrics.PyFuncDistance._dist()
<ipython-input-29-d2a365ebb538> in chi_squared(x, y)
1 def chi_squared(x,y):
----> 2 return np.divide(np.square(np.subtract(x,y)), np.sum(x,y))
3
4
5 knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
<__array_function__ internals> in sum(*args, **kwargs)
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sum(a, axis, dtype, out, keepdims, initial, where)
2239 return res
2240
-> 2241 return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
2242 initial=initial, where=where)
2243
~/.local/lib/python3.8/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
85 return reduction(axis=axis, out=out, **passkwargs)
86
---> 87 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
88
89
TypeError: only integer scalar arrays can be converted to a scalar index
I can reproduce your error message with:
In [173]: x=np.arange(3); y=np.array([2,3,4])
In [174]: np.sum(x,y)
Traceback (most recent call last):
File "<ipython-input-174-1a1a267ebd82>", line 1, in <module>
np.sum(x,y)
File "<__array_function__ internals>", line 5, in sum
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 2247, in sum
return _wrapreduction(a, np.add, 'sum', axis, dtype, out, keepdims=keepdims,
File "/usr/local/lib/python3.8/dist-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
TypeError: only integer scalar arrays can be converted to a scalar index
Correct use(s) of np.sum:
In [175]: np.sum(x)
Out[175]: 3
In [177]: np.sum(np.arange(6).reshape(2,3), axis=0)
Out[177]: array([3, 5, 7])
In [178]: np.sum(np.arange(6).reshape(2,3), 0)
Out[178]: array([3, 5, 7])
(re)read the np.sum docs if necessary!
Using np.add instead of np.sum:
In [179]: np.add(x,y)
Out[179]: array([2, 4, 6])
In [180]: x+y
Out[180]: array([2, 4, 6])
The following should be equivalent:
np.divide(np.square(np.subtract(x,y)), np.add(x,y))
(x-y)**2/(x+y)
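For completeness, a callable metric in scikit-learn must return a single scalar distance per pair of vectors, so the element-wise ratio also needs to be summed. A sketch of what the metric might look like, assuming the usual chi-squared distance is intended (the guard against zero denominators is my addition, not something from the question):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def chi_squared(x, y):
    # sum of (x - y)**2 / (x + y); np.add replaces the misused np.sum call
    denom = np.maximum(np.add(x, y), 1e-10)   # avoid division by zero (assumption)
    return np.sum(np.square(np.subtract(x, y)) / denom)

knn = KNeighborsClassifier(algorithm='ball_tree', metric=chi_squared)
# knn.fit(X_train, Y_train) as in the question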

How do I correctly write the syntax for performing and plotting a for loop operation?

I am trying to create a for loop which uses a defined function (B_lambda) and takes in values of wavelength and temperature to produce values of intensity. That is, I want the loop to run B_lambda over every value in my wavelength range for each temperature in the temperature list, and then plot the results. I am not very good with the syntax and have tried many ways, but nothing produces what I need and I mostly get errors. I have no idea how to plot from a for loop, and the online sources I have checked have not helped me with using a defined function in a for loop. Here is my latest code, which seems to have the fewest errors, together with the error message:
import matplotlib.pylab as plt
import numpy as np
from astropy import units as u
import scipy.constants
%matplotlib inline
#Importing constants to use.
h = scipy.constants.h
c = scipy.constants.c
k = scipy.constants.k
wavelengths= np.arange(1000,30000)*1.e-10
temperature=[3000,4000,5000,6000]
for lam in wavelengths:
    for T in temperature:
        B_lambda = ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
        plt.figure()
        plt.plot(wavelengths, B_lambda)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-73b866241c49> in <module>
17 B_lambda = ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
18 plt.figure()
---> 19 plt.plot(wavelengths,B_lambda)
20
21
/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
2787 return gca().plot(
2788 *args, scalex=scalex, scaley=scaley, **({"data": data} if data
-> 2789 is not None else {}), **kwargs)
2790
2791
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1663 """
1664 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D._alias_map)
-> 1665 lines = [*self._get_lines(*args, data=data, **kwargs)]
1666 for line in lines:
1667 self.add_line(line)
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in __call__(self, *args, **kwargs)
223 this += args[0],
224 args = args[1:]
--> 225 yield from self._plot_args(this, kwargs)
226
227 def get_next_color(self):
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in _plot_args(self, tup, kwargs)
389 x, y = index_of(tup[-1])
390
--> 391 x, y = self._xy_from_xy(x, y)
392
393 if self.command == 'plot':
/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_base.py in _xy_from_xy(self, x, y)
268 if x.shape[0] != y.shape[0]:
269 raise ValueError("x and y must have same first dimension, but "
--> 270 "have shapes {} and {}".format(x.shape, y.shape))
271 if x.ndim > 2 or y.ndim > 2:
272 raise ValueError("x and y can be no greater than 2-D, but have "
ValueError: x and y must have same first dimension, but have shapes (29000,) and (1,)
First thing to note (and this is minor) is that astropy is not required to run your code. So, you can simplify the import statements.
import matplotlib.pylab as plt
import numpy as np
import scipy.constants
%matplotlib inline
#Importing constants to use.
h = scipy.constants.h
c = scipy.constants.c
k = scipy.constants.k
wavelengths= np.arange(1000,30000,100)*1.e-10 # here, I chose steps of 100, because plotting 29000 datapoints takes a while
temperature=[3000,4000,5000,6000]
Secondly, to tidy up the loop a bit, you can write a helper function that you call from within your loop:
def f(lam, T):
    return ((2*h*c**2)/(lam**5))*((1)/(np.exp((h*c)/(lam*k*T))-1))
now you can collect the output of your function, together with the input parameters, e.g. in a list of tuples:
outputs = []
for lam in wavelengths:
    for T in temperature:
        outputs.append((lam, T, f(lam, T)))
Since you vary both wavelength and temperature, a 3d plot makes sense:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.plot(*zip(*outputs))
An alternative would be to display the data as an image, using colour to indicate the function output.
I am also including an alternative method to generate the data here. Since the function f can take arrays as input, you can feed it one temperature at a time, together with all the wavelengths simultaneously.
# initialise output as array with proper shape
outputs = np.zeros((len(wavelengths), len(temperature)))
for i, T in enumerate(temperature):
    outputs[:, i] = f(wavelengths, T)
The output now is a large matrix, which you can visualise as an image:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(outputs, aspect=10e8, interpolation='none',
          extent=[np.min(temperature), np.max(temperature),
                  np.max(wavelengths), np.min(wavelengths)])
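If a conventional 2D line plot is preferred, the same outputs matrix can be sliced one temperature (one column) at a time; a short sketch reusing the arrays defined above:
fig, ax = plt.subplots()
for i, T in enumerate(temperature):
    # one curve per temperature, over the full wavelength range
    ax.plot(wavelengths, outputs[:, i], label='{} K'.format(T))
ax.set_xlabel('wavelength (m)')
ax.set_ylabel('B_lambda')
ax.legend()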

Don't understand error message (basic sklearn command)

I'm new to Python and programming in general, and I wanted to get a little practice with linear regression in one variable.
I'm currently following this tutorial:
https://www.youtube.com/watch?v=8jazNUpO3lQ&list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw&index=2
and I am doing exactly what he is doing.
I did, however, encounter an error when running the code, as shown below
(for simplicity, I marked the output with '--'; I used Jupyter Notebook).
At the end I encountered a long list of errors when running 'reg.predict(3300)'.
I don't understand what went wrong.
Can someone help me out?
Cheers!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
df = pd.read_csv("homeprices.csv")
df
--   area   price
  0  2600  550000
  1  3000  565000
  2  3200  610000
  3  3600  680000
  4  4000  725000
%matplotlib inline
plt.xlabel('area(sqr ft)')
plt.ylabel('price(US$)')
plt.scatter(df.area, df.price, color='red', marker = '+')
--<matplotlib.collections.PathCollection at 0x2e823ce66a0>
reg = linear_model.LinearRegression()
reg.fit(df[['area']],df.price)
--LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
reg.predict(3300)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-ad5a8409ff75> in <module>
----> 1 reg.predict(3300)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
211 Returns predicted values.
212 """
--> 213 return self._decision_function(X)
214
215 _preprocess_data = staticmethod(_preprocess_data)
~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
194 check_is_fitted(self, "coef_")
195
--> 196 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
197 return safe_sparse_dot(X, self.coef_.T,
198 dense_output=True) + self.intercept_
~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
543 "Reshape your data either using array.reshape(-1, 1) if "
544 "your data has a single feature or array.reshape(1, -1) "
--> 545 "if it contains a single sample.".format(array))
546 # If input is 1D raise error
547 if array.ndim == 1:
ValueError: Expected 2D array, got scalar array instead:
array=3300.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Try reg.predict([[3300]]). The API used to allow a scalar value, but now you need to give a 2D array.
reg.fit(df[['area']],df.price)
I think above we are fitting with a 2D array for X, so we need to use a 2D array for X in reg.predict, too. Hence,
reg.predict([[3300]])
"Expected 2D array, got scalar array instead" is stated right in the error message, so kindly change the call to:
reg.predict([[3300]])
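Equivalently, the scalar can be wrapped or reshaped into the 2D form the error message asks for; a small sketch of both spellings, using the reg fitted above:
import numpy as np

reg.predict([[3300]])                          # nested list: 1 sample, 1 feature
reg.predict(np.array([3300]).reshape(-1, 1))   # same shape via reshape(-1, 1)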

pass a list as argument of the func1d in numpy.apply_along_axis(func1d, axis, arr, *args, **kwargs)

I can't manage to pass a list as an argument to func1d in numpy.apply_along_axis(...).
import numpy as np

def test(a, value):
    print(value)
    return a

a = np.zeros((49), dtype=list)
kwargs = {"value": [1, 1, 1]}
zep = np.vectorize(test)
np.apply_along_axis(zep, 0, a, **kwargs)
Out:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/ibpc/osx/lbt/numpy/1.9.2/lib/python3.4/site-packages/nump/lib/shape_base.py", line 91, in apply_along_axis
res = func1d(arr[tuple(i.tolist())], *args, **kwargs)
File "/ibpc/osx/lbt/numpy/1.9.2/lib/python3.4/site-packages/numpy/lib/function_base.py", line 1700, in __call__
return self._vectorize_call(func=func, args=vargs)
File "/ibpc/osx/lbt/numpy/1.9.2/lib/python3.4/site-packages/numpy/lib/function_base.py", line 1769, in _vectorize_call
outputs = ufunc(*inputs)
ValueError: operands could not be broadcast together with shapes (49,) (3,)
So it wants len(kwargs["value"]) == 49, but that's not what I want.
I need to be able to change value as I go (during numpy.apply_along_axis(func1d) I need to update my list).
How can I pass a list as an argument? Or maybe there is another way to solve this problem.
In reality, I have a numpy array of lists of positions in 3D space for particles, like this:
import random as r
import numpy as np

dim = [49, 49, 49]
dx = 3
origin = [3, 3, 3]
nb_iter = 5
ntoto = np.load("ntoto.npy")
ntoto = ntoto.flatten()
liste_particles = np.zeros((5), dtype=list)
for i in range(len(liste_particles)):
    # nb_iter is just the number of iterations I want to do in calcTrajs
    liste_particles[i] = [[r.uniform(0, 150), r.uniform(0, 150), r.uniform(0, 150)]] * nb_iter
vtraj = np.vectorize(calcTrajs, otypes=[list])
np.apply_along_axis(vtraj, 0, liste_particles)
Here, I have five particles randomly placed. Moreover, I have another numpy array (shape == (49,49,49)) which contains a vector_field.
Here is the func1d that I need to run:
def calcTrajs(a):
    global ntoto, dim, dx, origin  # ntoto is my vector_field
    for b in range(1, len(a)):
        # s2g (space to grid) finds which grid cell my particle is in,
        # because my vector_field lives on a grid.
        ijk = s2g(a[b-1], dx, origin, dim)
        # value contains the vector that influences my particle.
        value = np.asarray(ntoto[flatten3Dto1D(ijk, dim[1], dim[2])])
        try:
            a[b] = list(a[b-1] + value * 1000)
        except:
            print("error")
            break
    return a
This function lets me launch a particle in my vector_field and calculate its trajectory.
As you can see, I use global variables, but I want to pass these variables as arguments instead. ntoto is a numpy array, dim is a list (the dimensions of my vector field), dx is the cell spacing (my vector_field is a grid of many cells, each containing a vector), and origin is the first point of my grid.
Best regards,
Adam
As I commented, neither vectorize nor apply_along_axis is a speed tool. vectorize can be useful for broadcasting several arrays against each other; apply_along_axis can be useful for iterating over more than 2 dimensions. With only one or two dimensions it is overkill. Both are tools that beginners often misuse.
It looks like the apply_along_axis part is ok, though I haven't tested it. The error lies in broadcasting in vectorize.
Especially since you are defining a with object dtype, you should specify a matching return dtype for vectorize (otypes); otherwise it performs a trial calculation to determine it.
In [223]: def test(a, value):
...: print(value)
...: return a
In [224]: zep = np.vectorize(test, otypes=['O'])
In [225]: a = np.array([[1,2,3],[4,5]])
In [226]: a
Out[226]: array([list([1, 2, 3]), list([4, 5])], dtype=object)
zep works with a and a scalar
In [227]: zep(a,1)
1
1
Out[227]: array([list([1, 2, 3]), list([4, 5])], dtype=object)
But when a has 2 items and value has 3 items, I get the same sort of error as you did:
In [228]: zep(a,[1,2,3])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-228-382aaa7a2dc6> in <module>()
----> 1 zep(a,[1,2,3])
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in __call__(self, *args, **kwargs)
2753 vargs.extend([kwargs[_n] for _n in names])
2754
-> 2755 return self._vectorize_call(func=func, args=vargs)
2756
2757 def _get_ufunc_and_otypes(self, func, args):
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in _vectorize_call(self, func, args)
2829 for a in args]
2830
-> 2831 outputs = ufunc(*inputs)
2832
2833 if ufunc.nout == 1:
ValueError: operands could not be broadcast together with shapes (2,) (3,)
(2,) and (2,) is fine:
In [229]: zep(a,['a','b'])
a
b
Out[229]: array([list([1, 2, 3]), list([4, 5])], dtype=object)
So is (2,) with (2,1), producing a (2,2) output. This is an example of the kind of broadcasting where vectorize can help.
In [230]: zep(a,[['a'],['b']])
a
a
b
b
Out[230]:
array([[list([1, 2, 3]), list([4, 5])],
[list([1, 2, 3]), list([4, 5])]], dtype=object)
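If the goal in the original question is simply to hand the whole list through to the function unchanged (rather than have it broadcast), np.vectorize also accepts an excluded argument naming parameters that should not be vectorized; a small sketch reusing test from above (the value list is just an example):
import numpy as np

def test(a, value):
    print(value)
    return a

zep = np.vectorize(test, otypes=['O'], excluded=['value'])
a = np.array([[1, 2, 3], [4, 5]], dtype=object)   # 1d object array of lists

# value is passed intact to every call instead of being broadcast;
# excluded-by-name parameters must be given as keyword arguments.
zep(a, value=[1, 1, 1])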
