ValueError after MinMaxScaler and Transform - python-3.x

I am running into a ValueError with the code below, and the solutions I have tried online have not helped.
Here's my original code, which raises a could-not-convert-string-to-float error
(ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

training_data_df = pd.read_csv('./data/sales_data_training.csv')

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_training = scaler.fit_transform(training_data_df)
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
My CSV Data:
"critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price"
"3.5,1,0,1,0,1,0,0,132717,59.99"
"4.5,0,0,0,0,1,1,0,83407,49.99"...
'3,1,0,0,0,1,0,1,89874,49.99'
I have 10 columns of data across 1,000 rows (roughly 10,000 values, with the first row being the header).
Regards,
Yuki
The full error is as follows:
Traceback (most recent call last):
File "C:/Users/YukiKawaii/PycharmProjects/PandasTest/module2_NN/test.py", line 6, in <module>
scaled_training= scaler.fit_transform(training_data_df)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
return self.partial_fit(X, y)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "C:\Users\YukiKawaii\Python\Python35\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '3,1,0,0,0,1,0,1,89874,49.99'

You should remove the "" and '' wrapped around each line in the CSV file.
When a whole line is wrapped in quotes, pd.read_csv() treats it as a single quoted field instead of splitting it on commas, so every row ends up as one long string that cannot be converted to a float.
The CSV file should therefore look as follows.
critic_rating,is_action,is_exclusive_to_us,is_portable,is_role_playing,is_sequel,is_sports,suitable_for_kids,total_earnings,unit_price
3.5,1,0,1,0,1,0,0,132717,59.99
4.5,0,0,0,0,1,1,0,83407,49.99
3,1,0,0,0,1,0,1,89874,49.99
I just verified by running your code after making the above change.
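If editing the file by hand is impractical, here is a minimal sketch of the same fix done programmatically (the cleaned-file path is made up for illustration): strip the wrapping quotes once, write the cleaned rows out, and read that file normally.
import pandas as pd

# Remove the stray "" / '' wrapped around each line of the original file
with open('./data/sales_data_training.csv') as f:
    lines = [line.strip().strip('"\'') for line in f]

# Write to a new file so the original stays untouched (hypothetical path)
with open('./data/sales_data_training_clean.csv', 'w') as f:
    f.write('\n'.join(lines))

training_data_df = pd.read_csv('./data/sales_data_training_clean.csv')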

Related

Copy-pasted a Naive Bayes example into VS Code but got errors

I copied the code from DataCamp to try Naive Bayes classification on my own in Python 3.8, but when I run the code I get this error:
Traceback (most recent call last):
File "c:\Users\USER\Desktop\DATA MINING\NaiveTest.py", line 34, in <module>
model.fit(features,label)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\naive_bayes.py", line 207, in fit
X, y = self._validate_data(X, y)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\base.py", line 433, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 814, in check_X_y
X = check_array(X, accept_sparse=accept_sparse,
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\USER\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\validation.py", line 630, in check_array
raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=<zip object at 0x0F2C4C28>.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I am posting the whole code because I'm not sure which part causes this.
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

# Import LabelEncoder
from sklearn import preprocessing

# Creating labelEncoder
le = preprocessing.LabelEncoder()

# Converting string labels into numbers
weather_encoded = le.fit_transform(weather)
print(weather_encoded)
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
print("Temp:", temp_encoded)
print("Play:", label)

# Combining weather and temp into a single list of tuples
features = zip(weather_encoded, temp_encoded)
print(list(zip(weather_encoded, temp_encoded)))
print([i for i in zip(weather_encoded, temp_encoded)])

from sklearn.naive_bayes import GaussianNB

# Create a Gaussian classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(features, label)

# Predict output
predicted = model.predict([[0,2]])  # 0: Overcast, 2: Mild
print("Predicted Value:", predicted)
Supposedly the result should be something like Predicted Value: [1], but it gave the error above instead.
The problem is that features needs to be a list (or array) to be passed to model.fit; currently it is a zip object:
# Combining weather and temp into a single list of tuples
features = zip(weather_encoded, temp_encoded)
You need to convert features to a list, e.g.
# Combining weather and temp into a single list of tuples
features = list(zip(weather_encoded, temp_encoded))
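The underlying reason, for the record: in Python 3, zip() returns a one-shot lazy iterator with no length or shape, so sklearn's input validation cannot interpret it as a 2D array. A minimal demonstration:
pairs = zip([0, 1], [2, 3])
print(list(pairs))  # [(0, 2), (1, 3)]
print(list(pairs))  # [] -- the iterator is already exhausted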

KeyError: '[...] not in index' occurs when train/test sets are split manually into two files

I get the error KeyError: '[...] not in index' when using an sklearn hyperopt regression example on my dataset.
I have seen other answers to this problem where the solution was, e.g., that X_train should be set with X_train = X.iloc[train_indices], and the lack of iloc usage was the issue. But in my case I have manually split my dataset into two files, so I don't need to do any slicing or indexing. I used a different script to take a big data set and split it into a train set file and a test set file. These files have no index columns and are purely numeric. If you are wondering about the data set, it is from UCI and called the protein physicochemical dataset (CASP).
from hpsklearn import HyperoptEstimator, any_regressor, xgboost_regression
from hyperopt import tpe
import numpy as np
import pandas as pd

# Load the manually split training and test sets
X_train = pd.read_csv('data2/CASP_train.csv')
X_test = pd.read_csv('data2/CASP_test.csv')
y_train = X_train['Y']
y_test = X_test['Y']
X_train.drop('Y', axis=1, inplace=True)
X_test.drop('Y', axis=1, inplace=True)
print(list(X_test))
#X_train.drop(list(X_train)[0],axis=1,inplace=True)
#X_test.drop(list(X_test)[0],axis=1,inplace=True)
print(list(X_test))
print(X_train)

# Instantiate a HyperoptEstimator with the search space and number of evaluations
estim = HyperoptEstimator(regressor=xgboost_regression('xgreg'),
                          preprocessing=('my_pre'),
                          algo=tpe.suggest,
                          max_evals=100,
                          trial_timeout=120)

estim.fit(X_train, y_train)
print(estim.score(X_test, y_test))
print(estim.best_model())
The full traceback is as follows:
Traceback (most recent call last):
File "PRSAXGB.py", line 30, in <module>
estim.fit(X_train, y_train)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 783, in fit
fit_iter.send(increment)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 693, in fit_iter
return_argmin=False, # -- in case no success so far
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 389, in fmin
show_progressbar=show_progressbar,
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 643, in fmin
show_progressbar=show_progressbar)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 408, in fmin
rval.exhaust()
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 262, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 227, in run
self.serial_evaluate()
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 141, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 848, in evaluate
rval = self.fn(pyll_rval)
File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 656, in fn_with_timeout
raise fn_rval[1]
KeyError: '[ 0 1 2 ... 29264 29265 29266] not in index'
The solution was to pass the underlying numpy arrays instead of the DataFrames:
estim.fit(X_train.values, y_train.values)
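A likely explanation, though the answer does not spell it out (so treat this as an assumption): hpsklearn splits the data internally with positional indexing such as X[indices], which selects rows on a numpy array but is treated as a column lookup on a DataFrame, producing exactly this kind of 'not in index' KeyError. A small sketch of the difference:
import numpy as np
import pandas as pd

X = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
idx = np.array([0, 2])
print(X.values[idx])  # selects rows 0 and 2 of the underlying array, as intended
# X[idx]              # on the DataFrame this looks up columns 0 and 2 -> KeyError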

Problem in reshaping numpy array of large size

I have a numpy array with the following dimensions:
(1611216, 2)
I tried reshaping it to (804, 2004) using:
df = np.reshape(df, (804, 2004))
but it gives an error:
Traceback (most recent call last):
File "Z:/Seismic/Geophysical/99_Personal/Abhishake/RMS_Machine_learning/RMS_data_analysis.py", line 19, in <module>
df = np.reshape(df, (804, 2004))
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 232, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "C:\python36\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 3222432 into shape (804,2004)
You cannot reshape a (1611216, 2) numpy array into (804, 2004).
A reshape must preserve the total number of elements: 1611216 x 2 = 3,222,432, while 804 x 2004 = 1,611,216, exactly half as many. You have to come up with another set of dimensions whose product is 3,222,432, and which dimensions make sense depends on how you want to use your array.
Hint: (1608, 2004) is a valid reshape, since 1608 x 2004 = 3,222,432.
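A quick sanity check, as a minimal sketch with a dummy array of the same shape:
import numpy as np

df = np.zeros((1611216, 2))
print(df.size)                             # 3222432 elements in total
print(np.reshape(df, (1608, 2004)).shape)  # (1608, 2004) -- element count preserved
# np.reshape(df, (804, 2004))              # ValueError: (804, 2004) only holds 1,611,216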

Why am I wrong to concatenate a matrix and a vector?

This is my code in Theano:
max_max = 200
beReplaced = T.matrix()
toReplace = T.matrix()
timeArray = T.arange(max_max)

def f(v, k, w):
    return T.concatenate([w[:k], v, w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=[toReplace, timeArray],
                        outputs_info=beReplaced)
What I am trying to do is replace beReplaced with toReplace line by line. The way I do it is by concatenating the upper part of w, v, and the lower part of w, where v is a row of toReplace.
Here is the error report
Traceback (most recent call last):
File "/Users/qiansteven/Desktop/NLP/RNN/my.py", line 20, in <module>
outputs_info=np.zeros((5,5),dtype=np.float64))
File "/usr/local/lib/python2.7/site-packages/theano/scan_module/scan.py", line 745, in scan
condition, outputs, updates = scan_utils.get_updates_and_outputs(fn(*args))
File "/Users/qiansteven/Desktop/NLP/RNN/my.py", line 16, in f
return T.concatenate([a,b,c],axis=0)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 4225, in concatenate
return join(axis, *tensor_list)
File "/usr/local/lib/python2.7/site-packages/theano/gof/op.py", line 611, in __call__
node = self.make_node(*inputs, **kwargs)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 3750, in make_node
axis, tensors, as_tensor_variable_args, output_maker)
File "/usr/local/lib/python2.7/site-packages/theano/tensor/basic.py", line 3816, in _make_node_internal
raise TypeError("Join() can only join tensors with the same "
TypeError: Join() can only join tensors with the same number of dimensions.
What's wrong?
Put toReplace into non_sequences; otherwise each timestep only takes a slice of it, and Theano reports an error when it tries to concatenate a vector with a matrix.
def f(k, w, v):  # NOTE the argument order change
    return T.concatenate([w[:k], v, w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=timeArray,
                        outputs_info=beReplaced,
                        non_sequences=toReplace)
The solution is to concatenate v.dimshuffle('x', 0) instead of v, which solves the dimension problem.
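For reference, a minimal sketch of that dimshuffle fix applied to the original scan, keeping toReplace as a sequence: each v arrives as a 1-D row, and dimshuffle('x', 0) adds a leading broadcastable axis so it becomes a 1 x n matrix that can be joined with the 2-D slices of w.
def f(v, k, w):
    # v is a 1-D row of toReplace; promote it to shape (1, n) before joining
    return T.concatenate([w[:k], v.dimshuffle('x', 0), w[k+1:]], axis=0)

result, _ = theano.scan(f,
                        sequences=[toReplace, timeArray],
                        outputs_info=beReplaced)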

Strange "reindexing error" converting Series to DataFrame

I have two Series objects, which from my perspective look exactly the same, except they contain different data. I have attempted to convert them to DataFrames and to put them both in the same DataFrame as separate columns. For some reason I cannot fathom, one of the Series will be converted happily to a DataFrame and the other one refuses to be converted when placed in a container (list or dict). I get a reindexing error, but there are no duplicates in the index of either Series.
import pickle
import pandas as pd
s1 = pickle.load(open('s1.p', 'rb'))
s2 = pickle.load(open('s2.p', 'rb'))
print(s1.head(10))
print(s2.head(10))
pd.DataFrame(s1) # <--- works fine
pd.DataFrame(s2) # <--- works fine
pd.DataFrame([s1]) # <--- works fine
# pd.DataFrame([s2]) # <--- doesn't work
# pd.DataFrame([s1, s2]) # <--- doesn't work
pd.DataFrame({s1.name: s1}) # <--- works fine
pd.DataFrame({s2.name: s2}) # <--- works fine
pd.DataFrame({s1.name: s1, s2.name: s1}) # <--- works fine
# pd.DataFrame({s1.name: s1, s2.name: s2}) # <--- doesn't work
Here is the output. Although you can't see it here, there is overlap between the index values; they are just in a different order. I want the indexes to be matched up when I combine the two Series into the same DataFrame.
id
801120 42.01
801138 50.18
801139 50.01
802101 53.77
802110 56.52
802112 47.37
802113 46.52
802114 46.58
802115 42.59
802117 40.85
Name: age, dtype: float64
id
A32067 0.39083
A32195 0.28506
A01685 0.36432
A11124 0.55649
A32020 0.41524
A32021 0.43788
A32098 0.49206
A00699 0.37515
A32158 0.58793
A14139 0.47413
Name: lh_vtx_000001, dtype: float64
Traceback when the final line is uncommented:
Traceback (most recent call last):
File "/Users/sm2286/Documents/Vertex/test.py", line 18, in <module>
pd.DataFrame({s1.name: s1, s2.name: s2}) # <--- doesn't work
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 224, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 360, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5236, in _arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5534, in _homogenize
v = v.reindex(index, copy=False)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2287, in reindex
return super(Series, self).reindex(index=index, **kwargs)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2229, in reindex
fill_value, copy).__finalize__(self)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2247, in _reindex_axes
copy=copy, allow_dups=False)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 2341, in _reindex_with_indexers
copy=copy)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 3586, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py", line 2293, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Traceback when line 13 is uncommented:
Traceback (most recent call last):
File "/Users/sm2286/Documents/Vertex/test.py", line 13, in <module>
pd.DataFrame([s2]) # <--- doesn't work
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 263, in __init__
arrays, columns = _to_arrays(data, columns, dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5359, in _to_arrays
dtype=dtype)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 5453, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
File "/Users/sm2286/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py", line 2082, in get_indexer
raise InvalidIndexError('Reindexing only valid with uniquely'
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
After more investigation, the difference between the Series was that the latter contained missing values. Removing them fixed the issue.
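That fix is consistent with how pandas treats NaN index labels, assuming the missing values were in the index (the answer does not say where they were): repeated NaN entries make an index non-unique, which is exactly what both errors above complain about. A minimal sketch:
import numpy as np
import pandas as pd

idx = pd.Index(['A32067', np.nan, np.nan])
print(idx.is_unique)     # False -- repeated NaN labels count as duplicates
print(idx.duplicated())  # [False False  True]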
