First, some context: I'm running Ubuntu and using Anaconda for my kernels in Jupyter Notebook.
I keep getting the error below. I believe it means I'm running out of memory, but I'm not sure how to solve it. Someone suggested uninstalling 32-bit Python 3 and installing 64-bit Python 3, but this broke my Ubuntu installation immediately after uninstalling Python (whoops) and I had to reinstall.
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-25-8b765500013f> in <module>()
----> 1 _, S, _ = np.linalg.svd(features)
2 print("Condition number from the SVD: {0:.1f}".format(np.max(S)/np.min(S)))
3 print("Condition number from cond: {0:.1f}".format(np.linalg.cond(features)))
/home/n/anaconda/lib/python3.5/site-packages/numpy/linalg/linalg.py in svd(a, full_matrices, compute_uv)
1387
1388 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1389 u, s, vt = gufunc(a, signature=signature, extobj=extobj)
1390 u = u.astype(result_t, copy=False)
1391 s = s.astype(_realType(result_t), copy=False)
MemoryError:
I'm trying to process a dataset of shape (91946, 171), all numerical, using the code below.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
pd.set_option('display.max_columns',180)
dfk = pd.read_csv('data/kick.csv')
response = dfk.state
features = dfk
features.drop('state', axis=1, inplace=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, response,
random_state=321)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
ypred_lr = logreg.predict(X_test)
print("Accuracy on the test set: {0:.3f}".format(accuracy_score(ypred_lr, y_test)))
_, S, _ = np.linalg.svd(features) #fails here
print("Condition number from the SVD: {0:.1f}".format(np.max(S)/np.min(S)))
print("Condition number from cond: {0:.1f}".format(np.linalg.cond(features)))
I am trying to balance the classes in my dataset, but I am getting an error when I apply the SVMSMOTE algorithm. The full error is below; I am hoping someone can help me figure out where I am going wrong.
ValueError Traceback (most recent call last)
<ipython-input-150-d907dac94024> in <module>()
2 svmsmote = SVMSMOTE(random_state = 101)
3
----> 4 X, y = svmsmote.fit_resample(X,y)
5
6 sns.countplot(y)
5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
806 "Found array with %d sample(s) (shape=%s) while a"
807 " minimum of %d is required%s."
--> 808 % (n_samples, array.shape, ensure_min_samples, context)
809 )
810
ValueError: Found array with 0 sample(s) (shape=(0, 18)) while a minimum of 1 is required.
For everyone else's sake, I am including the code that is producing the error.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn import preprocessing
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv("/content/drive/My Drive/IoT-23 _Final_Zeros.csv")
#dataset = pd.read_csv("/content/drive/My Drive/CYB509 Project/IoT-23 - 100 - Rows.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
le = preprocessing.LabelEncoder()
for i in range(len(X[0])):
    X[:, i] = le.fit_transform(X[:, i])
y = le.fit_transform(y)
import seaborn as sns
sns.countplot(y='Labels', data=dataset)
from imblearn.over_sampling import SVMSMOTE
svmsmote = SVMSMOTE(random_state = 101)
X, y = svmsmote.fit_resample(X,y) #error happens here.
sns.countplot(y)
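The traceback says the array reaching fit_resample already has shape (0, 18): X is empty before SVMSMOTE runs, so the resampler has nothing to work with. A minimal sanity check, assuming the same variable names as above, is to verify the CSV actually loaded rows and that no earlier cell emptied X:
# Confirm the CSV loaded data and that X/y still hold rows before resampling.
print(dataset.shape)      # expect (n_rows, n_columns) with n_rows > 0
print(X.shape, y.shape)   # first dimensions should match and not be 0
assert X.shape[0] > 0, "X is empty: check the CSV path/contents and earlier cells"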
Here is the code for eliminating features where I am getting value errors. I want to use recursive feature elimination (RFE) without specifying the number of features: the model should automatically eliminate weak features on each iteration, but I have been unable to make this work. Here is the link to the dataset: https://drive.google.com/file/d/1neYnunu6a_Mdn3NfRZsF8wE4gwMCpjAY/view?usp=sharing. I would be grateful for any suggestions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn.metrics import classification_report
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# df was not defined in the original post; the path below is a placeholder for the linked CSV.
df = pd.read_csv('/content/drive/My Drive/your_dataset.csv')
df.keys()
x=pd.DataFrame(df)
x.head()
X = df.drop(["Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
X_train.shape,X_test.shape
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am getting a ValueError in this part.
Then I tried dropping the string columns as well:
X = df.drop(["Dst_IP","Timestamp","Flow_ID","Src_IP","Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am still getting the error:
ValueError Traceback (most recent call last)
in ()
1 sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
----> 2 sel.fit(X_train,y_train)
3 sel.get_support()
3 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: could not convert string to float: 'Anomaly'
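The ValueError means X still contains string columns (such as a label like 'Anomaly') when the random forest tries to cast it to floats. A sketch of one way to find and encode them, assuming df is the loaded dataset:
# List the columns that still hold strings; these are what the float cast chokes on.
obj_cols = df.drop(["Sub_Cat"], axis=1).select_dtypes(include="object").columns
print(list(obj_cols))

# One-hot encode the remaining string columns so every feature is numeric.
X = pd.get_dummies(df.drop(["Sub_Cat"], axis=1), columns=list(obj_cols)).values
y = df["Sub_Cat"].values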
I am trying to use LinearRegression on a data set in Python 3. I want to see the influence of Order Size on the metric OTIF (On Time In Full), which is the percentage of deliveries delivered on time and in full. I get an error when I try to use LinearRegression.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# path of data
path = 'c:/Data/ame.csv'
df = pd.read_csv(path)
df.head()
from sklearn.linear_model import LinearRegression
lm = LinearRegression
lm
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
Yhat[0:5]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-b4b21bd8b84e> in <module>
----> 1 Yhat=lm.predict(X)
2 Yhat[0:5]
TypeError: predict() missing 1 required positional argument: 'X'
I think the issue is that you are not creating a LinearRegression object: without the parentheses, lm is bound to the class itself, so lm.predict(X) passes X as self and the method still expects its X argument. You must call the constructor to get an instance of the class. Try this:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
I know that in Python 3 .has_key is replaced by in, but in my example I didn't manage to make it work.
The whole code for execution:
from sklearn import model_selection
import pandas as pd
import numpy as np
from sklearn import neighbors, metrics
from matplotlib import pyplot as plt
data = pd.read_csv('your_path/winequality-red.csv', sep=";")
X = data.as_matrix([data.columns[:-1]])
y = data.as_matrix([data.columns[-1]])
y.flatten()
X_train, X_test, y_train, y_test = \
model_selection.train_test_split(X,y, test_size=0.3)
knn= neighbors.KNeighborsRegressor(n_neighbors = 12)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
The part which returns the error:
sizes = {}
for (yt, yp) in zip(list(y_test), list(y_pred)):
    if sizes.has_key((yt, yp)):
        sizes[(yt, yp)] += 1
    else:
        sizes[(yt, yp)] = 1
keys = sizes.keys()
plt.scatter([k[0] for k in keys], [k[1] for k in keys], s=[sizes[k] for k in keys], color='coral')
When I try to swap if sizes.has_key((yt, yp)): for if (yt, yp) in sizes:
I get the error: TypeError: unhashable type: 'numpy.ndarray'
Download the wine database.
Thanks in advance for any help.
The result I'm looking for: a scatter plot with point sizes proportional to the counts.
Here are the .ipynb / .py files.
I don't think the code you show can actually produce the error you report. Possibly you have redefined some variable in the notebook outside of that code?
In any case, concerning the question, you would want to replace if sizes.has_key((yt, yp)): by
if (yt, yp) in sizes.keys():
This should give you the desired plot.
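If the TypeError persists, one cause visible in the posted code is that y.flatten() on its own does nothing: flatten() returns a new array, so y (and therefore y_test) stays 2-D and each yt is a one-element ndarray, which cannot be hashed inside a tuple key. A minimal sketch of the counting loop with scalar keys, assuming the same variables:
# Before the train/test split: flatten() returns a copy, so assign it back.
y = y.flatten()

# Counting loop with scalar keys; ravel() guards against any remaining 2-D arrays.
sizes = {}
for yt, yp in zip(y_test.ravel(), y_pred.ravel()):
    key = (float(yt), float(yp))   # plain floats are hashable
    sizes[key] = sizes.get(key, 0) + 1

keys = list(sizes.keys())
plt.scatter([k[0] for k in keys], [k[1] for k in keys],
            s=[sizes[k] for k in keys], color='coral')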
I have read quite a bit on this particular error and haven't been able to find an answer that addresses my issue. I have a data set that I have split into train and test sets, and am looking to run a KNeighborsClassifier. My code is below. My problem is that when I look at the dtypes of my X_train, I don't see any string-formatted columns at all. My y_train is a single categorical variable. This is my first Stack Overflow post, so my apologies if I've overlooked any formalities, and thanks for the help! :)
Error:
TypeError: unorderable types: str() > float()
Dtypes:
X_train.dtypes.value_counts()
Out[54]:
int64 2035
float64 178
dtype: int64
Code:
# Import Packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.cross_validation import train_test_split, KFold
from matplotlib.ticker import FormatStrFormatter
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pdb
# Set Directory Path
path = "file_path"
os.chdir(path)
#Select Import File
data = 'RawData2.csv'
delim = ','
#Import Data File
df = pd.read_csv(data, sep = delim)
print (df.head())
df.columns.get_loc('Categories')
#Model
#Select/Update Features
X = df[df.columns[14:2215]]
#Get Column Index for Target Variable
df.columns.get_loc('Categories')
#Select Target and fill na's with "Small" label
y = df[df.columns[21]]
print(y.values)
y.fillna('Small')
#Training/Test Set
X_sample = X.loc[X.Var1 <1279]
X_valid = X.loc[X.Var1 > 1278]
y_sample = y.head(len(X_sample))
y_test = y.head(len(y)-len(X_sample))
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
print(X_train.shape, y_train.shape)
X_train.dtypes.value_counts()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)  # <-- This is where the error is flagged
accuracy_score(y_test, knn.predict(X_test))
Everything in sklearn is based on NumPy, which only works with numbers, so categorical X and y need to be encoded as numbers. For X you can use get_dummies; for y you can use LabelEncoder.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
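A minimal sketch of that encoding, reusing the question's own column selections and fill value (both assumptions taken from the posted code):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encode any categorical feature columns so X is purely numeric.
X = pd.get_dummies(df[df.columns[14:2215]])

# Fill missing labels first so the target holds a single type, then encode.
le = LabelEncoder()
y = le.fit_transform(df['Categories'].fillna('Small'))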