I am using Python to implement linear regression on a dataset, but at this step I continuously get this error - python-3.x

I wrote this linear regression code, and it now raises an error in the iterate_weights function:
index 200 is out of bounds for axis 0 with size 200
I don't know what is wrong. Also, when I print my weights after updating them, they come out the same as the random values I chose at the start. I am using Jupyter Notebook.
Are there any mistakes?
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
#importing dataset
data = pd.read_csv(r'F:\WOC\linearreg.csv')  # raw string so the backslashes are not read as escapes
print(data.shape)
data.head()
data_arr = np.genfromtxt(r"F:\WOC\linearreg.csv", delimiter=",", skip_header=1)
print(data_arr)
# In[3]:
#collecting x and y
x_train = data_arr[:,1:4]
y_train = data_arr[:,4:5]
print(x_train)
print(y_train)
# In[4]:
weights_shape = y_train.shape
print(weights_shape)
r,c = x_train.shape
print(r,c)
w = np.random.randn(c,1)
w_num = len(w)
print(w)
# In[5]:
h = np.dot(x_train,w)
def cost_function():
    print(h)
    j = (1/2*r)*((h-y_train)**2)
    print('j',j)
cost_function()
# In[6]:
def iterate_weights():
    L=0.01
    iterations = 1000
    for iterations_proceed in range(1,1001):
        for i in range(w_num):
            for m in range(1,201):
                w[i,0] = w[i,0]-L*((1/r)*(sum(h-y_train)*(x_train[m,i])))
    print(w)
iterate_weights()
# In[7]:
h = np.dot(x_train,w)
def cost_function1():
    j = np.sum((1/2*r)*((h-y_train)**2))
    print(j)
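A possible fix, sketched under assumptions. The IndexError comes from range(1,201): it reaches row index 200, while x_train only has rows 0 to 199. In addition, h is computed once outside the loop, so the update never sees the new weights, and (1/2*r) evaluates to r/2 rather than 1/(2r). The sketch below replaces the three nested loops with a vectorized batch gradient descent step. The synthetic data, true_w, and the function signatures are illustrative stand-ins, not part of the original notebook.

```python
import numpy as np

# Synthetic stand-in for linearreg.csv (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
r, c = 200, 3
x_train = rng.normal(size=(r, c))
true_w = np.array([[2.0], [-1.0], [0.5]])   # hypothetical "true" weights
y_train = x_train @ true_w


def cost_function(x, y, w):
    """Half mean squared error: sum((h - y)^2) / (2 * r)."""
    n_rows = x.shape[0]
    h = x @ w
    return float(np.sum((h - y) ** 2) / (2 * n_rows))


def iterate_weights(x, y, learning_rate=0.01, iterations=1000):
    """Batch gradient descent; returns the fitted weight column vector."""
    n_rows, n_cols = x.shape
    w = np.zeros((n_cols, 1))
    for _ in range(iterations):
        h = x @ w                          # recompute predictions every iteration
        gradient = x.T @ (h - y) / n_rows  # uses all rows, indices 0..n_rows-1
        w -= learning_rate * gradient
    return w


w = iterate_weights(x_train, y_train)
print(w.ravel())
print(cost_function(x_train, y_train, w))
```

Returning w from the function, instead of mutating a global, also makes it easier to check that the weights really changed.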

Related

Predicting classes in MNIST dataset with a Gaussian - the same prediction errors with different parameters?

I am trying to find the best c parameter for a task with the following instructions: define a function, fit_generative_model, that takes as input a training set (train_data, train_labels) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, these are:
pi[j]: the frequency of that label
mu[j]: the 784-dimensional mean vector
sigma[j]: the 784x784 covariance matrix
It is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. c is now a parameter, and by setting it appropriately, we can improve the performance of the model.
%matplotlib inline
import sys
import matplotlib.pyplot as plt
import gzip, os
import numpy as np
from scipy.stats import multivariate_normal
if sys.version_info[0] == 2:
    from urllib import urlretrieve
else:
    from urllib.request import urlretrieve
# Downloads the dataset
def download(filename, source='http://yann.lecun.com/exdb/mnist/'):
    print("Downloading %s" % filename)
    urlretrieve(source + filename, filename)
# Invokes download() if necessary, then reads in images
def load_mnist_images(filename):
    if not os.path.exists(filename):
        download(filename)
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=16)
    data = data.reshape(-1,784)
    return data
def load_mnist_labels(filename):
    if not os.path.exists(filename):
        download(filename)
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=8)
    return data
## Load the training set
train_data = load_mnist_images('train-images-idx3-ubyte.gz')
train_labels = load_mnist_labels('train-labels-idx1-ubyte.gz')
## Load the testing set
test_data = load_mnist_images('t10k-images-idx3-ubyte.gz')
test_labels = load_mnist_labels('t10k-labels-idx1-ubyte.gz')
train_data.shape, train_labels.shape
So I have written this code for three different values of c, but each of them gives me the same number of errors:
def fit_generative_model(x,y):
    lst=[]
    for c in [20,200, 4000]:
        k = 10 # labels 0,1,...,k-1
        d = (x.shape)[1] # number of features
        mu = np.zeros((k,d))
        sigma = np.zeros((k,d,d))
        pi = np.zeros(k)
        for label in range(0,k):
            indices = (y == label)
            mu[label] = np.mean(x[indices,:], axis=0)
            sigma[label] = np.cov(x[indices,:], rowvar=0, bias=1) + c*np.identity(784) # I define the identity matrix
        predictions = np.argmax(score, axis=1)
        errors = np.sum(predictions != y)
        lst.append(errors)
        print(c,"Model makes " + str(errors) + " errors out of 10000", lst)
Then I fit it to the training data and get these same errors:
mu, sigma, pi = fit_generative_model(train_data, train_labels)
20 Model makes 1 errors out of 10000 [1]
200 Model makes 1 errors out of 10000 [1, 1]
4000 Model makes 1 errors out of 10000 [1, 1, 1]
and to the test data:
mu, sigma, pi = fit_generative_model(test_data, test_labels)
20 Model makes 9020 errors out of 10000 [9020]
200 Model makes 9020 errors out of 10000 [9020, 9020]
4000 Model makes 9020 errors out of 10000 [9020, 9020, 9020]
What am I doing wrong? The correct answer is c=4000, which yields an error rate of ~4.3%.
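One likely reason for the identical counts, offered as a reading of the posted code rather than a confirmed diagnosis: inside the loop over c, score is never recomputed from mu, sigma and pi, so it presumably comes from an earlier notebook cell and does not depend on c. The function also never fills pi or returns anything. The sketch below fits on a training set, scores a separate test set with the fitted parameters, and repeats that for each c. It uses small synthetic data instead of MNIST; the helper count_errors, the data generator, and the c values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal


def fit_generative_model(x, y, c, k):
    """Fit one Gaussian per label; c * I is added to each covariance matrix."""
    d = x.shape[1]
    mu = np.zeros((k, d))
    sigma = np.zeros((k, d, d))
    pi = np.zeros(k)
    for label in range(k):
        indices = (y == label)
        pi[label] = np.mean(indices)  # label frequency
        mu[label] = np.mean(x[indices], axis=0)
        sigma[label] = np.cov(x[indices], rowvar=False, bias=True) + c * np.identity(d)
    return mu, sigma, pi


def count_errors(x, y, mu, sigma, pi):
    """Score every label with log prior + Gaussian log density, then count mistakes."""
    k = len(pi)
    score = np.zeros((len(y), k))
    for label in range(k):
        score[:, label] = np.log(pi[label]) + multivariate_normal.logpdf(
            x, mean=mu[label], cov=sigma[label])
    predictions = np.argmax(score, axis=1)
    return int(np.sum(predictions != y))


# Synthetic stand-in for MNIST: 3 labels, 5 features (assumption for illustration).
rng = np.random.default_rng(0)
k, d, n = 3, 5, 300
centers = rng.normal(scale=3.0, size=(k, d))


def sample(n_rows):
    labels = rng.integers(0, k, size=n_rows)
    return centers[labels] + rng.normal(size=(n_rows, d)), labels


train_x, train_y = sample(n)
test_x, test_y = sample(n)

results = {}
for c in [0.1, 1.0, 10.0]:
    mu, sigma, pi = fit_generative_model(train_x, train_y, c, k)  # fit on training data
    results[c] = count_errors(test_x, test_y, mu, sigma, pi)      # evaluate on test data
print(results)
```

With this structure the score matrix is rebuilt for every c, so any difference between regularization strengths shows up in the error counts.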

Argument must be a string or a number issue, Not 'Type' - Pyspark

Update:
I have been looking into the issue, and the problem is with the scikit-multiflow data stream. In the last quarter of the code, in stream_clf.partial_fit(X,y, classes=stream.target_values), each class value in stream.target_values should be a number or a string, but the method is returning a type (dtype) instead. When I print or loop over stream.target_values I get this:
I have tried type conversions and similar fixes, but to no avail. Can someone please help?
Initial Problem
I am running some code (took inspiration from here). It works perfectly well in a vanilla Python environment.
But if I run this code, after some modifications, in Apache Spark using PySpark, I get the following error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'
I have tried every possible way to trace the issue, but everything looks all right. The error arises from the last line of the code, where the Hoeffding tree is called for prediction. It expects an ndarray, and the type of the X variable is also ndarray. I am not sure what is triggering the issue. Can someone please help or point me in the right direction?
Complete stack trace of the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-1310132c88db> in <module>
30 D3_win.addInstance(X,y)
31 xx = np.array(X,dtype='float64')
---> 32 y_hat = stream_clf.predict(xx)
33
34
~/conceptDrift/projectTest/lib/python3.5/site-packages/skmultiflow/trees/hoeffding_tree.py in predict(self, X)
1068 r, _ = get_dimensions(X)
1069 predictions = []
-> 1070 y_proba = self.predict_proba(X)
1071 for i in range(r):
1072 index = np.argmax(y_proba[i])
~/conceptDrift/projectTest/lib/python3.5/site-packages/skmultiflow/trees/hoeffding_tree.py in predict_proba(self, X)
1099 votes = normalize_values_in_dict(votes, inplace=False)
1100 if self.classes is not None:
-> 1101 y_proba = np.zeros(int(max(self.classes)) + 1)
1102 else:
1103 y_proba = np.zeros(int(max(votes.keys())) + 1)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'type'
Code
import findspark
findspark.init()
import pyspark as ps
import warnings
from pyspark.sql import functions as fn
import sys
from pyspark import SparkContext,SparkConf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score as AUC
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from skmultiflow.trees.hoeffding_tree import HoeffdingTree
from skmultiflow.data.data_stream import DataStream
import time
def drift_detector(S,T,threshold = 0.75):
    T = pd.DataFrame(T)
    #print(T)
    S = pd.DataFrame(S)
    # Give slack variable in_target which is 1 for old and 0 for new
    T['in_target'] = 0 # in target set
    S['in_target'] = 1 # in source set
    # Combine source and target with new slack variable
    ST = pd.concat( [T, S], ignore_index=True, axis=0)
    labels = ST['in_target'].values
    ST = ST.drop('in_target', axis=1).values
    # You can use any classifier for this step. We advise it to be a simple one as we want to see whether source
    # and target differ not to classify them.
    clf = LogisticRegression(solver='liblinear')
    predictions = np.zeros(labels.shape)
    # Divide ST into two equal chunks
    # Train LR on a chunk and classify the other chunk
    # Calculate AUC for original labels (in_target) and predicted ones
    skf = StratifiedKFold(n_splits=2, shuffle=True)
    for train_idx, test_idx in skf.split(ST, labels):
        X_train, X_test = ST[train_idx], ST[test_idx]
        y_train, y_test = labels[train_idx], labels[test_idx]
        clf.fit(X_train, y_train)
        probs = clf.predict_proba(X_test)[:, 1]
        predictions[test_idx] = probs
    auc_score = AUC(labels, predictions)
    print(auc_score)
    # Signal drift if AUC is larger than the threshold
    if auc_score > threshold:
        return True
    else:
        return False
class D3():
    def __init__(self, w, rho, dim, auc):
        self.size = int(w*(1+rho))
        self.win_data = np.zeros((self.size,dim))
        self.win_label = np.zeros(self.size)
        self.w = w
        self.rho = rho
        self.dim = dim
        self.auc = auc
        self.drift_count = 0
        self.window_index = 0
    def addInstance(self,X,y):
        if(self.isEmpty()):
            self.win_data[self.window_index] = X
            self.win_label[self.window_index] = y
            self.window_index = self.window_index + 1
        else:
            print("Error: Buffer is full!")
    def isEmpty(self):
        return self.window_index < self.size
    def driftCheck(self):
        if drift_detector(self.win_data[:self.w], self.win_data[self.w:self.size], auc): #returns true if drift is detected
            self.window_index = int(self.w * self.rho)
            self.win_data = np.roll(self.win_data, -1*self.w, axis=0)
            self.win_label = np.roll(self.win_label, -1*self.w, axis=0)
            self.drift_count = self.drift_count + 1
            return True
        else:
            self.window_index = self.w
            self.win_data = np.roll(self.win_data, -1*(int(self.w*self.rho)), axis=0)
            self.win_label =np.roll(self.win_label, -1*(int(self.w*self.rho)), axis=0)
            return False
    def getCurrentData(self):
        return self.win_data[:self.window_index]
    def getCurrentLabels(self):
        return self.win_label[:self.window_index]
def select_data(x):
    x = "/user/hadoop1/tellus/sea_1.csv"
    peopleDF = spark.read.csv(x, header= True)
    df = peopleDF.toPandas()
    scaler = MinMaxScaler()
    df.iloc[:,0:df.shape[1]-1] = scaler.fit_transform(df.iloc[:,0:df.shape[1]-1])
    return df
def check_true(y,y_hat):
    if(y==y_hat):
        return 1
    else:
        return 0
df = select_data("/user/hadoop1/tellus/sea_1.csv")
stream = DataStream(df)
stream.prepare_for_use()
stream_clf = HoeffdingTree()
w = int(2000)
rho = float(0.4)
auc = float(0.60)
# In[ ]:
D3_win = D3(w,rho,stream.n_features,auc)
stream_acc = []
stream_record = []
stream_true= 0
i=0
start = time.time()
X,y = stream.next_sample(int(w*rho))
stream_clf.partial_fit(X,y, classes=stream.target_values)
while(stream.has_more_samples()):
    X,y = stream.next_sample()
    if D3_win.isEmpty():
        D3_win.addInstance(X,y)
        y_hat = stream_clf.predict(X)
The problem was in the select_data() function: the data types of the variables were being changed during execution. This issue is fixed now.
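One plausible way the data types could change, stated as an assumption because the post does not say which types were involved: spark.read.csv(x, header=True) without inferSchema=True reads every column as a string, so after toPandas() the label column holds strings rather than numbers. The sketch below reproduces that situation with a small hand-made pandas frame (the column names and values are hypothetical) and converts the columns back with pd.to_numeric.

```python
import pandas as pd

# Stand-in for peopleDF.toPandas() when the CSV was read without inferSchema:
# every value is a string. Column names and values are hypothetical.
df = pd.DataFrame({
    "attrib1": ["7.3", "1.4", "9.9"],
    "attrib2": ["2.0", "8.8", "4.1"],
    "label": ["1", "0", "1"],
})
print(df.dtypes)          # text columns, not numeric ones

# Convert every column to a numeric dtype before building the stream.
converted = df.apply(pd.to_numeric)
print(converted.dtypes)   # float columns for the features, integer for the label
```

Passing inferSchema=True to spark.read.csv is another way to get typed columns, this time on the Spark side before the conversion to pandas.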

Solving simple ODE using scipy odeint gives straight line at 0

I am trying to solve a simple ODE:
dN/dt = N*(rho(t)-beta)/lambda
rho is a function of time, and I've generated it using linspace. The code works for other equations, but for this one the solution is a flat straight line at 0 (you can see it in the graph). Any guidance on how to correct it?
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def model2(N, t, rho):
    beta_val = 0.0065
    lambda_val = 0.00002
    k = (rho - beta_val) / lambda_val
    dNdt = k*N
    print(rho)
    return dNdt
# initial condition
N0 = [0]
# number of time points
n = 200
# time points
t = np.linspace(0,200,n)
rho = np.linspace(6,9,n)
#rho =np.array([6,6.1,6.2,6.3,6.4,6.5,6.6,6.7,6.8,6.9,7.0,7.1,7.2,7.3,7.4,7.5,7.6,7.7,7.8,7.9]) # Array of constants
# store solution
NSol = np.empty_like(t)
# record initial conditions
NSol[0] = N0[0]
# solve ODE
for i in range(1,n):
    # span for next time step
    tspan = [t[i-1],t[i]]
    # solve for next step
    N = odeint(model2,N0,tspan,args=(rho[i],))
    print(N)
    # store solution for plotting
    NSol[i] = N[0][0]
    # next initial condition
    #z0 = N0[0]
# plot results
plt.plot(t,rho,'g:',label='rho(t)')
plt.plot(t,NSol,'b-',label='NSol(t)')
plt.ylabel('values')
plt.xlabel('time')
plt.legend(loc='best')
plt.show()
This is the graph I get after running this code
I modified your code (and the coefficients) to make it work.
When the coefficients also depend on t, they have to be Python functions called by the derivative function:
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
# Define
def model2(N, t, rho):
    beta_val = 0.0065
    lambda_val = 0.02
    k = ( rho(t) - beta_val )/lambda_val
    dNdt = k*N
    return dNdt
def rho(t):
    return .001 + .003/20*t
# Solve
tspan = np.linspace(0, 20, 10)
N0 = .01
N = odeint(model2, N0 , tspan, args=(rho,))
# Plot
plt.plot(tspan, N, label='NSol(t)');
plt.ylabel('N');
plt.xlabel('time'); plt.legend(loc='best');
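Two details of the question's code explain the flat line and may be worth checking separately from the coefficients. First, N = 0 is an equilibrium of dN/dt = k*N, so starting from N0 = [0] the solution stays at zero whatever rho is. Second, the loop stores N[0][0], which is the initial value of each step, and never updates N0. The sketch below reuses the coefficients from the modified code above and compares a zero and a nonzero initial condition; the closed-form check is an addition for illustration, derived for this particular rho(t).

```python
import numpy as np
from scipy.integrate import odeint


def rho(t):
    return 0.001 + 0.003 / 20 * t


def model2(N, t, rho):
    beta_val = 0.0065
    lambda_val = 0.02
    k = (rho(t) - beta_val) / lambda_val
    return k * N


tspan = np.linspace(0, 20, 10)

# Zero initial condition: the derivative is k * 0 = 0, so N never leaves 0.
N_zero = odeint(model2, 0.0, tspan, args=(rho,))

# Nonzero initial condition: the solution actually evolves.
N_pos = odeint(model2, 0.01, tspan, args=(rho,))

# For this rho(t), k(t) = -0.275 + 0.0075 * t, so
# N(t) = N0 * exp(-0.275 * t + 0.00375 * t**2).
exact = 0.01 * np.exp(-0.275 * tspan + 0.00375 * tspan**2)
print(np.max(np.abs(N_pos[:, 0] - exact)))
```

If the step-by-step loop from the question is kept, the stored value would need to be the last row of each odeint result, and that value would also have to become the next initial condition.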

Index error index 14238 is out of bounds for axis 0 with size 2

%pylab inline
import numpy as np
import pandas as pd
import random
import time
import scipy
import sklearn.feature_extraction
import pickle
from sklearn.cross_validation import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
bedsizes = {'None':0,
'Rest All':1}
invbedsizes = {v: k for k, v in bedsizes.items()}
model = joblib.load('model_bed_size.pkl')
vocab = pickle.load(open('dictionary', 'rb'))
var=pd.read_csv('Train_variables.csv')
dtest = pd.read_csv('/home/ubuntu/test_null_new.csv', usecols= ("Bed_size","title","short_description","long_description","primary_shelf.all_paths_str","attributes.all_shelves.0","attributes.all_shelves.1","attributes.all_shelves.2","attributes.all_shelves.3","attributes.all_shelves.4","attributes.type.0","attributes.type.1","attributes.type.2","item_id","last_updated_at"),encoding='ISO-8859-1')
lentest = len(dtest)
vocab=vocab["Vocabulary"].to_dict()
Xall = []
i=1
for col in var['Variable']:
    vectorizer = CountVectorizer(min_df=1, vocabulary=(vocab[i]), token_pattern = '\\b\\w+\\b')
    Xall.append(vectorizer.transform(dtest[col].astype(str)))
    j=i
    i=j+1
    print (col, 'Done', shape(Xall[-1]))
Xspall = scipy.sparse.hstack(Xall)
X_test_final = scipy.sparse.csr_matrix(Xspall)
print (shape(X_test_final))
ypred = model.decision_function(X_test_final)
ypredc = model.classes_[np.argmax(ypred, axis = 0)]
ypredcon = (np.max(ypred, axis = 1) + 2.) / 8.
ypredcon[ypredcon < 0.] = 0.
ypredcon[ypredcon > 1.] = 1.
dfinal = pd.DataFrame()
dfinal['item_id '] = dtest['item_id']
dfinal['Predictions'] = ypredc
dfinal['Predictions'].replace(invbedsizes, inplace = True)
dfinal['confidence_score'] = ypredcon
The above code gives an IndexError saying that index 14328 is out of bounds for axis 0 with size 2.
The error occurs at this line:
ypredc = model.classes_[np.argmax(ypred, axis = 0)]
Can anyone help me with this?
Without knowing much about the variables in your code, the error indicates that at
ypred = model.decision_function(X_test_final)
ypredc = model.classes_[np.argmax(ypred, axis = 0)]
error: index 14328 is out of bounds for axis 0 and size 2
model.classes_ has one or more dimensions, and the first has size 2; in other words, 2 rows/classes, and possibly many columns.
ypred is probably quite large, and np.argmax(ypred...) is the index of its largest value (along axis 0), i.e. 14328.
Maybe the correct use is model.classes_[:, np.argmax...].
You need to look at the shapes of ypred and model.classes_, and possibly other variables in this area.
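A complementary reading, assuming model is a two-class LinearSVC as the imports and the two-entry bedsizes dictionary suggest: for a binary problem, scikit-learn's decision_function returns a 1-D array with one score per sample, so np.argmax(ypred, axis=0) yields a single sample index (here 14328) rather than a class index per row. The sketch below imitates that situation with plain NumPy arrays; the array sizes and scores are made up for illustration.

```python
import numpy as np

classes_ = np.array([0, 1])      # two classes, like model.classes_
ypred = np.zeros(20000)          # binary decision_function output: shape (n_samples,)
ypred[:10000] = -1.0             # negative score -> first class
ypred[10000:] = 1.0              # positive score -> second class
ypred[14328] = 5.0               # make sample 14328 the highest-scoring one

idx = np.argmax(ypred, axis=0)   # index of a sample, not of a class
try:
    classes_[idx]
    raised = False
except IndexError as exc:
    raised = True
    print(exc)

# For 1-D binary scores, the predicted class is classes_[1] where the score is positive.
ypredc = classes_[(ypred > 0).astype(int)]
print(ypredc[:3], ypredc[-3:])
```

Calling model.predict(X_test_final) applies the same rule directly. With 1-D scores, the later np.max(ypred, axis = 1) would also need adjusting, for example to np.abs(ypred).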

Wrong intercept in Spark linear regression

I am starting with Spark linear regression. I am trying to fit a line to a linear dataset. It seems that the intercept is not being adjusted correctly, or perhaps I am missing something.
With intercept=False:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=False)
This seems normal. But when I use intercept=True:
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.0001, intercept=True)
The model that I get in the last case is exactly:
(weights=[0.0353471289751], intercept=1.0005127185289888)
I have tried different datasets, step sizes and iteration counts, but the model always converges with an intercept of about 1.
EDIT - This is the code I am using:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
import numpy as np
import matplotlib.pyplot as plt
from pyspark import SparkContext
sc = SparkContext("local", "regression")
# Generate data
SIZE = 300
SLOPE = 0.1
BASE = -30
NOISE = 10
x = np.arange(SIZE)
delta = np.random.uniform(-NOISE,NOISE, size=(SIZE,))
y = BASE + SLOPE*x + delta
data = zip(range(len(y)), y) # zip with index
dataRDD = sc.parallelize(data)
# Normalize data
# mean = np.mean(data)
# std = np.std(data)
# dataRDD = dataRDD.map(lambda r: (r[0], (float(r[1])-mean)/std))
labeledData = dataRDD.map(lambda r: LabeledPoint(float(r[1]), [float(r[0])]))
# Create linear model
linear_model = LinearRegressionWithSGD.train(labeledData, iterations=1000, step=0.0002, intercept=True, convergenceTol=0.000001)
print(linear_model)
true_vs_predicted = labeledData.map(lambda p: (p.label, linear_model.predict(p.features))).collect()
# PLOT
fig = plt.figure()
ax = fig.add_subplot(111)
ax.grid()
y_real = [x[0] for x in true_vs_predicted]
y_pred = [x[1] for x in true_vs_predicted]
plt.plot(range(len(y_real)), y_real, 'o', markersize=5, c='b')
plt.plot(range(len(y_pred)), y_pred, 'o', markersize=5, c='r')
plt.show()
This is because the number of iterations and the step size are both too small. As a result, training ends before it reaches the optimum.
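The effect can be illustrated outside Spark with plain gradient descent in NumPy. This is only a sketch under assumptions: Spark's LinearRegressionWithSGD uses its own sampling and step-size schedule, so the numbers are not comparable, and the gradient_descent helper and the standardization step are additions for illustration. With raw x values from 0 to 299 the step has to be tiny to stay stable, and the intercept then barely moves in 1000 iterations; after standardizing x, a much larger step is stable and the intercept is recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
SIZE, SLOPE, BASE, NOISE = 300, 0.1, -30, 10
x = np.arange(SIZE, dtype=float)
y = BASE + SLOPE * x + rng.uniform(-NOISE, NOISE, size=SIZE)


def gradient_descent(x, y, step, iterations):
    """Plain batch gradient descent for y = w * x + b, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        residual = w * x + b - y
        w -= step * np.mean(residual * x)
        b -= step * np.mean(residual)
    return w, b


# Raw feature: a small step is needed for stability, and the intercept stays near 0.
w_raw, b_raw = gradient_descent(x, y, step=1e-5, iterations=1000)

# Standardized feature: a larger step is stable and both parameters converge.
z = (x - x.mean()) / x.std()
w_z, b_z = gradient_descent(z, y, step=0.1, iterations=1000)
w_std = w_z / x.std()                     # map back to the original scale
b_std = b_z - w_z * x.mean() / x.std()

print("raw:         ", w_raw, b_raw)
print("standardized:", w_std, b_std)
```

In the Spark code this corresponds to scaling the feature before building the LabeledPoint objects, in the spirit of the commented-out normalization lines, or to running many more iterations.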
