SKlearn predicting tail instead of head - python-3.x

I'm trying to create a simple prediction using LinearRegression. In my mind, this should predict future values, but I'm clearly getting something wrong: it seems to be drawing its data from the tail of the dataframe instead of the most recent datapoints. I'm using Google's stock prices, fetched through the alpha_vantage API:
import math
import numpy as np
from alpha_vantage.timeseries import TimeSeries
from sklearn import preprocessing, model_selection
from sklearn.linear_model import LinearRegression

ts = TimeSeries(key=api_key, output_format='pandas')
df, df_meta = ts.get_daily(symbol='GOOGL', outputsize='full')
df = df[['1. open', '2. high', '3. low', '4. close', '5. volume']]
df['HL_PCT'] = (df['2. high'] - df['4. close']) / df['4. close'] * 100.0
df['PCT_change'] = (df['4. close'] - df['1. open']) / df['1. open'] * 100.0
df = df[['4. close', 'HL_PCT', 'PCT_change', '5. volume']]
forecast_col = '4. close'
df.fillna(-99999, inplace=True)
forecast_out = int(math.ceil(0.01*len(df)))
print(forecast_out)
# Shift the forecast column up, so each row's label is the close
# price forecast_out days ahead
df['label'] = df[forecast_col].shift(-forecast_out)
# Features
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
# Labels
y = np.array(df['label'])
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = LinearRegression(n_jobs=-1)
# Fit is synonymous with train
clf.fit(X_train, y_train)
# Score is synonymous with test
accuracy = clf.score(X_test, y_test)
forecast_set = clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)
It returns values around the 200s, which clearly aren't predictions for 2020.

I found out that the tutorial I was following had the dataframe flipped, so that my most recent datapoints were his oldest datapoints. This was easily resolved by reversing my own dataframe:
df = df[::-1]
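Since alpha_vantage's pandas output is indexed by date, sorting on the index is a slightly more robust alternative to blindly reversing, in case the API ever changes its ordering. A minimal sketch, assuming df still carries the date index returned by get_daily:

import pandas as pd

# Sort oldest-to-newest by the date index instead of assuming the current
# order; this is a no-op if the frame is already ascending.
df.index = pd.to_datetime(df.index)
df = df.sort_index(ascending=True)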

Related

SVM returns the same accuracy no matter which dataset is used

I have 50 different pairs of training and test sets. No matter which pair I choose, I always get 0.5 as the accuracy, so something is definitely off. I would really appreciate it if you could check that I am using everything correctly. The data consists of vectors that have to be classified as either 1 or 0.
Below is my code:
import pandas as pd
from sklearn import svm

train_set = pd.read_csv('C:/Users/.../train_set7.csv', sep=',')
test_set = pd.read_csv('C:/Users/.../test_set7.csv', sep=',')
train_set_values = train_set.iloc[:,0:51]
labels_train = train_set['50']
vects_train = [i for i in train_set_values.values]
test_set_values = test_set.iloc[:,0:51]
labels_test = test_set['50']
vects_test = [i for i in test_set_values.values]
clf = svm.SVC(kernel='linear', C=1000)
clf.fit(vects_train, labels_train)
from sklearn.metrics import accuracy_score
pred = clf.predict(vects_test)
accuracy_score(labels_test, pred)
>0.5
clf.score(vects_test, labels_test)
>0.5
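One thing worth double-checking in the snippet above: iloc[:, 0:51] selects 51 columns (indices 0 through 50), so if the label lives in the column named '50', it may be sitting inside the feature matrix too. A quick sanity check, assuming the CSV layout described in the question (the path is shortened like the original):

import pandas as pd

train_set = pd.read_csv('C:/Users/.../train_set7.csv', sep=',')
# Confirm the label column is excluded from the features...
feature_cols = [c for c in train_set.columns if c != '50']
print(len(feature_cols), 'feature columns')
# ...and that both classes actually appear in the labels.
print(train_set['50'].value_counts())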

Train_test_split set with 50-50 returns high accuracy but low when separated in 2 files

I have one dataset (train_plus_test.csv) with 1275 rows, feature columns, and labels for classifying two activities, Walking and Lying. It is a balanced dataset with the same number of rows for each class.
I run Random Forest in 2 scenarios:
Scenario 1: Training on train_plus_test.csv with a 0.75-0.25 train-test split gives 91.8% accuracy.
Scenario 2: I divide train_plus_test.csv into two files, train.csv and test.csv, at the same 75%-25% split, then train the model on train.csv and predict on test.csv, but the accuracy is 52%. I am now wondering where exactly I am wrong?
Thank you for reading!
The Python code (below) and the 3 CSV files are included here:
Google Drive: https://drive.google.com/drive/folders/1AAOOFhR1QpoPPtSNTofBnouBaYHfFbir?usp=sharing&fbclid=IwAR10SjHCu-6Sszd-okes-IneAA8pWzals9-NNtAsmrw0ql28mk3geZfmnQI
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
# Scenario 1 ==================>
dataset = pd.read_csv('train_plus_test.csv')
feature_cols = list(dataset.columns.values)
feature_cols.remove('label')
X = dataset[feature_cols] # Features
y = dataset['label'] # Target
clf_RF = RandomForestClassifier(n_estimators=100, random_state=0, max_features=8, min_samples_leaf=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
clf_RF.fit(X_train, y_train)
y_pred_RF = clf_RF.predict(X_test)
print('Accuracy of training')
print(metrics.accuracy_score(y_test, y_pred_RF))
# Scenario 2 ======= comment out Scenario 1 before running Scenario 2 ===========>
train_dataset = pd.read_csv('train.csv')
test_dataset = pd.read_csv('test.csv')
feature_cols = list(train_dataset.columns.values)
feature_cols.remove('label')
clf_RF = RandomForestClassifier(n_estimators=100, random_state=0, max_features=8, min_samples_leaf=3 )
X = train_dataset[feature_cols] # Features
y = train_dataset['label'] # Target
clf_RF.fit(X, y)
X_test_data = test_dataset[feature_cols]
y_test_data = test_dataset['label']
y_test_pred = clf_RF.predict(X_test_data)
print('Accuracy of testing')
print(metrics.accuracy_score(y_test_data, y_test_pred))
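If train.csv and test.csv were produced by cutting train_plus_test.csv at the 75% mark without shuffling, and the rows are grouped by activity, the two files can end up with very different class mixtures, which would explain a drop to ~52%. A hedged sketch of regenerating the two files with a shuffled, stratified split (an assumption about how the files were made, not something the question confirms):

import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('train_plus_test.csv')
# Shuffle and stratify on the label so both files keep the balanced class mix.
train_df, test_df = train_test_split(
    dataset, test_size=0.25, random_state=42, stratify=dataset['label'])
train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)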

3-layer feedforward neural network not predicting regression values accurately

I'm pretty new to TensorFlow. Currently, I'm building a 3-layer network with 10 neurons in the hidden layer (ReLU activation), mini-batch gradient descent with a batch size of 8, and an L2 regularisation weight-decay parameter (beta) of 0.001. I'm using TensorFlow 1.14 on Python 3.6.
The issue that boggles my mind is that my predicted values and testing errors are absolutely off the charts. For example, I plotted the test errors and the predicted vs. target values for a sample of 50, and both plots are way off; I haven't the slightest clue as to why.
Here's roughly what the dataset looks like: the first column is discarded, as it is just a counter value, and the last column is the target.
My code:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

NUM_FEATURES = 7
num_neuron = 10
batch_size = 8
beta = 0.001
learning_rate = 0.001
epochs = 4000
seed = 10
np.random.seed(seed)
# read and divide data into test and train sets
total_dataset= np.genfromtxt('dataset_excel.csv', delimiter=',')
X_data, Y_data = total_dataset[1:, 1:8], total_dataset[1:, -1]
Y_data = Y_data.reshape(Y_data.shape[0], 1)
# shuffle input, ensure both are shuffled with the same order
shufflestate = np.random.get_state()
np.random.shuffle(X_data)
np.random.set_state(shufflestate)
np.random.shuffle(Y_data)
# 70% used for training, 30% used for testing
trainX = X_data[:280]
trainY = Y_data[:280]
testX = X_data[280:]
testY = Y_data[280:]
trainX = (trainX - np.mean(trainX, axis=0)) / np.std(trainX, axis=0)
# Create the model
x = tf.placeholder(tf.float32, [None, NUM_FEATURES])
y_ = tf.placeholder(tf.float32, [None, 1])
# get 50 samples for plotting of predicted vs target values
limited50testX = testX[:50]
limited50testY = testY[:50]
# Hidden layer
with tf.name_scope('hidden'):
    weight1 = tf.Variable(tf.truncated_normal([NUM_FEATURES, num_neuron], stddev=1.0), name='weight1')
    bias1 = tf.Variable(tf.zeros([num_neuron]), name='bias1')
    hidden = tf.nn.relu(tf.matmul(x, weight1) + bias1)
# Output layer
with tf.name_scope('linear'):
    weight2 = tf.Variable(tf.truncated_normal([num_neuron, 1], stddev=1.0 / np.sqrt(float(num_neuron))), name='weight2')
    bias2 = tf.Variable(tf.zeros([1]), name='bias2')
    logits = tf.matmul(hidden, weight2) + bias2
ridgeLoss = tf.square(y_ - logits)
regularisation = tf.nn.l2_loss(weight1) + tf.nn.l2_loss(weight2)
loss = tf.reduce_mean(ridgeLoss + beta * regularisation)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)
error = tf.reduce_mean(tf.square(y_ - logits))
N = len(trainX)
idx = np.arange(N)
predicted=[]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_err = []
    test_err_ = []
    for i in range(epochs):
        for batchStart, batchEnd in zip(range(0, trainX.shape[0], batch_size),
                                        range(batch_size, trainX.shape[0], batch_size)):
            train_op.run(feed_dict={x: trainX[batchStart:batchEnd], y_: trainY[batchStart:batchEnd]})
        err = error.eval(feed_dict={x: trainX, y_: trainY})
        train_err.append(err)
        if i % 100 == 0:
            print('iter %d: train error %g' % (i, train_err[i]))
        test_err = error.eval(feed_dict={x: testX, y_: testY})
        test_err_.append(test_err)
    predicted = sess.run(logits, feed_dict={x: limited50testX})
    print("predicted values: ", predicted)
    print("size of predicted values is", len(predicted))
    print("targets: ", limited50testY)
    print("size of target values is", len(limited50testY))
#plot predictions vs targets
numberList=np.arange(0, 50, 1).tolist()
predplot = plt.figure(1)
plt.plot(numberList, predicted, label='Predictions')
plt.plot(numberList, limited50testY, label='Targets')
plt.xlabel('50 samples')
plt.ylabel('Value')
plt.legend(loc='lower right')
predplot.show()
# plot training error
trainplot = plt.figure(2)
plt.plot(range(epochs), train_err)
plt.xlabel(str(epochs) + ' iterations')
plt.ylabel('Train Error')
trainplot.show()
#plot testing error
testplot = plt.figure(3)
plt.plot(range(epochs), test_err_)
plt.xlabel(str(epochs) + ' iterations')
plt.ylabel('Test Error')
testplot.show()
Not sure if that's it, but trainX is normalized whereas testX is not. You might want to use the same normalization on testX before predicting.
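For concreteness, a minimal sketch of that suggestion: compute the statistics on trainX only and reuse them for testX (this replaces the single normalization line in the question):

# Normalization statistics must come from the training set only...
train_mean = np.mean(trainX, axis=0)
train_std = np.std(trainX, axis=0)
trainX = (trainX - train_mean) / train_std
# ...and the very same statistics are applied to the test set.
testX = (testX - train_mean) / train_std
limited50testX = testX[:50]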

TensorFlow : polynomial regression

I am trying to fit a non-linear regression to CSV data which is available at this link:
CSV Data
I want to use polynomial regression. The problem is that the result I am getting from TensorFlow is "None". I cannot find the problem; I think there is something wrong with the model or the cost function. Can anybody help? Any help would be appreciated.
# importing modules
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import csv
import time
# defining the method for gathering data
# date_idx is the column number of date in the .CSV file
def read(filename, date_idx, date_parse, year, bucket=7):
    # the number of days in the year: 365
    days_in_year = 365
    # a dictionary holding the frequency for each bucket
    freq = {}
    # the year is divided into buckets of 7 days each;
    # initialize each bucket's frequency to zero
    for period in range(0, int(days_in_year / bucket)):
        freq[period] = 0
    # open the file for reading
    with open(filename, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the first row, since it consists of headers only
        for row in csvreader:
            if row[date_idx] == '':  # if the date is unavailable,
                continue             # there is no need to check the data
            t = time.strptime(row[date_idx], date_parse)  # parse with the input format
            if t.tm_year == year and t.tm_yday < (days_in_year - 1):  # keep only the requested year
                freq[int(t.tm_yday / bucket)] += 1  # count the frequency
    return freq
# call the method to gather the data
freq = read(r'C:\My Files\Programming\Python\TensorFlow\CallCenter\311_Call_Center_Tracking_Data__Archived_.csv',
            0, '%m/%d/%Y', 2014)
# here we convert our dictionary into 2 arrays or lists in python
x_temp =[]
y_temp =[]
for key, value in freq.items():
x_temp.append(key)
y_temp.append(value)
x_data = np.asarray(x_temp)
y_data = np.asarray(y_temp)
# visualizing the data
plt.scatter(x_data,y_data)
plt.show()
# splitting data with ratio into 2 group : training and test
def split_dataset(x_dataset, y_dataset, ratio):
    arr = np.arange(x_dataset.size)
    np.random.shuffle(arr)
    num_train = int(ratio * x_dataset.size)
    x_train = x_dataset[arr[0:num_train]]
    y_train = y_dataset[arr[0:num_train]]
    x_test = x_dataset[arr[num_train:x_dataset.size]]
    y_test = y_dataset[arr[num_train:y_dataset.size]]
    return x_train, y_train, x_test, y_test
x_train, y_train, x_test, y_test = split_dataset(x_data,y_data, ratio=0.7)
# here we create some place holder for input and output of the session
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# defining global variables
learning_rate = 0.01
training_epochs = 100
num_coeffs = 5
# adding regularization (for later use)
#reg_lambda = 0.
# defining the coefficients of the polynomial
w = tf.Variable([0.]*num_coeffs, name='parameter')
# defining the model
def model(X, w):
    terms = []
    for i in range(num_coeffs):
        term = tf.multiply(w[i], tf.pow(X, i))
        terms.append(term)
    return tf.add_n(terms)
y_model = model(X,w)
# defining the cost function
cost = tf.reduce_sum(tf.pow(Y-y_model,2))
# defining training method
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# initializing all variables
init = tf.global_variables_initializer()
# running the model
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        training_cost = sess.run(train_op, feed_dict={X: x_train, Y: y_train})
        print(training_cost)
    final_cost = sess.run(cost, feed_dict={X: x_test, Y: y_test})
    print('Final cost = {}'.format(training_cost))
The problem is that training_cost = sess.run(train_op, feed_dict={X:x_train, Y:y_train}) doesn't return the training cost, because train_op is the operation that updates the parameters using gradient descent, not the operation that computes the cost function.
If you want to get the training cost, you should do the following:
_, training_cost = sess.run([train_op, cost], feed_dict={X:x_train, Y:y_train})
Where cost is the operation that you have defined previously as cost = tf.reduce_sum(tf.pow(Y-y_model,2))
I changed the code as follows:
# importing modules
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import csv
import time
# defining the method for gathering data
# date_idx is the column number of date in the .CSV file
def read(filename, date_idx, date_parse, year, bucket=7):
    # the number of days in the year: 365
    days_in_year = 365
    # a dictionary holding the frequency for each bucket
    freq = {}
    # the year is divided into buckets of 7 days each;
    # initialize each bucket's frequency to zero
    for period in range(0, int(days_in_year / bucket)):
        freq[period] = 0
    # open the file for reading
    with open(filename, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the first row, since it consists of headers only
        for row in csvreader:
            if row[date_idx] == '':  # if the date is unavailable,
                continue             # there is no need to check the data
            t = time.strptime(row[date_idx], date_parse)  # parse with the input format
            if t.tm_year == year and t.tm_yday < (days_in_year - 1):  # keep only the requested year
                freq[int(t.tm_yday / bucket)] += 1  # count the frequency
    return freq
# call the method to gather the data
freq = read(r'C:\My Files\Programming\Python\TensorFlow\CallCenter\311_Call_Center_Tracking_Data__Archived_.csv',
            0, '%m/%d/%Y', 2014)
# here we convert our dictionary into 2 arrays or lists in python
x_temp =[]
y_temp =[]
for key, value in freq.items():
x_temp.append(key)
y_temp.append(value)
x_data = np.asarray(x_temp)
x_data = x_data.astype(float)
y_data = np.asarray(y_temp)
y_data = y_data.astype(float)
print(x_data)
print(y_data)
# visualizing the data
plt.scatter(x_data,y_data)
plt.show()
# splitting data with ratio into 2 group : training and test
def split_dataset(x_dataset, y_dataset, ratio):
    arr = np.arange(x_dataset.size)
    np.random.shuffle(arr)
    num_train = int(ratio * x_dataset.size)
    x_train = x_dataset[arr[0:num_train]]
    y_train = y_dataset[arr[0:num_train]]
    x_test = x_dataset[arr[num_train:x_dataset.size]]
    y_test = y_dataset[arr[num_train:y_dataset.size]]
    return x_train, y_train, x_test, y_test
x_train, y_train, x_test, y_test = split_dataset(x_data,y_data, ratio=0.7)
print(type(x_train[0]))
# here we create some place holder for input and output of the session
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# defining global variables
learning_rate = 0.01
training_epochs = 100
num_coeffs = 5
# adding regularization (for later use)
#reg_lambda = 0.
# defining the coefficients of the polynomial
w = tf.Variable([0.]*num_coeffs, name='parameter')
# defining the model
def model(X, w):
    terms = []
    for i in range(num_coeffs):
        term = tf.multiply(w[i], tf.pow(X, i))
        terms.append(term)
    return tf.add_n(terms)
y_model = model(X,w)
# defining the cost function
cost = tf.reduce_sum(tf.pow(Y-y_model,2))
# defining training method
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# initializing all variables
init = tf.global_variables_initializer()
# running the model
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        _, training_cost = sess.run([train_op, cost], feed_dict={X: x_train, Y: y_train})
        print('Training_cost = {}'.format(training_cost))
        final_cost = sess.run(cost, feed_dict={X: x_test, Y: y_test})
        print('Final cost = {}'.format(training_cost))
The result changed from "None" to this:
Training_cost = 11020688384.0
Final cost = 11020688384.0
Training_cost = 9.952021814670212e+34
Final cost = 9.952021814670212e+34
Training_cost = inf
Final cost = inf
Training_cost = inf
Final cost = inf
Training_cost = inf
Final cost = inf
Training_cost = nan
Final cost = nan
Training_cost = nan
Final cost = nan
I just made everything a float, since multiply accepts only two floats.
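The inf/nan blow-up above is also what you would expect from plain gradient descent on raw polynomial features: with x running up to roughly 52 (weekly buckets), the x**4 term reaches the millions and the gradients explode. One common remedy, offered here as a sketch of my own rather than part of the original answer, is to rescale x before fitting:

# Rescale the bucket indices so the polynomial terms x**0 .. x**4 stay in
# a numerically tame range; divide by the same x_max for any new inputs.
x_max = x_data.max()
x_data = x_data / x_max  # now in [0, 1]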
I just changed the code as follows. The code is working now, but the result is off; it still needs more optimization, I think on the model definition. Thanks to @gcucurull I was able to pull it off.
# importing modules
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import csv
import time
# defining the method for gathering data
# date_idx is the column number of date in the .CSV file
def read(filename, date_idx, date_parse, year, bucket=7):
    # the number of days in the year: 365
    days_in_year = 365
    # a dictionary holding the frequency for each bucket
    freq = {}
    # the year is divided into buckets of 7 days each;
    # initialize each bucket's frequency to zero
    for period in range(0, int(days_in_year / bucket)):
        freq[period] = 0
    # open the file for reading
    with open(filename, 'r') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the first row, since it consists of headers only
        for row in csvreader:
            if row[date_idx] == '':  # if the date is unavailable,
                continue             # there is no need to check the data
            t = time.strptime(row[date_idx], date_parse)  # parse with the input format
            if t.tm_year == year and t.tm_yday < (days_in_year - 1):  # keep only the requested year
                freq[int(t.tm_yday / bucket)] += 1  # count the frequency
    return freq
# call the method to gather the data
freq = read(r'C:\My Files\Programming\Python\TensorFlow\CallCenter\311_Call_Center_Tracking_Data__Archived_.csv',
            0, '%m/%d/%Y', 2014)
# here we convert our dictionary into 2 arrays or lists in python
x_temp =[]
y_temp =[]
for key, value in freq.items():
x_temp.append(key)
y_temp.append(value)
x_data = np.asarray(x_temp)
x_data = x_data.astype(float)
y_data = np.asarray(y_temp)
y_data = y_data.astype(float)
print(x_data)
print(y_data)
# visualizing the data
#plt.scatter(x_data,y_data)
#plt.show()
# splitting data with ratio into 2 group : training and test
def split_dataset(x_dataset, y_dataset, ratio):
    arr = np.arange(x_dataset.size)
    np.random.shuffle(arr)
    num_train = int(ratio * x_dataset.size)
    x_train = x_dataset[arr[0:num_train]]
    y_train = y_dataset[arr[0:num_train]]
    x_test = x_dataset[arr[num_train:x_dataset.size]]
    y_test = y_dataset[arr[num_train:y_dataset.size]]
    return x_train, y_train, x_test, y_test
x_train, y_train, x_test, y_test = split_dataset(x_data,y_data, ratio=0.7)
print(type(x_train[0]))
print(x_train)
# defining global variables
learning_rate = 0.000001
training_epochs = 10000
num_coeffs = 5
# defining the coefficients of the polynomial
w = tf.Variable(
    tf.truncated_normal([num_coeffs, 1], mean=0.0, stddev=1.0, dtype=tf.float64))
# adding bias
b = tf.Variable(tf.zeros(1,dtype=tf.float64))
# predefining the model
def model(x, y):
    # this predicts y based on the given weights
    temp = []
    for i in range(num_coeffs):
        # note: this adds w[i] to x**i; a polynomial term would multiply,
        # i.e. tf.multiply(w[i], tf.pow(x, i))
        temp.append(tf.add(w[i], tf.pow(x, i)))
    prediction = tf.add(tf.reduce_sum(temp), b)
    # this is the cost function
    errors = tf.square(y - prediction)
    return [prediction, errors]
# defining the model
y, cost = model(x_train, y_train)
# defining training method
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# initializing all variables
init = tf.global_variables_initializer()
# running the model
with tf.Session() as sess:
    sess.run(init)
    for epoch in list(range(training_epochs)):
        sess.run(optimizer)
        if epoch % 1000 == 0:
            print('Training cost = \n', sess.run(cost))
    print('---------------------------------------------------------------------------------')
    print('---------------------------------------------------------------------------------')
    y_prediction, cost_prediction = model(x_test, y_test)
    print(sess.run(y_prediction))
    print(y_test[-1])
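One likely reason the result is still off: the updated model adds w[i] to x**i instead of multiplying, and tf.reduce_sum(temp) collapses all samples into a single scalar, so the network cannot actually represent a per-sample polynomial. A hedged sketch of the model function with multiplicative terms, reusing the w, b and num_coeffs defined above (my correction, not from the original post):

def model(x, y):
    # sum_i w[i] * x**i for each sample, plus the bias term
    terms = [tf.multiply(w[i], tf.pow(x, i)) for i in range(num_coeffs)]
    prediction = tf.add_n(terms) + b
    errors = tf.square(y - prediction)
    return prediction, errors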

For loop and Linear regression

Good evening,
I would like to repeat both a subsetting step and a linear regression over the same data frame, once per article code.
import numpy as np
from sklearn import linear_model

# Get the unique codes of the articles
codes = np.unique(data["cod_id"])
#Split
X = data['price']
y = data["quantity"]
accuracy = []
for i in np.nditer(codes):
    data = data.loc[df["cod_id"] == i]
    # Use an if statement to avoid 0-element arrays while splitting (80% train, 20% test)
    if int(len(data)) <= 2:
        X_train = X
        y_train = y
        # Test dataset
        X_test = X
        y_test = y
    else:
        # Split
        t = int(0.8 * len(data))
        # Train dataset
        X_train = X[:t]
        y_train = y[:t]
        # Test dataset
        X_test = X[t:]
        y_test = y[t:]
    # Run the algorithm
    lr = linear_model.LinearRegression()
    lr.fit(X_train, y_train)
    predicted_test_tr = lr.predict(X_test)
    pred_cost = (X_test["price"] * predicted_test_tr).sum()
    real_cost = (X_test["price"] * y_test).sum()
    delta = (pred_cost - real_cost) / real_cost
    accuracy.append(delta)
But it returns a list "accuracy" as long as the "codes" one, with the same value at every position:
print(accuracy)
5.43234
5.43234
5.43234
...
How can I fix this issue?
Thank you
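A likely culprit in the loop above: X and y are built once from the full frame and never re-subset per code, while data is overwritten on every iteration, so each regression effectively sees the same slice and appends the same delta. A hedged sketch of one way to restructure the loop (my reading of the intent, not a verified fix):

accuracy = []
for i in np.nditer(codes):
    # Subset the original frame for this article code on every iteration
    subset = data.loc[data["cod_id"] == i]
    X_sub = subset[['price']]  # keep it 2-D for scikit-learn
    y_sub = subset['quantity']
    if len(subset) <= 2:
        X_train, y_train, X_test, y_test = X_sub, y_sub, X_sub, y_sub
    else:
        t = int(0.8 * len(subset))
        X_train, y_train = X_sub[:t], y_sub[:t]
        X_test, y_test = X_sub[t:], y_sub[t:]
    lr = linear_model.LinearRegression()
    lr.fit(X_train, y_train)
    predicted = lr.predict(X_test)
    pred_cost = (X_test['price'] * predicted).sum()
    real_cost = (X_test['price'] * y_test).sum()
    accuracy.append((pred_cost - real_cost) / real_cost)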
