How to load a huge dataframe for linear regression? - python-3.x

I have a dataframe with more than 1 million rows and I need to run a linear regression on it in Python 3. I only have 8 GB of RAM, so I can't load the whole dataframe into memory and fit the regression on it.
My code is as follows:
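import numpy as np
import pandas as pd
from pymongo import MongoClient
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def get_data():
    # Connect to the local MongoDB instance and return a cursor over the whole collection
    client = MongoClient(host='127.0.0.1', port=27017)
    database = client['database']
    collection = database['AI']
    query = {}
    return collection.find(query)

df = get_data()
# Only the first 100,000 documents are loaded here
xx = pd.DataFrame(df[0:100000])
xx = xx.iloc[:, 2:]
xx.dropna(inplace=True)
X = np.array(xx.iloc[:, :-1])
y = np.array(xx['price']).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
regr = LinearRegression()
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))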

This may not be possible with LinearRegression, but SGDRegressor has a partial_fit method for handling large datasets: you can feed it the data one chunk at a time.
To quote the documentation:
partial_fit(X, y, sample_weight=None)
Perform one epoch of stochastic gradient descent on given samples.
Internally, this method uses max_iter = 1. Therefore, it is not guaranteed that a minimum of the cost function is reached after calling it once. Matters such as objective convergence and early stopping should be handled by the user.
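A minimal sketch of the chunked approach, assuming the data can be read in batches. The iter_batches generator and its synthetic data are hypothetical stand-ins for chunks pulled from the MongoDB cursor; they are not part of the original question. Note that SGDRegressor expects a 1D y, and that in practice the features usually need to be scaled first.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def iter_batches(n_batches, batch_size, rng):
    """Hypothetical stand-in for reading chunks of rows from the database."""
    true_coef = np.array([2.0, -1.0, 0.5])
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        # Synthetic target: linear signal plus a little noise (1D, as SGDRegressor expects)
        y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=batch_size)
        yield X, y

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Only one batch is held in memory at a time
for X_batch, y_batch in iter_batches(200, 500, rng):
    model.partial_fit(X_batch, y_batch)

# Evaluate on a held-out batch that was never used for fitting
X_test, y_test = next(iter_batches(1, 2000, rng))
r2 = model.score(X_test, y_test)
print(r2)
```

Because each call to partial_fit performs a single pass over the batch it is given, several passes over the full dataset may be needed before the coefficients settle.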

Related

Different results using OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)) compared to KNeighborsClassifier(n_neighbors=2)

I'm implementing a multi-class classifier and get different results when I wrap KNN in OneVsRestClassifier.
I'm unsure why, since I understood that KNN already handles multiclass problems.
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

y = rock_df['Sample_type']
X = rock_df[col_list]

def model_eval(model, X, y):
    """Fit a classifier on X and y with a 0.33 test hold-out, stratified by y, and print the cross-validation accuracy and its standard deviation.
    Inputs:
    model: The ML model to be tested
    X: the cleaned and preprocessed data (normalized, and NaN dealt with)
    y: Target labels for input data X
    """
    # Split train/test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    n = X_test.size
    # Fit model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Scoring
    confusion_matrix(y_test, y_pred)
    balanced_accuracy_score(y_test, y_pred)
    scores = cross_val_score(model, X, y, cv=3)
    mean = scores.mean()
    sd = scores.std()
    print("For {} : {:.1%} accuracy on cross validation, with a standard deviation of {:.1%}".format(model, mean, sd))
    # binomial confidence interval - 95% -- confirm difference with SD
    #interval = 1.96 * sqrt( (mean * (1 - mean)) /n )
    #print('Confidence Interval: {:.3%}'.format(interval) )
    #return balanced_accuracy_score, confusion_matrix

model = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2))
model_eval(model, X, y)
model = KNeighborsClassifier(n_neighbors=2)
model_eval(model, X, y)
First model I get:
For OneVsRestClassifier(estimator=KNeighborsClassifier(n_neighbors=2)) : 78.6% accuracy on cross validation, with a standard deviation of 5.8%
second:
For KNeighborsClassifier(n_neighbors=2) : 83.3% accuracy on cross validation, with a standard deviation of 8.9%
thanks
It is fine that you get different results. KNeighborsClassifier does not use a one-vs-rest strategy: majority voting works natively with three or more classes, so the original implementation has no need for OvR. Trying OneVsRestClassifier can still be useful, but in general the decision boundaries will differ. I compared the decision boundaries on the Iris dataset for KNeighborsClassifier(n_neighbors=5) and OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)):
[Figure: decision boundaries of the two classifiers on the Iris dataset]
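One way to see the effect without a plot is to count how many test points the two models label differently. This is only a sketch on the Iris dataset, with the split parameters borrowed from the question; the number of disagreements depends on the data and may well be zero. With an even number of neighbours such as 2, tied votes are possible, and it is plausible (an assumption, not something verified here) that the two classifiers break those ties differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Same base estimator, once used directly and once wrapped in one-vs-rest
plain = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
ovr = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)).fit(X_train, y_train)

plain_pred = plain.predict(X_test)
ovr_pred = ovr.predict(X_test)

# Count the test points on which the two models disagree
n_disagree = int((plain_pred != ovr_pred).sum())
print(f"{n_disagree} of {len(X_test)} test points get different labels")
```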

Sklearn incorrect support value (number of samples in each class) for classification report

I am fitting an SVM to some data using sklearn. I have 24 samples in total (10 negative, 14 positive).
# Set model
clf = svm.SVC(kernel = 'linear', C = 1)
# Create train, test splits and fit SVM to data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.3, stratify = y)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
I stratified by y to ensure I have an equal number of each class in my test set, which seems to have worked (see image below), however, the classification report says there are no negative samples:
The signature for classification_report is (y_true, y_pred, ...); you've reversed the inputs.
Here's one of the places where using explicit keyword arguments is a good practice.
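A small illustration of the point, using made-up labels rather than the asker's data: passing y_true and y_pred by keyword makes the order explicit, and swapping them changes the support column, because support counts the samples of each class in y_true.

```python
from sklearn.metrics import classification_report

# Made-up labels: 3 negatives and 5 positives, with a model that predicts only positives
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]

# Keyword arguments rule out an accidental swap
report = classification_report(
    y_true=y_true, y_pred=y_pred, output_dict=True, zero_division=0
)

# Deliberately swapped, to reproduce the symptom from the question
swapped = classification_report(
    y_true=y_pred, y_pred=y_true, output_dict=True, zero_division=0
)

print(report["0"]["support"], report["1"]["support"])
print(swapped["0"]["support"], swapped["1"]["support"])
```

In the swapped report the negative class shows a support of zero, which matches the behaviour described in the question.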

TabNetRegressor not working with reshaped data

I am using the PyTorch implementation of TabNet and cannot figure out why I'm still getting this error. I import the data into a dataframe, use this function to get my X and y, and then do my train-test split.
def get_X_y(df):
    '''Take a dataframe and split it into the X and y variables.'''
    X = df.drop(['is_goal'], axis=1)
    y = df.is_goal
    return X, y
X,y = get_X_y(df)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
Then I use this to reshape my y_train
y_train.values.reshape(-1,1)
Then create an instance of the model and try to fit it
reg = TabNetRegressor()
reg.fit(X_train, y_train)
and I get this error
ValueError: Targets should be 2D : (n_samples, n_regression) but y_train.shape=(639912,) given.
Use reshape(-1, 1) for single regression.
I understand why I need to reshape it as this is pretty common, but I cannot understand why it's still giving me this error. I've restarted the kernel in notebooks so I don't think it's persistence memory issues either.
reshape returns a new array and leaves y_train untouched, so you have to re-assign the result:
y_train = y_train.values.reshape(-1,1)
Otherwise, y_train keeps its original 1D shape.
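The difference is easy to check on a tiny Series. The values below are made up for illustration; only the shapes matter.

```python
import pandas as pd

y_train = pd.Series([0, 1, 0, 1])

# The reshaped array is created and immediately discarded
y_train.values.reshape(-1, 1)
shape_before = y_train.shape

# Re-assigning keeps the reshaped 2D array
y_train = y_train.values.reshape(-1, 1)
shape_after = y_train.shape

print(shape_before, shape_after)
```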

Using python 3 how to get co-variance/variance

I have a simple linear regression model and I need to compute the variance and the covariance. How can I calculate the variance and covariance for a linear regression?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# scikit-learn expects a 2D feature matrix, hence the reshape
x = np.array([2, 3, 4, 5]).reshape(-1, 1)
y = np.array([4, 3, 2, 9])
# train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Train the model using the training sets
model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
Try this for the variance and covariance of the output vector:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariance = np.mean((y_predict - np.mean(y_predict)) * (y_test - np.mean(y_test)))
Note: the covariance here measures how the predictions vary together with their true values.
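The population variance of a vector and the population covariance of two vectors can also be obtained from NumPy built-ins. The sketch below uses made-up prediction and target values, and checks the built-ins against the explicit mean-of-products formulas.

```python
import numpy as np

# Made-up values standing in for true targets and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# Explicit population formulas (divide by n)
manual_var = np.mean((y_pred - y_pred.mean()) ** 2)
manual_cov = np.mean((y_pred - y_pred.mean()) * (y_true - y_true.mean()))

# NumPy equivalents: np.var divides by n by default, np.cov needs bias=True to do so
builtin_var = np.var(y_pred)
builtin_cov = np.cov(y_pred, y_true, bias=True)[0, 1]

print(manual_var, builtin_var)
print(manual_cov, builtin_cov)
```

Without bias=True, np.cov divides by n - 1 and returns the sample covariance instead.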

Python - Custom Sampling to get training and testing data

I have a highly unbalanced dataset.
My dataset contains 1450 records and my outputs are binary 0 and 1. Output 0 has 1200 records and the 1 has 250 records.
I am using this piece of code to build my testing and training data set for the model.
from sklearn.model_selection import train_test_split
X = Actual_DataFrame
y = Actual_DataFrame.pop('Attrition')
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20, random_state=42, stratify=y)
What I would like instead is a function that lets me specify the number of records for training, and what percentage of them should come from class '0' and what percentage from class '1'.
So I need a function that takes 2 inputs to create the training data:
Total number of records for the training data,
Number of records that belong to class '1'
This would be a huge help in dealing with biased sampling problems.
You can simply write a function that is very similar to train_test_split from sklearn. The idea is that, from the input parameters train_size and pos_class_size, you can calculate how many positive-class samples and negative-class samples you need.
import pandas as pd

def custom_split(X, y, train_size, pos_class_size, random_state=42):
    # The negative class fills whatever the positive class leaves of train_size
    neg_class_size = train_size - pos_class_size
    pos_df = X[y == 1]
    neg_df = X[y == 0]
    pos_train = pos_df.sample(pos_class_size, random_state=random_state)
    pos_test = pos_df[~pos_df.index.isin(pos_train.index)]
    neg_train = neg_df.sample(neg_class_size, random_state=random_state)
    neg_test = neg_df[~neg_df.index.isin(neg_train.index)]
    # Stack the rows (axis=0), not the columns
    X_train = pd.concat([pos_train, neg_train], axis=0)
    X_test = pd.concat([pos_test, neg_test], axis=0)
    y_train = y.loc[X_train.index]
    y_test = y.loc[X_test.index]
    return X_train, X_test, y_train, y_test
There are methods that are more memory-efficient or run faster. I have not tested this code, but it should work.
At the very least, you should be able to get the idea behind it.
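Since the question asks for a percentage rather than a count, here is a sketch of a variant that takes the share of positive records instead. The helper name split_by_class_fraction, the pos_fraction parameter, and the synthetic data are all hypothetical; the class sizes (1200 and 250) are taken from the question.

```python
import numpy as np
import pandas as pd

def split_by_class_fraction(X, y, train_size, pos_fraction, random_state=42):
    """Hypothetical helper: draw train_size rows, pos_fraction of them from class 1."""
    n_pos = int(round(train_size * pos_fraction))
    n_neg = train_size - n_pos
    # Sample row labels per class, without replacement
    pos_idx = y[y == 1].sample(n=n_pos, random_state=random_state).index
    neg_idx = y[y == 0].sample(n=n_neg, random_state=random_state).index
    train_idx = pos_idx.union(neg_idx)
    # Everything that was not drawn goes to the test set
    test_idx = y.index.difference(train_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]

# Synthetic data with the class sizes from the question: 1200 zeros and 250 ones
rng = np.random.default_rng(0)
y = pd.Series([0] * 1200 + [1] * 250, name="Attrition")
X = pd.DataFrame({"feature": rng.normal(size=len(y))})

X_train, X_test, y_train, y_test = split_by_class_fraction(
    X, y, train_size=400, pos_fraction=0.5
)
print(len(X_train), int(y_train.sum()), len(X_test), int(y_test.sum()))
```

The sketch assumes a unique index and enough records in each class; sample raises an error if more rows are requested than a class contains.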
