train test data split using stratify on two columns in scikit-learn - scikit-learn

I have a dataset that I want to split into train and test sets so that the test set contains data from each data source (specified in column "source") and from each class (specified in column "class"). I read about using the stratify parameter of scikit-learn's train_test_split function, but how can I use it on two columns?

Stratifying on multiple columns has been supported by sklearn's train_test_split since v0.19.0: pass a 2D array (one column per stratification variable) as stratify.
Proof
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(1000000, 10, n_classes=2, n_labels=1)
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, stratify=Y, train_size=.8, random_state=42)
Y.shape
(1000000, 2)
Then you can compare simple column means of resulting stratifications:
train_Y[:,0].mean(), test_Y[:,0].mean()
(0.45422, 0.45422)
train_Y[:,1].mean(), test_Y[:,1].mean()
(0.23472375, 0.234725)
Run a t-test on the equality of means:
from scipy.stats import ttest_ind
ttest_ind(train_Y[:,0],test_Y[:,0])
Ttest_indResult(statistic=0.0, pvalue=1.0)
And finally do the same for conditional means to prove that you indeed achieved what you wanted:
train_Y[train_Y[:,0].astype("bool"),1].mean(), test_Y[test_Y[:,0].astype("bool"),1].mean()
(0.43959149751221877, 0.43958874554180793)
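For the DataFrame case in the question, one option is to build a single combined key from the two columns and stratify on that. The sketch below is only an illustration under assumptions: the toy DataFrame, the "value" column and the strat_key name are made up, while "source" and "class" are the column names from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumed): 2 sources x 2 classes, 25 rows per combination.
df = pd.DataFrame({
    "source": ["a"] * 50 + ["b"] * 50,
    "class": ([0] * 25 + [1] * 25) * 2,
    "value": range(100),
})

# Combine both columns into one key so that every (source, class)
# pair is treated as its own stratum.
strat_key = df["source"].astype(str) + "_" + df["class"].astype(str)

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=strat_key, random_state=42
)

# Each (source, class) combination should appear in the test set.
test_counts = test_df.groupby(["source", "class"]).size()
print(test_counts)
```

Note that every combination needs at least two rows, otherwise train_test_split cannot place it in both the train and the test set and raises an error.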

Related

the right way to make prediction using Spacy word vectors

I'm learning how to convert text into numbers for NLP problems, and in the course I'm following I'm learning about the word vectors provided by the spaCy package. The code works fine for training and evaluation, but I have two problems.
First, making predictions for new sentences: I cannot seem to make it work, and most examples just fit the model and then use the X_test set for evaluation (code below).
Second, the person explaining stated that it's bad (it won't give good results) to use doc.vector instead of doc.vector.values. When trying both I don't see a difference. What is the difference between the two?
The example classifies news titles as fake or real.
import numpy as np
import pandas as pd
import spacy
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the imbalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))
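For the first problem (predicting on new sentences), one possible approach is to send the new text through exactly the same steps as the training data: vectorize, stack into a 2D array, apply the already fitted scaler with transform (not fit_transform), then call predict. The sketch below is an assumption-laden illustration, not the course's code: embed is a hypothetical stand-in for nlp(text).vector so the example runs without a spaCy model, and the texts and labels are invented. The clipping step is also an addition, since MinMaxScaler can produce negative values on unseen data and MultinomialNB expects non-negative features.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler


def embed(text, dim=8):
    """Hypothetical stand-in for nlp(text).vector (deterministic pseudo-vector)."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    return rng.normal(size=dim)


# Invented training data; 0 = Fake, 1 = Real as in the question's mapping.
train_texts = ["aliens built the pyramids", "parliament passes budget",
               "miracle cure found in soda", "central bank holds rates"]
train_labels = np.array([0, 1, 0, 1])

# Same pipeline as in the question: vectors -> 2D array -> scaler -> classifier.
X_train = np.stack([embed(t) for t in train_texts])
scaler = MinMaxScaler()
clf = MultinomialNB()
clf.fit(scaler.fit_transform(X_train), train_labels)

# New sentences: reuse the fitted scaler and classifier, never refit them.
new_texts = ["moon made of cheese, study says", "city council approves new park"]
X_new = np.stack([embed(t) for t in new_texts])
X_new_scaled = np.clip(scaler.transform(X_new), 0, None)
pred = clf.predict(X_new_scaled)
print(pred)
```

With real spaCy vectors, embed(t) would be replaced by nlp(t).vector; the rest of the steps stay the same.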

scikit-learn: most important feature due to SelectKBest() is not the same feature of top node in DecisionTreeClassifier() with unedited data?

I am applying the breast cancer dataset to a decision tree as simple as possible:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
#print(cancer.feature_names)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
print(f"\nscore train: {tree.score(X_train, y_train)}")
print(f"score test : {tree.score(X_test, y_test)}")
>>>
score train: 0.9413145539906104
score test : 0.9370629370629371
export_graphviz(tree, out_file=f"./src/dot/testing/breast_cancer.dot", class_names=['malignant', 'benign'], feature_names=cancer.feature_names, impurity=False, filled=True)
with open("./src/dot/testing/breast_cancer.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Which leads to this graph:
Playing with feature selection, I want to get only the most important feature. In my understanding it should be the feature at the root node, no? Unfortunately it's not: SelectKBest returns "worst concave points". Here is what I did to get the most important feature:
select = SelectKBest(k=1)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print("X_train.shape : {}".format(X_train.shape))
print("X_train_selected.shape: {}\n".format(X_train_selected.shape))
>>>
X_train.shape : (426, 30)
X_train_selected.shape: (426, 1)
mask = select.get_support()
# plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# plt.xlabel("Sample index")
print("most important features:")
for selected, feature in zip(mask, cancer.feature_names):
    if selected:
        print(feature)
>>>
most important features:
worst concave points
I guess I am getting something wrong here. Could somebody clarify this? Any hint? Thanks
The most important feature according to a univariate test is not necessarily the one used to make the first split. SelectKBest with its default score_func (f_classif) ranks each feature by its ANOVA F-value against the target, whereas sklearn.tree.DecisionTreeClassifier picks, at each node, the feature and threshold that most reduce impurity (Gini by default, entropy if criterion="entropy"). The two criteria differ, so there is no reason for both methods to reach the same conclusions in the same order. Even the same feature will reduce impurity differently at different stages of a tree classifier.
As a side note, trees do not always consider all features when splitting a node. Take a look at the max_features parameter. With the default max_features=None every feature is considered at each split, but with a smaller value the tree evaluates only a random subset, so depending on your random_state and max_features hyperparameters it may or may not have considered "worst concave points" when making the first split.
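To see the two criteria side by side, the following sketch fits both on the same split as in the question and prints the feature each one prefers. It assumes SelectKBest's default score function (f_classif) and reads the root split from the fitted tree's tree_.feature array; the variable names kbest_feature and root_feature are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)

# Univariate ranking: ANOVA F-value of each feature against the target.
select = SelectKBest(score_func=f_classif, k=1).fit(X_train, y_train)
kbest_feature = cancer.feature_names[select.get_support()][0]

# Tree root: the feature whose best threshold reduces impurity the most.
tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X_train, y_train)
root_feature = cancer.feature_names[tree.tree_.feature[0]]

print("SelectKBest (f_classif):", kbest_feature)
print("Tree root split        :", root_feature)
```

The F-value measures how well the class means are separated along one feature, while the tree only cares about how cleanly a single threshold partitions the samples, so the two can legitimately disagree.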

how do I standardize test dataset using StandardScaler in PySpark?

I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the model's features, produced by applying the VectorAssembler class to 3 separate columns.
When I try to transform the test data using the StandardScaler as below, I get an error saying it doesn't have the transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler on the training data only once and use those parameters to transform the test set, so it is not accurate to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?
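Here is the correct syntax to use the scaler. StandardScaler is an estimator: calling fit returns a fitted model (a StandardScalerModel), and only that model has a transform method. You need to call transform on the fitted model, not on the scaler itself, and you can reuse the same fitted model for both the training and the test data.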
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)

How to convert scalar array to 2d array?

I am new to machine learning and facing some issues in converting scalar array to 2d array.
I am trying to implement polynomial regression in spyder. Here is my code, Please help!
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
# Predicting a new result with Polynomial Regression
lin_reg_2.predict(poly_reg.fit_transform(6.5))
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
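This error depends on the scikit-learn version, not on whether you run the code in Jupyter or Spyder. Older scikit-learn releases accepted a scalar or 1D input to predict, while newer releases require a 2D array of shape (n_samples, n_features).
To resolve it, turn the value into a 2D NumPy array using the code below.
lin_reg.predict(np.array(6.5).reshape(1,-1))
lin_reg_2.predict(poly_reg.fit_transform(np.array(6.5).reshape(1,-1)))
On an older scikit-learn release the original calls may still work as you expected:
lin_reg.predict(6.5)
lin_reg_2.predict(poly_reg.fit_transform(6.5))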
The issue with your code is lin_reg.predict(6.5).
If you read the error message, it says that the model requires a 2D array, whereas 6.5 is a scalar.
Why? Your X data is 2D, so anything that you want to predict with your model must also have a 2D shape.
This can be achieved either with .reshape(-1,1), which creates a column vector (many samples of a single feature), or with .reshape(1,-1) if you have a single sample.
The thing to remember is that, in order to predict, you need to prepare the new data in the same way as the original training data.
If you need any more info, let me know.
You have to give the input as a 2D array (a list of rows), so try this:
lin_reg.predict([[6.5]])
lin_reg_2.predict(poly_reg.fit_transform([[6.5]]))
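The shape requirement can be checked on a small self-contained example. This is a sketch under assumptions: the synthetic X and y stand in for Position_Salaries.csv, and x_new and y_new are names introduced here. It also calls transform rather than fit_transform on the new value, which is the usual pattern once the transformer has been fitted on the training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in (assumed) for the salary data: one feature, 10 samples.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # shape (10, 1)
y = X.ravel() ** 2

# Fit the polynomial expansion and the regression on the training data.
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression().fit(X_poly, y)

# A single new sample must be 2D: one row, one column.
x_new = np.array([[6.5]])  # same as np.array(6.5).reshape(1, -1)
y_new = lin_reg_2.predict(poly_reg.transform(x_new))
print(x_new.shape, y_new.shape)
```

In the question's code, the line poly_reg.fit(X_poly, y) refits the transformer on the already expanded matrix, so poly_reg.transform on a single-feature input would only work once that line is dropped.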

Why am I getting a score of 0.0 when finding the score of test data using Gaussian NB classifier?

I have two different data sets: one for training my classifier and the other for testing. Both data sets are text files with two columns separated by a ",". The first column (numbers) is the independent variable and the second column (group) is the dependent variable.
Training data set
(just few lines for example. there are no empty lines between each row):
EMI3776438,1
EMI3776438,1
EMI3669492,1
EMI3752004,1
Testing data set
(as you can see, I have picked data from the training data to be sure that the score can't be zero)
EMI3776438,1
Code in Python 3.6:
# #all the import statements have been ignored to keep the code short
# #loading the training data set
training_file_path=r'C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\modified_columns.txt'
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
training_file_data = pandas.read_table(training_file_path,
                                       header=None,
                                       names=['numbers','group'],
                                       sep=',')
training_file_data = training_file_data.apply(le.fit_transform)
features = ['numbers']
x = training_file_data[features]
y = training_file_data["group"]
from sklearn.model_selection import train_test_split
training_x, testing_x, training_y, testing_y = train_test_split(x, y,
                                                                random_state=0,
                                                                test_size=0.1)
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(training_x, training_y)
# #loading the testing data
testing_final_path=r"C:\Users\yyy\Desktop\my files\python\Machine learning\Carepack\testing_final.txt"
testing_sample_data = pandas.read_table(testing_final_path,
                                        sep=',',
                                        header=None,
                                        names=['numbers','group'])
testing_sample_data = testing_sample_data.apply(le.fit_transform)
category = ["numbers"]
testing_sample_data_x = testing_sample_data[category]
# #finding the score of the test data
print(gnb.score(testing_sample_data_x, testing_sample_data["group"]))
First, the data samples above don't show how many classes there are. You need to describe that in more detail.
Secondly, you are calling le.fit_transform again on the test data, which discards the mapping from strings to numbers learned on the training samples. The LabelEncoder le starts encoding the test data from scratch, so its codes will not match how the training data was mapped. The input to GaussianNB is therefore incorrect, and so are the results.
Change that to:
testing_sample_data = testing_sample_data.apply(le.transform)
UPDATE:
I'm sorry, I overlooked the fact that you had two columns in your data. LabelEncoder only works on a single column of data. To make it work on multiple pandas columns at once, look at the answers to the following question:
Label encoding across multiple columns in scikit-learn
If you are using scikit-learn 0.20 or later, or can update to it, you do not need any such hacks and can use the OrdinalEncoder directly:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
training_file_data = enc.fit_transform(training_file_data)
And during testing:
testing_sample_data = enc.transform(testing_sample_data)
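The effect of refitting the encoder on the test data can be shown on the sample rows from the question. This is a minimal sketch that assumes only the "numbers" column is encoded; the variable names train, test, test_codes_ok and test_codes_refit are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample rows from the question (only the "numbers" column).
train = pd.DataFrame(
    {"numbers": ["EMI3776438", "EMI3776438", "EMI3669492", "EMI3752004"]}
)
test = pd.DataFrame({"numbers": ["EMI3776438"]})

# Fit on the training data only, then reuse the fitted encoder.
enc = OrdinalEncoder()
train_codes = enc.fit_transform(train)
test_codes_ok = enc.transform(test)

# Refitting on the test data builds a new, unrelated mapping.
test_codes_refit = OrdinalEncoder().fit_transform(test)

print("code with training mapping:", test_codes_ok[0, 0])
print("code after refitting      :", test_codes_refit[0, 0])
```

The same string receives a different code once the encoder is refitted, so the classifier sees a value it never associated with that group during training. Note that enc.transform raises an error for values that did not occur in the training data unless the encoder is configured to handle unknown categories.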
