TfidfVectorizer to DataFrame shape error - scikit-learn

I have some training data that I am trying to calculate the tf-idf values for:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

file_name = '../../data/spam.csv'
spam_data_df = pd.read_csv(file_name)
spam_data_df['target'] = np.where(spam_data_df['target'] == 'spam', 1, 0)

X_train, X_test, y_train, y_test = train_test_split(spam_data_df['text'],
                                                    spam_data_df['target'],
                                                    test_size=0.3,
                                                    random_state=0)

X_train_list = X_train.tolist()
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer_fit = tfidf_vectorizer.fit(X_train_list)
tfidf_vectorizer_vectors = tfidf_vectorizer.transform(X_train_list)
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_vectorizer_dense = tfidf_vectorizer_vectors.todense()
tfidf_dense_list = tfidf_vectorizer_dense.tolist()
df = pd.DataFrame(tfidf_vectorizer_dense,
                  index=feature_names,
                  columns=["tfidf"]).reset_index()
What I am looking for is to construct a table that looks like the following:
token tfidf
Mathews 0.99343
tait 0.02342
edwards 0.45453
anderson 0.21216
Here is an excerpt of the data:
text,target
"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",ham
Ok lar... Joking wif u oni...,ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,spam
U dun say so early hor... U c already then say...,ham
"Nah I don't think he goes to usf, he lives around here though",ham
The error I am seeing is:
ValueError: Shape of passed values is (3900, 7098), indices imply (7098, 1)
Please help

You can do
df = pd.DataFrame(tfidf_vectorizer_dense.T,
                  index=feature_names).reset_index()
# note: the columns=["tfidf"] argument has been removed
It will return something like
token 0 1 2 ... 3899
Mathews 0.99343 0.12421 0.00000 ... 0.48674
tait 0.02342 0.00000 0.00000 ... 0.12421
edwards 0.45453 0.40727 0.09323 ... 0.00000
anderson 0.21216 0.30638 0.44592 ... 0.32154
...
Explanation
You have 3900 texts with 7098 tfidf features.
ValueError: Shape of passed values is (3900, 7098), indices imply (7098, 1)
The error implies that there is a mismatch between
the shape of tfidf_vectorizer_dense - (3900, 7098)
the shape implied by index=feature_names (7098) and columns=["tfidf"] (1) - (7098, 1).
Your goal is to make these agree, i.e. both (7098, 3900).
You can take a transpose, tfidf_vectorizer_dense.T. After the transposition, the data has shape (7098, 3900), which matches the length of the index.
For the columns, you can simply remove columns=["tfidf"]; pandas then creates one column per document (3900 of them, labeled 0 to 3899).
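If what you ultimately want is the two-column token/tfidf table from the question, note that each token has one tf-idf value per document, so you have to aggregate over documents first. Below is a minimal sketch reusing tfidf_vectorizer_vectors, feature_names and the np/pd imports from the question; the choice of the mean as the aggregate is my assumption, max or sum work the same way.
mean_tfidf = tfidf_vectorizer_vectors.mean(axis=0)  # shape (1, 7098): mean tf-idf per token
df = pd.DataFrame({"token": feature_names,
                   "tfidf": np.asarray(mean_tfidf).ravel()})
print(df.sort_values("tfidf", ascending=False).head())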

Related

Inverse Feature Scaling not working while predicting results

# Importing required libraries
import numpy as np
import pandas as pd
# Importing dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1: -1].values
y = dataset.iloc[:, -1].values
y = y.reshape(len(y), 1)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
scy = StandardScaler()
scX = StandardScaler()
X = scX.fit_transform(X)
y = scy.fit_transform(y)
# Training SVR model
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting results from SVR model
# this line is generating error
scy.inverse_transform(regressor.predict(scX.transform([[6.5]])))
I am trying to execute this code to predict values from the model, but after running it I get this error:
ValueError: Expected 2D array, got 1D array instead:
array=[-0.27861589].
Reshape your data either using an array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Even my instructor is using the same code, but his works and mine does not. I am new to machine learning; can anybody tell me what I am doing wrong here?
Thanks for your help.
This is because of the shape of your predictions: scy expects input with shape (-1, 1). Change your last line to this:
scy.inverse_transform([regressor.predict(scX.transform([[6.5]]))])
You can also use this line to predict:
pred = regressor.predict(scX.transform([[6.5]]))
pred = pred.reshape(-1, 1)
scy.inverse_transform(pred)
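For reference, here is a minimal standalone sketch of the shape issue (not from the original answer): regressor.predict returns a 1D array, while StandardScaler.inverse_transform expects the 2D shape the scaler was fitted on.
import numpy as np
pred = np.array([-0.27861589])    # what regressor.predict(...) returns
print(pred.shape)                 # (1,)
print(pred.reshape(-1, 1).shape)  # (1, 1) -- the 2D shape inverse_transform expects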

How to do a linear fit where my variable X is a vector in 3D?

I need to do a linear fit as follows:
Y=a*X+b
I need to find the values of a and b that fit the experimental data.
The first thing that occurred to me was to use the polyfit function, but the problem is that in my data, X is a vector with 3 entries.
This is my code:
p_0=np.array([10,10,10])
p_1=np.array([100,10,10])
p_2=np.array([10,100,10])
p_3=np.array([10,10,100])
# Experimental data:
x=np.array([p_0,p_1,p_2,p_3])
y=np.array([35,60,75,65])
a=np.polyfit(x, y,1)
print(a)
I was expecting a list of lists to be printed, with the coefficient vector a and the intercept b... but I got TypeError("expected 1D vector for x")
Is there any way to do this with numpy or some other library?
sklearn can be used for this:
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression()
p_0=np.array([10,10,10])
p_1=np.array([100,10,10])
p_2=np.array([10,100,10])
p_3=np.array([10,10,100])
# Experimental data:
x=np.array([p_0,p_1,p_2,p_3])
y=np.array([35,60,75,65])
model.fit(X=x, y=y)
print("coeff: ", *model.coef_)
print("intercept: ", model.intercept_)
output:
coeff: 0.27777777777777785 0.44444444444444464 0.33333333333333337
intercept: 24.444444444444436
A few other nice features of the sklearn package:
model.score(x, y)         # 1.0  (R^2 of the fit; note that model.fit returns the estimator, not a score)
model.rank_               # 3
model.predict([[1,2,3]])  # array([26.61111111])
One way to go about this is using numpy.linalg.lstsq:
# Experimental data (p_0 ... p_3 as defined in the question):
x = np.array([p_0, p_1, p_2, p_3])
y = np.array([35, 60, 75, 65])
A = np.column_stack([x, np.ones(len(x))])  # the column of ones models the intercept b
coefs = np.linalg.lstsq(A, y, rcond=None)[0]
print(coefs)
# [ 0.27777778  0.44444444  0.33333333 24.44444444]
Another option is to use LinearRegression from sklearn:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x, y)
print(reg.coef_, reg.intercept_)
# [0.27777778 0.44444444 0.33333333] 24.444444444444443
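As a quick sanity check (a sketch reusing x, y, A, coefs and reg from the snippets above): both approaches solve the same least-squares problem, so their fitted values agree.
print(A @ coefs)       # fitted values from lstsq
print(reg.predict(x))  # fitted values from LinearRegression
# both print [35. 60. 75. 65.] -- four points determine the four parameters exactly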

Fewer than expected purity scores in PCA analysis

I'm trying to plot a line graph of purity scores against the captured variances in PCA. The goal is to plot purity scores against captured variances of 89% and 99% only. In my code, 2 components capture 89% of the variance and 4 components capture 99%.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans   # was missing: used in the loop below
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics          # was missing: used by purity_score

df = pd.read_csv("clustering.csv")
X10_df = df.drop("Class", axis=1)  # feature matrix
Y10_df = df["Class"]               # target vector
X10_df = np.array(X10_df)
Y10_df = np.array(Y10_df)

scaler = StandardScaler()  # standardizing the data
df_std = scaler.fit_transform(X10_df)

pca = PCA()
pca.fit(df_std)

purity = []
n_comp = range(2, 5)
for k in n_comp:
    pca = PCA(n_components=k)
    pca.fit(df_std)
    scores_pca = pca.transform(df_std)
    kmeans_pca = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                        n_init=10, random_state=0)
    pred_y12 = kmeans_pca.fit_predict(scores_pca)
    purity13 = purity_score(Y10_df, pred_y12)
    purity.append(purity13)
The function below calculates the purity score:
def purity_score(y_true, y_pred):
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
However, while I have four variance scores, I only have three purity scores. I expected to have four purity scores so that I could create a plot of the variance vs purity.
Why are there only three purity scores?
Here is the link to my dataset file : https://gofile.io/d/3CgFTi
This is simply because when you use a for loop with a range, the last number in the range is excluded. So range(2, 5) yields 2, 3, 4 and then exits the loop. Please read up on for loops and range in Python.
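A minimal illustration, plus the fix if you want one purity score for each of four component counts (the exact upper bound to use is your call):
print(list(range(2, 5)))  # [2, 3, 4] -- only three iterations
print(list(range(2, 6)))  # [2, 3, 4, 5] -- four iterations
n_comp = range(2, 6)      # collects four purity scores in the loop above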

RandomForestRegressor spitting out 1 prediction only

I am trying to work with the RandomForestRegressor. Using the RandomForestClassifier I was able to get varying outcomes of +/-1. However, using the RandomForestRegressor I only get a constant value when I try to predict.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from pandas_datareader import data
import csv
import statsmodels.api as sm
data = pd.read_csv(r'C:\H\XPA.csv')  # raw string avoids backslash-escape issues
data['pct move']=data['XP MOVE']
# Features construction
data.dropna(inplace=True)
# X is the input variable
X = data[[ 'XPSpread', 'stdev300min']]
# Y is the target or output variable
y = data['pct move']
# Total dataset length
dataset_length = data.shape[0]
# Training dataset length
split = int(dataset_length * 0.75)
# Splitiing the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
clf = RandomForestRegressor(n_estimators=1000)
# Create the model on train dataset
model = clf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
data['strategy_returns'] = data['pct move'].shift(-1) * -model.predict(X)
print(model.predict(X_test))
Output:
[4.05371547e-07 4.05371547e-07 4.05371547e-07 ... 4.05371547e-07
4.05371547e-07 4.05371547e-07]
The output is constant while the y data is this:
0 -0.0002
1 0.0000
2 -0.0002
3 0.0002
4 0.0003
...
29583 0.0014
29584 0.0010
29585 0.0046
29586 0.0018
29587 0.0002
x-data:
XPSpread stdev300min
0 1.0 0.0002
1 1.0 0.0002
2 1.0 0.0002
3 1.0 0.0002
4 1.0 0.0002
... ... ...
29583 6.0 0.0021
29584 6.0 0.0021
29585 19.0 0.0022
29586 9.0 0.0022
29587 30.0 0.0022
Now when I change this problem to a classification problem I do get a relatively good prediction of the sign. However, when I change it to a regression I get a constant outcome.
Any suggestions on how I can improve this?
It may very well be the case that, with only two features, there is not enough information there for a numeric prediction (i.e. regression); while in a "milder" classification setting (predicting just the sign, as you say) you have some success.
The low number of features is not the only possible issue; judging from the few samples you have posted, one can easily see that, for example, your first 5 samples have identical features ([1.0, 0.0002]), while their corresponding y values can be anywhere in [-0.0002, 0.0003] - and the situation is similar for your samples #29583 & 29584. On the other hand, your samples #3 ([1.0, 0.0002]) and #29587 ([30.0, 0.0022]) look very dissimilar, but they end up having the same y value of 0.0002.
If the rest of your dataset has similar characteristics, it may just not be amenable to a decent regression modeling.
Last but not least, if your data are in any way "ordered" along some feature (and they look like they are, although of course I cannot be sure from such a small sample), the situation gets worse. What I suggest is to split your data using train_test_split, instead of doing it manually:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True)
which hopefully, due to shuffling, will result in a more favorable split. You may want to remove duplicate rows from the dataframe before shuffling and splitting (they are never a good idea) - see pandas.DataFrame.drop_duplicates.
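A rough sketch of that suggestion, reusing data, X, y and the column names from the question:
# drop exact duplicate rows before splitting
data = data.drop_duplicates()
X = data[['XPSpread', 'stdev300min']]
y = data['pct move']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True)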

h2o vs scikit learn confusion matrix

Anyone able to match the sklearn confusion matrix to h2o?
They never match....
Doing something similar with Keras produces a perfect match.
But in h2o they are always off. Tried it every which way...
Borrowed some code from:
Any difference between H2O and Scikit-Learn metrics scoring?
# In[30]:
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# In[31]:
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# In[32]:
# Generate predictions on a test set
pred = model.predict(test)
# In[33]:
from sklearn.metrics import roc_auc_score, confusion_matrix
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
#pred_df.head()
# In[36]:
y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))
# In[37]:
print(cm)
0 1
0 1354 961
1 540 2145
# In[38]:
model.model_performance(test).confusion_matrix()
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.353664307031828:
            0       1  Error    Rate
0       964.0  1351.0  0.5836   (1351.0/2315.0)
1       274.0  2411.0  0.102    (274.0/2685.0)
Total  1238.0  3762.0  0.325    (1625.0/5000.0)
# In[39]:
h2o.cluster().shutdown()
This does the trick, thanks for the hunch Vivek. Still not an exact match, but extremely close.
perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
I also ran into the same issue. Here is what I would do to make a fair comparison:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)
Since we train the model using both the training frame and the validation frame, pred['predict'] will use the threshold that maximizes the F1 score on the validation data. To make sure, one can use these lines:
threshold = perf.find_threshold_by_max_metric(metric='F1', valid=True)
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)
To get another confusion matrix from scikit learn:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true, pred_df['predict'])
In my case, I don't understand why I still get slightly different results. For example:
print(cm1)
>> [[3063 176]
[ 94 146]]
print(cm2)
>> [[3063 176]
[ 95 145]]
