Multiclass classification with text and numerical data - scikit-learn

For a university assignment we got the task to predict the author's name based on a title and abstract and year. We received one train set and one test set which does not include the author's names.
To start we left out the 'year' column, but our teacher told us that it would be of importance to include that. The title and abstract column have been combined and we used hashing vectorizer and tfidf transformer to convert the text data to numerical. We had then applied hyperparameter tuning followed by the SGDClassifier from SKlearn.
Our problem is now that we want to use two independent variables (title/abstract & year), however after transforming and vectorizing the title/abstract we are unable to add the year column to train the data.
Do you have any tips to combine those?
The tfidf transformer leads to a numpy array which is shown vertically, and we could append the year but that would lead to an enormous dataset.
The picture shows the data before vectorization and transformation
My question now is how vectorize the 'labels' and 'year' column together by using the HashingVectorizer and Tfidf Transformer to use the SGDClassifier.
X = merged_data[['labels', 'year']]
label_vectorizer = HashingVectorizer(ngram_range=(1,2), n_features=2**18)
labels_vect = label_vectorizer.transform(X)
transformer = TfidfTransformer()
X = transformer.fit_transform(labels_vect)
y = merged_data['authorId']
print(X)
Output:
(0, 100016) -1.0
(1, 137240) 1.0
Or:
X = merged_data['labels']
label_vectorizer = HashingVectorizer(ngram_range=(1,2), n_features=2**18)
labels_vect = label_vectorizer.transform(X)
transformer = TfidfTransformer()
X = transformer.fit_transform(labels_vect)
y = merged_data['authorId']
print(X)
Output:
(0, 258452) 0.06262005509267683
(0, 258265) 0.09434206300878606
(0, 255798) 0.0942876246256502
(0, 254434) -0.06461787945275276
(0, 245917) 0.06743365473981061
(0, 245256) -0.06461787945275276
...
(12128, 84695) -0.14448685821655255
(12128, 73740) -0.09330855785758727
(12128, 68673) -0.09849506054591504
(12128, 64492) -0.12171733351456865
(12128, 62658) -0.13655467338246333
And then it would be needed to append the year? I am really not sure how to approach this.

Related

after using an Embedding layer in Keras why do I get: Input to reshape is a tensor with 2 values, but the requested shape has 4 [Op:Reshape]

So I created below sample polars data frame. I want to use Keras's normalisation and Embedding layers to preprocess my data. sum_cost and sum_gmv are my numerical columns and I normalize each individual column by using normalization layer.category is my categorical column and I want to use embedding layer to get embedding vectors for each category.
import polars as pl
import tensorflow as tf
df = pl.DataFrame(
{'sum_cost':[1.,4.,7.,3.,2.],
'category':[311,210,450,311,567],
'sum_gmv':[-4.,-2.,0.,2.,4.],
}
)
numeric_col = ['sum_cost','sum_gmv']
categorical_col = ['category']
all_inputs = []
inputs = {}
for col in numeric_col + categorical_col:
if col in numeric_col:
inputs[col] = tf.keras.Input(shape=(), name=col,dtype=tf.float32)
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(df[col].to_numpy())
all_inputs.append(normalizer(inputs[col])[:,tf.newaxis])
elif col in categorical_col:
inputs[col] = tf.keras.Input(shape=(), name=col,dtype=tf.int32)
embedding = tf.keras.layers.Embedding(
567 + 1,
output_dim=2,
name='cat_embedding')(inputs[col])
em_model = tf.keras.layers.Reshape((2,))(embedding)
all_inputs.append(em_model)
outputs = tf.keras.layers.Concatenate(axis=1)(all_inputs)
model=tf.keras.Model([inputs[col] for col in numeric_col+categorical_col],outputs)
When I want to test my preprocessing model by using a single data point model(dict(df.to_pandas().iloc[1,:])) I receive the following error.
on the other hand when I pass this input:
model({'sum_cost':1.,'sum_gmv':1,'category':np.array([[1.]])})
it works well. I dont understand why i should provide an array for category but scalar for numerical columns. In the original dataset they are all scalar. Also I dont deifne a shape for my Input tensors. Why does this happening and how can I solve it?
Thanks!

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfIdfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
vs = analyzer.polarity_scores(s)
if vs['compound'] >= 0.5:
return 1
elif vs['compound'] <= -0.5:
return -1
else:
return 0
df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity sake, the data has already been labeled as either class '0' or '1', but I am training the model for the classification of new instances that have not been labeled yet. In short, the data I'm working with has already been labeled. They are in the classification column.
Then in my train test split method do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
('bow', CountVectorizer(analyzer=clean_review)),
('tfidf', TfidfTransformer()),
('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
no_punc = [c for c in sentence if c not in string.punctuation]
no_punc = ''.join(no_punc)
no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
stemmed_words = [ps.stem(w) for w in no_stopwords]
return stemmed_words
Where stopwords_set is the collection of english stopwords from the nltk library, and ps is from the PortStemmer module in the nltk library (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfIdfTransformer applied to the text, but also retain the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure on what the error is due to since I don't know what the size of your dataframe should be. I would need more information. On which line is the error thrown?
Regarding the fact that you want to retain the sentiment column, you could apply CountVectorizer and TfIdfTransformer (by the way you could skip a step and directly apply TfidfVectorizer) only on the text data and then have another transformer in the pipeline which adds the original sentiment column before you feed the dataframe to the classifier.

sklearn - label points of PCA

I am generating a PCA which uses scikitlearn, numpy and matplotlib. I want to know how to label each point (row in my data). I found "annotate" in matplotlib, but this seems to be for labeling specific coordinates, or just putting text on arbitrary points by the order they appear. I'm trying to abstract away from this but struggling due to the PCA sections that appear before the matplot stuff. Is there a way I can do this with sklearn, while I'm still generating the plot, so I don't lose its connection to the row I got it from?
Here's my code:
# Create a Randomized PCA model that takes two components
randomized_pca = decomposition.RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(x)
# Create a regular PCA model
pca = decomposition.PCA(n_components=2)
# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(x)
# Inspect the shape
reduced_data_pca.shape
# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)
def rand_jitter(arr):
stdev = .01*(max(arr)-min(arr))
return arr + np.random.randn(len(arr)) * stdev
colors = ['red', 'blue']
for i in range(len(colors)):
w = reduced_data_pca[:, 0][y == i]
z = reduced_data_pca[:, 1][y == i]
plt.scatter(w, z, c=colors[i])
targ_names = ["Negative", "Positive"]
plt.legend(targ_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
PCA is a projection, not a clustering (you tagged this as clustering).
There is no concept of a label in PCA.
You can draw texts onto a scatterplot, but usually it becomes too crowded. You can find answers to this on stackoverflow already.

How do I use Theanets LSTM RNN's on my time series data?

I have a simple dataframe consisting of one column. In that column are 10320 observations (numerical). I'm simulating Time-Series data by inserting the data into a plot with a window of 200 observations each. Here is the code for plotting.
import matplotlib.pyplot as plt
from IPython import display
fig_size = plt.rcParams["figure.figsize"]
import time
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
fig, axes = plt.subplots(1,1, figsize=(19,5))
df = dframe.set_index(arange(0,len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe)/window)
i = 0
dframe = dframe.set_index(arange(0,len(dframe)))
while i< iterations:
frm = window*i
if i == iterations:
to = len(dframe)
else:
to = frm+window
df = dframe[frm : to]
if len(df) > 100:
df = df.set_index(arange(0,len(df)))
plt.gca().cla()
plt.plot(df.index, df[0])
plt.axhline(y=std, xmin=0, xmax=len(df[0]),c='gray',linestyle='--',lw = 2, hold=None)
plt.axhline(y=-std , xmin=0, xmax=len(df[0]),c='gray',linestyle='--', lw = 2, hold=None)
plt.ylim(min(dframe[0])- 0.5 , max(dframe[0]) )
plt.xlim(-50,window+50)
display.clear_output(wait=True)
display.display(plt.gcf())
canvas = FigureCanvas(fig)
canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply theanets RNN LSTM to the data to detect anomalies unsupervised. Because I am doing it unsupervised I don't think that I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far and have been googling for about 2 hours. Just hoping that you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold that, if the error is too large, the values will be identified as anomalous. If you need more information please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are build of interconnected LSTM Blocks whose training is done via BackPropogation Through Time.
Classical anomaly detection using time series required prediction of time series output in future (at one or more points) and finding error on these points with true values. Prediction Error above a threshold will reflect and amomly
SOLUTION
Having said this
You've to train network so you need training sets and test sets both
Use N inputs to predict M outputs (decide upon N and M with experimentation - values for which training error is low)
Scroll a window of (N+M) elements in input data and use this data array of (N+M) items also termed as frame to train or test network.
Typically we use 90% of starting series for training and 10% for testing.
This scheme will fail as if training is not proper there will be false prediction errors which are not-anomaly. So make sure to provide enough training, and most important shuffle training frames and consider all variations.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

Resources