PySpark average TFIDF features by group - apache-spark

I have a collection of documents, each belonging to a specific page. I've computed the TFIDF scores across each document, but what I want to do is average the TFIDF score for each page based on its documents.
The desired output is an N (page) x M (vocabulary) matrix. How would I go about doing this in Spark/PySpark?
from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, StopWordsRemover
from pyspark.ml import Pipeline
tokenizer = Tokenizer(inputCol="message", outputCol="tokens")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
countVec = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features", binary=True)
idf = IDF(inputCol=countVec.getOutputCol(), outputCol="idffeatures")
pipeline = Pipeline(stages=[tokenizer, remover, countVec, idf])
model = pipeline.fit(sample_results)
prediction = model.transform(sample_results)
Output from the pipeline is in the format below. One row per document.
(466,[10,19,24,37,46,61,62,63,66,67,68,86,89,105,107,129,168,217,219,289,310,325,377,381,396,398,411,420,423],[1.6486586255873816,1.6486586255873816,1.8718021769015913,1.8718021769015913,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367])

I came up with the answer below. It works, but I'm not sure it's the most efficient approach. I based it off this post.
import numpy as np
from scipy.sparse import csr_matrix, vstack

def as_matrix(vec):
    # convert a Spark SparseVector into a 1 x vocabulary-size CSR matrix
    data, indices = vec.values, vec.indices
    shape = 1, vec.size
    return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)

def as_array(m):
    # stack the per-document rows and average them into a single row
    v = vstack(m).mean(axis=0)
    return v
mats = prediction.rdd.map(lambda x: (x['page_name'], as_matrix(x['idffeatures'])))
final = mats.groupByKey().mapValues(as_array).cache()
I stack the result into a single 86 x 10000 numpy matrix. Everything runs, but rather slowly.
collected = final.collect()
labels = [l[0] for l in collected]
tf_matrix = np.vstack([r[1] for r in collected])
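If you are on Spark 2.4 or later, a possibly more efficient alternative is to stay in DataFrames and let pyspark.ml.stat.Summarizer average the vectors per page, avoiding the groupByKey and the Python-side SciPy work. A minimal sketch, reusing the prediction DataFrame and page_name column from above:

import numpy as np
from pyspark.ml.stat import Summarizer

# average the idf vectors per page directly in Spark
page_means = (prediction
    .groupBy("page_name")
    .agg(Summarizer.mean(prediction["idffeatures"]).alias("mean_idf")))

# collect into an N (pages) x M (vocabulary) numpy matrix
rows = page_means.collect()
page_labels = [r["page_name"] for r in rows]
mean_matrix = np.vstack([r["mean_idf"].toArray() for r in rows])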

Related

Sklearn's TfidfTransformer(use_idf=False, norm=None) returns the same output as CountVectorizer()

I am trying to understand the code behind TfidfTransformer(). From sklearn's documentation, I can get the term frequencies by setting use_idf=False. But when I checked the code on GitHub, I noticed that TfidfTransformer() returns the same values as CountVectorizer() when normalization is not used, which is just the count of each term.
The code that is supposed to calculate term frequencies:
def transform(self, X, copy=True):
    """Transform a count matrix to a tf or tf-idf representation.

    Parameters
    ----------
    X : sparse matrix of (n_samples, n_features)
        A matrix of term/token counts.
    copy : bool, default=True
        Whether to copy X and operate on the copy or perform in-place
        operations.

    Returns
    -------
    vectors : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
    """
    X = self._validate_data(
        X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy=copy, reset=False
    )
    if not sp.issparse(X):
        X = sp.csr_matrix(X, dtype=np.float64)

    if self.sublinear_tf:
        np.log(X.data, X.data)
        X.data += 1

    if self.use_idf:
        # idf being a property, the automatic attributes detection
        # does not work as usual and we need to specify the attribute
        # name:
        check_is_fitted(self, attributes=["idf_"], msg="idf vector is not fitted")

        # *= doesn't work
        X = X * self._idf_diag

    if self.norm is not None:
        X = normalize(X, norm=self.norm, copy=False)

    return X
To investigate further, I ran both classes and compared the outputs of CountVectorizer and TfidfTransformer using the following code; the outputs are equal.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = fetch_20newsgroups(
    shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'),
    subset='train', categories=['sci.electronics', 'rec.autos', 'rec.sport.hockey'])
train_documents = dataset.data
vectorizer = CountVectorizer()
train_documents_mat = vectorizer.fit_transform(train_documents)
tf_vectorizer = TfidfTransformer(use_idf=False, norm=None)
train_documents_mat_2 = tf_vectorizer.fit_transform(train_documents_mat)
equal = np.array_equal(
    train_documents_mat.toarray(),
    train_documents_mat_2.toarray()
)
print(equal)
I am trying to get the term frequencies for my documents rather than just the raw counts. Any ideas why sklearn implements TF-IDF this way?
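For context, the "tf" in TfidfTransformer is the raw count (optionally log-scaled via sublinear_tf), so with use_idf=False and norm=None the counts come back unchanged. If what you want is relative term frequencies (counts divided by document length), one option is to L1-normalize the count matrix yourself. A minimal sketch, reusing train_documents_mat from the code above:

from sklearn.preprocessing import normalize

# divide each row by its total token count to get relative term frequencies
term_frequencies = normalize(train_documents_mat, norm='l1', axis=1)

# each non-empty document's frequencies now sum to 1
print(term_frequencies[0].sum())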

Ranking all features in order using scikit-learn

I am trying to sort all features in order using scikit-learn f_regression and SelectKBest. The method works well if the number of ranked features k is smaller than the total number of features n. However, if I set k = n then the output from SelectKBest will be in the same order as the original feature array. How can I sort all features in order according to their importance?
The code is below:
from sklearn.feature_selection import SelectKBest, f_regression
n = len(training_features.columns)
selector = SelectKBest(f_regression, k = n)
selector.fit(training_features.values, training_targets.values[:, 0])
k_best_features = list(training_features.columns[selector.get_support(indices = True)])
I ended up using this solution:
import numpy as np
from sklearn.feature_selection import f_regression
k = 10 # number of best features to obtain
scores, _ = f_regression(training_features.values, training_targets.values[:, 0])
indices = np.argsort(scores)[::-1]
k_best_features = list(training_features.columns.values[indices[0:k]])
I think the sorted features, with respect to the scores given by f_regression, can also be generated using:
import pandas as pd

pd.DataFrame(dict(feature_names=training_features.columns, scores=selector.scores_)) \
    .sort_values('scores', ascending=False)

Pad data using tf.data.Dataset

I have to use tf.data.Dataset to create an input pipeline for an RNN model in TensorFlow. I am providing basic code below; I need to pad the data in each batch with a pad token and use it for further manipulation.
import pandas as pd
import numpy as np
import tensorflow as tf
import functools

total_data_size = 10000
embedding_dimension = 25
max_len = 17

varying_length = np.random.randint(max_len, size=(total_data_size))  # varying lengths
X = np.array([np.random.randint(1000, size=(value)).tolist() for index, value in enumerate(varying_length)])  # data of varying length
Y = np.random.randint(2, size=(total_data_size)).astype(np.int32)  # binary target
embedding = np.random.uniform(-1, 1, (1000, embedding_dimension))  # word embedding

def gen():
    for index in range(len(X)):
        yield X[index], Y[index]

dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32))
dataset = dataset.batch(batch_size=25)
padded_shapes = (tf.TensorShape([None]))  # sentence of unknown size
padding_values = (tf.constant(-111))  # the value with which pad indices need to be filled
dataset = (dataset
    .padded_batch(25, padded_shapes=padded_shapes, padding_values=padding_values)
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
print(sess.run(iter2.get_next()))
I hope the code is self-explanatory with comments, but I am getting the following error:
InvalidArgumentError (see above for traceback): Cannot batch tensors with different shapes in component 0. First element had shape [11] and element 1 had shape [12].
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?], [?]], output_types=[DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
I believe that since your generator yields two outputs, your padded_shapes and padded_values tuples must have a length of two. For me, this works:
dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32))
dataset = dataset.batch(batch_size=25)
padded_shapes = (tf.TensorShape([None]), tf.TensorShape([None])) # sentence of unknown size
padding_values = (tf.constant(-111), tf.constant(-111)) # the value with which pad index needs to be filled
dataset = (dataset
.padded_batch(25, padded_shapes=padded_shapes, padding_values=padding_values)
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
Finally got the answer. The issue was that for the second padded shape, instead of TensorShape([None]) we should provide [], because the second item returned by the generator is a scalar. If using TensorShape([None]), make sure the generator returns a vector.
import pandas as pd
import numpy as np
import tensorflow as tf
import functools

total_data_size = 10000
embedding_dimension = 25
max_len = 17

varying_length = np.random.randint(max_len, size=(total_data_size))  # varying lengths
X = np.array([np.random.randint(1000, size=(value)).tolist() for index, value in enumerate(varying_length)])  # data of varying length
Y = np.random.randint(2, size=(total_data_size)).astype(np.int32)  # binary target
embedding = np.random.uniform(-1, 1, (1000, embedding_dimension))  # word embedding

def gen():
    for index in range(len(X)):
        yield X[index], Y[index]

dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32), (tf.TensorShape([None]), []))
padded_shapes = (tf.TensorShape([None]), [])  # sentence of unknown size, scalar label
dataset = (dataset
    .padded_batch(25, padded_shapes=padded_shapes, padding_values=(-111, 0))
)
iter2 = dataset.make_initializable_iterator()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(iter2.initializer)
sess.run(iter2.get_next())
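For reference, on TensorFlow 2.x the same idea can be written with eager execution and an output_signature instead of the session and initializable-iterator API; this is a rough sketch under that assumption, not a drop-in replacement for the code above:

import numpy as np
import tensorflow as tf

max_len = 17
varying_length = np.random.randint(1, max_len, size=100)
X = [np.random.randint(1000, size=n).tolist() for n in varying_length]
Y = np.random.randint(2, size=100).astype(np.int32)

def gen():
    for x, y in zip(X, Y):
        yield x, y

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length sentence
        tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar label
    ),
)

# pad sentences with -111; scalar labels need no padding
dataset = dataset.padded_batch(25, padded_shapes=([None], []), padding_values=(-111, 0))

for sentences, labels in dataset.take(1):
    print(sentences.shape, labels.shape)  # (25, longest_in_batch) and (25,)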

Make CountVectorizer faster for Large dataset

Hello, I want to cluster movies based on their title only. My function works really well for my data, but I have a big problem: my sample is large (150,000 movies) and it's very slow; it actually took 3 days to cluster all the movies.
Process:
Sort movie titles based on their length.
Transform the movies with CountVectorizer and calculate the similarity for each one (for every clustered movie I fit the vectorizer from scratch and then transform the target movie).
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

fitted_vectorizer = {}  # cache of vectorizers fitted per clustered movie

def product_similarity(clustered_movie, target_movie):
    '''
    Calculates the title distance of 2 movies based on title
    '''
    # fitted_vectorizer is a dictionary of fitted movies; if we have not
    # fitted a vectorizer for this movie yet, fit one and save it to the dictionary
    if clustered_movie in fitted_vectorizer:
        vectorizer = fitted_vectorizer[clustered_movie]
        a = vectorizer.transform([clustered_movie]).toarray()
        b = vectorizer.transform([target_movie]).toarray()
        similarity = cosine_similarity(a, b)
    else:
        clustered_movie = re.sub(r"[0-9]|[^\w']|[_]", " ", clustered_movie)
        vectorizer = CountVectorizer(stop_words=None)
        vectorizer = vectorizer.fit([clustered_movie])
        fitted_vectorizer[clustered_movie] = vectorizer
        a = vectorizer.transform([clustered_movie]).toarray()
        b = vectorizer.transform([target_movie]).toarray()
        similarity = cosine_similarity(a, b)
    return similarity[0][0]
Fit the CountVectorizer one time, on all titles. Save the model. Then transform using the fitted model.
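A minimal sketch of that approach, where titles is assumed to be the full list of (pre-cleaned) movie titles: fit one CountVectorizer on everything, transform all titles at once, and compute cosine similarities on the resulting sparse matrix instead of refitting and transforming per pair.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# titles: assumed list of all (pre-cleaned) movie titles
vectorizer = CountVectorizer(stop_words=None)
title_matrix = vectorizer.fit_transform(titles)  # fit once, transform all titles

def title_similarity(i, j):
    # cosine similarity between titles i and j using the shared fitted model
    return cosine_similarity(title_matrix[i], title_matrix[j])[0][0]

# or score one title against every other title in a single vectorized call
sims_to_first = cosine_similarity(title_matrix[0], title_matrix).ravel()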

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on an email (spam / not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets, and then calculated the accuracy, using the features that are available with the sklearn GaussianNB classifier.
Now I want to be able to use this classifier to predict "labels" for new emails, i.e. whether they are spam or not.
For example, say I have an email. I want to feed it to my classifier and get a prediction of whether it is spam or not. How can I achieve this? Please help.
Code for the classifier file:
#!/usr/bin/python
import sys
from time import time
import logging

# Display progress logs on stdout
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess

### features_train and features_test are the features for the training and
### testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))

## Printing metrics for training and testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))

## Calculating classifier performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names=target_names, labels=labels))

# How to predict the label of a new text?
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)

path = os.path.dirname(os.path.abspath(__file__))

### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand.
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))

### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)  #.toarray()

## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced

def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")

"""
Starter code to process the texts of the accurate and inaccurate categories,
extract the features, and get the documents ready for classification.
The list of all the texts from the accurate category is in accurate_files;
likewise, the texts of the inaccurate category are in inaccurate_files.
The data is stored in lists and packed away in pickle files at the end.
"""

accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []

### temp_counter is a way to speed up development -- there are thousands of
### lines of accurate and inaccurate text, so running over all of them can
### take a long time; temp_counter helps you only look at the first 200 lines
### so you can iterate on your modifications quicker
temp_counter = 0

for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at the first 200 texts when developing; once everything
        ### is working, remove this limit to run over the full dataset
        temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened file
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if the text is accurate, and a 1 if inaccurate
                if (name == "accurate"):
                    label_data.append("0")
                elif (name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()

print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also, I want to know whether I can train the classifier incrementally, i.e. retrain an existing model with newer data to refine it over time?
I would be really glad if someone could help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s). A sketch of these steps follows below.
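A rough sketch of steps 1-4, assuming the fitted vectorizer, selector, and clf objects from the code above are accessible in one place (in your current layout they live in separate modules, so you may need to return them from preprocess() as well); the raw-text cleaning from the dataset script is omitted here:

# new, unseen email(s) as raw text
new_emails = ["You won a lottery at UK lottery commission. Reply to claim it"]

# 2) vectorize with the TfidfVectorizer that was fitted on the training data
new_features = vectorizer.transform(new_emails)

# 3) reduce dimensionality with the selector that was fitted on the training data
new_features_reduced = selector.transform(new_features).toarray()

# 4) predict with the trained classifier
pred_new = clf.predict(new_features_reduced)
print(pred_new)  # array of predicted labels, e.g. ['0'] or ['1']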
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.
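On the incremental-training point, GaussianNB (like the other naive Bayes classifiers in scikit-learn) also exposes partial_fit, so another option is to update the existing model with new batches instead of refitting from scratch, provided the new data goes through the same already-fitted vectorizer and selector (otherwise the feature columns would no longer line up). A rough sketch, where new_features_reduced and new_labels are hypothetical placeholders for a newly processed batch:

import numpy as np

# the set of classes must cover everything the model will ever see
classes = np.array(['0', '1'])

# incrementally update the already-trained classifier with a new processed batch;
# new_features_reduced and new_labels are placeholders for your new data
clf.partial_fit(new_features_reduced, new_labels, classes=classes)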
