Cartesian product error on ALS recommendation - apache-spark

I'm trying to show a list of movie recommendations for a user. The model has been trained, but when trying to show the predictions, I'm getting an error.
als = ALS(maxIter=5, regParam=0.01, userCol="userID",
          itemCol="movieID", ratingCol="rating")
# ratings is a DataFrame of (movieID, rating, userID)
model = als.fit(ratings)
# allMovies is a DataFrame of (movieID, userID)
# it has userID=0 and all distinct movieID
recommendations = model.transform(allMovies)
recommendations.take(20)
I'm using pyspark.ml.recommendation.ALS, and when running the last line I get the error:
Detected cartesian product for LEFT OUTER join between logical plans.
Why is this happening? Thanks!

Answering my own question: it seems you shouldn't use transform but rather the recommendForUserSubset method.
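For anyone hitting the same thing, a minimal sketch of that call (assuming a SparkSession named spark is in scope; the 20 just mirrors the take(20) above):
# DataFrame holding the user(s) to recommend for; here just userID 0
users = spark.createDataFrame([(0,)], ["userID"])
# Top 20 movie recommendations for that user subset
userRecs = model.recommendForUserSubset(users, 20)
userRecs.show(truncate=False)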

Before model.transform you have to define the ALS input/output columns (the question's ALS(userCol=..., itemCol=..., ratingCol=...) already does this), or possibly you can get this to work with the older RDD-based mllib API:
rank = 10
numIterations = 100
# here ratings must be an RDD of Rating(user, product, rating), not a DataFrame
model = ALS.train(ratings, rank, numIterations)
# allMovies must be an RDD of (userID, movieID) pairs
recommendation = model.predictAll(allMovies).map(lambda r: ((r[0], r[1]), r[2]))
Hope this helps.

Related

How to get all documents per topic in bertopic modeling

I have a dataset and am trying to convert it to topics using BERTopic modeling, but the problem is I can't get all the documents of a topic. BERTopic only returns 3 documents per topic.
topic_model = BERTopic(verbose=True, embedding_model=embedding_model,
                       nr_topics='auto',
                       n_gram_range=(3, 3),
                       top_n_words=10,
                       calculate_probabilities=True,
                       seed_topic_list=topic_list,
                       )
topics, probs = topic_model.fit_transform(docs_test)
representative_doc = topic_model.get_representative_docs(1)  # topic 1
representative_doc
This topic contains more than 300 documents, but BERTopic only shows 3 of them with .get_representative_docs.
There are probably more elegant solutions, as I am not an expert, but I can share what worked for me (since there are no answers yet):
topics, probs = topic_model.fit_transform(docs_test) returns the topic assigned to each document.
Therefore, you can combine this output with the documents.
For example, combine them into a pandas DataFrame using
df = pd.DataFrame({'topic': topics, 'document': docs_test})
Now you can filter this dataframe for each topic to identify the referring documents.
topic_0 = df[df.topic == 0]
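From there, pulling out the actual texts is one more step; a short sketch using the column names of the DataFrame built above:
# All document texts assigned to topic 0
topic_0_docs = topic_0['document'].tolist()
print(len(topic_0_docs), topic_0_docs[:3])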
BERTopic also provides an API, get_document_info(), which returns a DataFrame with each document and the topic associated with it: https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_document_info
The response from this API looks like the following:
index  Document   Topic  Name      ...
0      doc1_text  241    kw1_kw2_  ...
1      doc2_text  -1     kw1_kw2_  ...
You can use this DataFrame to get all the documents associated with a particular topic, using a pandas groupby or however you prefer.
T = topic_model.get_document_info(docs)
docs_per_topics = T.groupby(["Topic"]).apply(lambda x: x.index).to_dict()
The code returns a dictionary like the one below:
{
-1: Int64Index([3,10,11,12,15,16,18,19,20,22,...365000], dtype='int64',length=149232),
0: Int64Index([907,1281,1335,1337,...308420,308560,308645],dtype='int64',length=5127),
...
}
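The dictionary values are document indices rather than texts; a small sketch of mapping them back (assuming docs is the same list that was passed to get_document_info):
# Look up the original texts for topic 0 by positional index
topic_0_texts = [docs[i] for i in docs_per_topics[0]]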

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test data-set after saving. However, the results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results are different despite my assumption to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also when inspecting the extracted topics they are different as well as the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.
I came up with the solution below after trying a multitude of things (a sketch of both steps follows this list):
1) Save cv_model after running it once and load it in the next iterations rather than re-fitting it.
2) This one is more related to my data set. Some of the documents in the corpus I was using were very small (around 3 words per document). I filtered these documents out and set a limit such that only documents with a minimum of 15 words (maybe higher in yours) are included in the corpus. I am not sure why this worked; maybe it is something related to the underlying complexity of the model.
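A minimal sketch of both points; the save path is arbitrary, and the size() call assumes the token column is an array of words:
from pyspark.ml.feature import CountVectorizerModel
from pyspark.sql.functions import size

# 1) Fit the vectorizer once, persist it, and reload it instead of re-fitting
cv_model.save("cv_model_saved")
cv_model = CountVectorizerModel.load("cv_model_saved")

# 2) Keep only documents with at least 15 tokens
filtered_df = tokenized_stopwords_sample_df.filter(
    size("requester_instruction_words_filtered_complete") >= 15)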
All in all, my results are now the same even after several iterations. Hope this helps.

Reconstructing k-means using pre-computed cluster centres

I'm using k-means for clustering with the number of clusters set to 60. Since some of the clusters come out as meaningless, I've deleted those cluster centers from the cluster-center array (count = 8) and saved the rest in clean_cluster_centers.
This time, I'm re-fitting the k-means model with init = clean_cluster_centers, n_clusters = 52 and max_iter = 1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model from clean_cluster_centers. The problem is that, since we are removing a large number of clusters, the model quickly converges to new, more stable centers even with max_iter = 1. Is there any way to recreate the k-means model?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
import copy

cls = KMeans().fit(X)
cls2 = copy.deepcopy(cls)  # KMeans has no .copy() method; deep-copy the fitted estimator
cls2.cluster_centers_ = new_cluster_centers
And now, since the predict function only checks that your object has a non-null cluster_centers_ attribute, you can use the predict function:
def predict(self, X):
    """Predict the closest cluster each sample in X belongs to.

    In the vector quantization literature, `cluster_centers_` is called
    the code book and each value returned by `predict` is the index of
    the closest code in the code book.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

    Returns
    -------
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
    """
    check_is_fitted(self, 'cluster_centers_')

    X = self._check_test_data(X)
    x_squared_norms = row_norms(X, squared=True)
    return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
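Putting it together, a one-line usage sketch under the same assumptions (X and new_cluster_centers as above):
# Labels computed against the cleaned centers, with no re-fitting
labels = cls2.predict(X)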

spark naive bayes prediction analysis

I have used Naive Bayes for Text Classification
Below is the link I used for understanding Naive Bayes
https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
Though I got good prediction results, I was not able to understand the reason for the failure cases.
I measured the probability of the features using predictProbabilities to understand the reason for the incorrect predictions.
Below is my understanding, based on which I am trying to find why the predictions are wrong in some cases.
Assume my test data is like below (I have around 100000 records for training):
Text - Classification
There is a murder in town - HIGH SEVERITY
The old women was murdered - HIGH SEVERITY
Boy was hit by ball in street - LOW SEVERITY
John sprained his ankle while playing - LOW SEVERITY
Now when I do a prediction for the sentence below:
"There is a murder in city" - I expect the model to predict HIGH SEVERITY.
But at times the model predicts LOW SEVERITY.
I pulled up all the text which has the same words and tried to figure out why this is happening.
If I compute the probability manually using the formula in https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/, it should have been predicted correctly.
But I could not find any clue why the prediction is going wrong.
Kindly let me know if I am missing any critical information.
Code Snippet Added Below
My training data frame consists of three columns: "id", "risk", "label".
The text is already lemmatized using Stanford NLP.
// TOKENIZE DATA
regexTokenizer = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words")
    .setPattern("\\W");
DataFrame tokenized = regexTokenizer.transform(trainingRiskData);
// REMOVE STOP WORDS
remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);
// COMPUTE TERM FREQUENCY USING HASHING
int numFeatures = 20;
hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
.setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
JavaRDD<LabeledPoint> labelledJavaRDD = featurizedData.select("label", "features").toJavaRDD()
    .map(new Function<Row, LabeledPoint>() {
        @Override
        public LabeledPoint call(Row arg0) throws Exception {
            LabeledPoint labeledPoint = new LabeledPoint(new Double(arg0.get(0).toString()),
                    (org.apache.spark.mllib.linalg.Vector) arg0.get(1));
            return labeledPoint;
        }
    });
NaiveBayes naiveBayes = new NaiveBayes(1.0, "multinomial");
NaiveBayesModel naiveBayesModel = naiveBayes.train(labelledJavaRDD.rdd(), 1.0);
Once the training model is built, the test data is passed through the same transformations and the prediction is done using the code below.
Column 3 is the label in the test data frame.
Column 7 is the features in the test data frame.
LabeledPoint labeledPoint = new LabeledPoint(new Double(dataFrameRow.get(3).toString()),
(org.apache.spark.mllib.linalg.Vector) dataFrameRow.get(7));
double predictedLabel = naiveBayesModel.predict(labeledPoint.features());
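Since predictProbabilities is already in play, one way to dig into a single wrong prediction is to recompute the multinomial Naive Bayes score by hand from the model's log-priors and log-likelihoods. A minimal sketch, assuming an equivalent model trained with pyspark.mllib (whose NaiveBayesModel exposes pi and theta) and that featureVector is the hashed/IDF vector for one test sentence:
import numpy as np

pi = np.array(model.pi)        # log prior per class
theta = np.array(model.theta)  # log P(feature | class), one row per class

x = featureVector.toArray()    # the mllib Vector for one test sentence

# Multinomial NB: score_c = log pi_c + sum_i x_i * log theta_c,i
scores = pi + theta.dot(x)
print(scores, scores.argmax())  # argmax gives the predicted class index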

How can I access computed metrics for each fold in a CrossValidatorModel

How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics but is it possible to get the raw results on each fold to look at eg. the variance of the results?
I am using Spark 2.0.0.
Studying the Spark source code for CrossValidator, you can do the iteration over the folds yourself like this:
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
// K-folding operation starting
// for each fold you have multiple models created cfm. the paramgrid
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
  val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
  val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
  val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
  trainingDataset.unpersist()
  var i = 0
  while (i < numModels) {
    val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
    logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
    metrics(i) += metric
    i += 1
  }
}
This is Scala, but the ideas are clearly outlined.
Take a look at this answer that outlines results per fold. Hope this helps.
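MLUtils.kFold is not exposed in the Python API, but the same idea works in PySpark with randomSplit. A minimal sketch, assuming df has the usual features/label columns and treating LogisticRegression and BinaryClassificationEvaluator as placeholder estimator and evaluator:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

k = 3
folds = df.randomSplit([1.0] * k, seed=42)  # k roughly equal folds
evaluator = BinaryClassificationEvaluator()

fold_metrics = []
for i in range(k):
    validation = folds[i]
    # union of the remaining folds as training data
    training = folds[(i + 1) % k]
    for j in range(k):
        if j != i and j != (i + 1) % k:
            training = training.union(folds[j])
    model = LogisticRegression().fit(training)
    fold_metrics.append(evaluator.evaluate(model.transform(validation)))

print(fold_metrics)  # one metric per fold, so you can inspect the variance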
