Get Spark metrics on each iteration step? - apache-spark

Applying Spark's logistic regression to a specific dataset requires defining a number of iterations. So far I've learned that outputting the result of the cost function on each iteration can be useful information to plot: it shows how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in Spark? Looping over a train() function with different iteration numbers sounds like a solution that requires a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.

After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
There's a nice example here in the Spark ML documentation -- once you know the right search terms.
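As a minimal sketch of reading the per-iteration history from a single training run: in PySpark, a fitted pyspark.ml LogisticRegressionModel exposes its training summary, and the summary carries objectiveHistory. The helper names (objective_history, iterations_until_flat, demo), the tolerance, and the assumption that train_df uses the default "features" and "label" columns are mine, not from the answer above.

```python
def objective_history(model):
    """Return the per-iteration objective (loss) values recorded during training.

    Assumption: `model` is a fitted pyspark.ml.classification.LogisticRegressionModel.
    """
    return list(model.summary.objectiveHistory)


def iterations_until_flat(history, tol=1e-6):
    """Hypothetical helper: first iteration whose change in loss is below `tol`."""
    for i in range(1, len(history)):
        if abs(history[i - 1] - history[i]) < tol:
            return i
    return len(history)


def demo(train_df):
    """Fit once and read the whole history (pyspark is imported lazily here)."""
    from pyspark.ml.classification import LogisticRegression

    # Assumption: train_df has the default "features" and "label" columns.
    lr = LogisticRegression(maxIter=50)
    my_model = lr.fit(train_df)
    return objective_history(my_model)
```

The returned list can be plotted directly against the iteration index, so there is no need to retrain with different iteration counts.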


Can I remove some topics from a BERTopic model?

I trained a BERTopic model and analysed the resulting topics. About half of them are good, but the others I don't need. Can I remove them from the model, to get faster predictions?
I had the same question and asked it in the package's GitHub Discussions. If you ask there, the package author answers very quickly.
Here is his answer to our question:
"Deleting topics is unlikely to help with speeding up the model in .transform as it is not possible to do that easily in the underlying models. Instead, I would either advise using the slower model and use .merge_topics to merge all unwanted topics into a single topic, so that it is easier to identify those. Or you can adjust the min_topic_size a bit lower to get a balance between helpful topics and speed of the transform function.
Do note that the transform function can be speed up by a number of different ways. For example, if you have an older GPU then embedding the documents can be much slower. In practice, it is helpful to identify which steps of the algorithm are relatively slow for you. By setting verbose=True, you have some indication of the time spent at each of those steps. If UMAP is too slow for you, then you can consider using PCA instead." -c Maarten Grootendorst
Also note that you could improve the speed of .transform by enabling GPU acceleration for the latter two stages (by default, only the first stage is GPU accelerated). You will find the info on that here
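A minimal sketch of the merge suggestion from the quote, assuming a fitted BERTopic model and the documents it was trained on. The wrapper name merge_unwanted_topics and the decision to skip topic -1 are assumptions of this sketch; only merge_topics(docs, topics_to_merge) is the library call.

```python
def merge_unwanted_topics(topic_model, docs, unwanted_topics):
    """Merge every unwanted topic into a single topic so it is easy to filter out.

    Assumptions: `topic_model` is a fitted bertopic.BERTopic instance and `docs`
    is the list of documents it was fitted on. Returns the ids passed to the merge.
    """
    # Deduplicate, and skip -1 (BERTopic's outlier topic) as a precaution.
    to_merge = sorted({t for t in unwanted_topics if t != -1})
    if len(to_merge) < 2:
        return to_merge  # fewer than two topics: nothing to merge
    topic_model.merge_topics(docs, to_merge)
    return to_merge
```

Topic ids may be renumbered after a merge, so it is worth inspecting the model again (for example with get_topic_info()) before filtering predictions.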

Efficient implementation of SOM (self-organizing map) on PySpark

I am struggling with the implementation of a performant version of a batch SOM algorithm on Spark/PySpark for a huge dataset with more than 100 features.
I have the feeling that I can either use RDDs, where I can (and have to) specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Using accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net, and sends the impacts to an accumulator in the driver. (I have implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Do Spark's optimizations realize that it does not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to that: (but with more performant calculation algorithms). But I would have to use some kind of loop over each row to update the net, which sounds rather slow.
Any thoughts on the different options? Is there an even better option?
Or are all of these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that?
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in PySpark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML because, as you know, the batch SOM is very similar to the k-means algorithm. I could actually reuse a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored in a single column. fit() then selects this column, unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen and, as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
I hope this answers your question. As for performance, since you asked for an efficient implementation: I have not run any benchmarks yet, but I use it at work and it crunches datasets of 500k to 1M rows in a couple of minutes on the production cluster.
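A minimal sketch of the per-partition accumulation pattern behind a batch SOM step, written in plain Python so it runs without a cluster. It is not the Scala implementation mentioned above: the function names, the Gaussian neighborhood, and the list-of-lists "partitions" are assumptions for illustration. In Spark, partition_stats would be the body of a mapPartitions call, merge_stats the reduce function, and the prototypes a broadcast variable.

```python
import math
from functools import reduce


def grid_coords(rows, cols):
    """Map positions (row, col) of the prototypes, in prototype order."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def best_matching_unit(x, prototypes):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, prototypes[k])))


def partition_stats(partition, prototypes, coords, sigma):
    """One worker's job: local weighted sums (num) and weights (den) per prototype."""
    dim = len(prototypes[0])
    num = [[0.0] * dim for _ in prototypes]
    den = [0.0] * len(prototypes)
    for x in partition:
        br, bc = coords[best_matching_unit(x, prototypes)]
        for k, (r, c) in enumerate(coords):
            # Gaussian neighborhood weight between prototype k and the BMU.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
            den[k] += h
            for j in range(dim):
                num[k][j] += h * x[j]
    return num, den


def merge_stats(a, b):
    """The reduce step: element-wise sum of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def batch_som_step(partitions, prototypes, coords, sigma):
    """One batch update: each new prototype is a neighborhood-weighted mean."""
    stats = (partition_stats(p, prototypes, coords, sigma) for p in partitions)
    num, den = reduce(merge_stats, stats)
    return [[v / d for v in row] if d > 0 else list(old)
            for row, d, old in zip(num, den, prototypes)]
```

Each partition only ships one small (num, den) pair back, rather than one accumulator update or one full net per row, which is the point of distributing the computation by partition. With a very small sigma the step reduces to a k-means update, which matches the similarity noted above.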

Aggregate training results to predict

When training a model, the results depend on the sampling. To obtain something better, you could repeat the training (on another randomly created training sample, using KFold, StratifiedKFold, ...), somehow aggregate the results, and this way get a result that is more robust than one created from a single sample. Question: is this already implemented in sklearn or similar? Apologies if this is a straightforward question; I haven't seen a simple solution.
I see that there is a function called cross_val_predict. However, my first impression from a quick look at the source code is that it predicts as many times as it trains, and I would like to predict only once, so that I can pickle the (somehow aggregated) results and predict later instead of repeating the whole training again.
So far I think the best option is the ensemble methods in sklearn.
I leave here the solution I was using before. I am pretty sure it could be improved (as mentioned before, the ensemble methods in sklearn are better). I have placed it here, where I have left a notebook with an example (using the iris dataset), in case anyone wants to play around and see in detail how it could be done.
That solution trains models (in parallel, using joblib), pickles each trained model (a model from sklearn), stores the results (using joblib dump), and later recovers them to create predictions (in parallel, using joblib) that are then aggregated.
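A minimal sketch of the train, store, reload, and aggregate flow described above, using only the standard library so it runs anywhere. It is not the notebook's code: a tiny nearest-centroid "model" stands in for an sklearn estimator, pickle stands in for joblib dump, the folds are a simple deterministic split rather than StratifiedKFold, everything runs sequentially, and all function names are hypothetical.

```python
import pickle
from collections import Counter


def fit_centroids(X, y):
    """Stand-in 'model': one mean vector per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(centroids, X):
    """Assign each row to the label of its nearest centroid."""
    def nearest(row):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])))
    return [nearest(row) for row in X]


def train_and_store(X, y, n_folds=3):
    """Train one model per fold (each leaves one fold out) and pickle it."""
    blobs = []
    for fold in range(n_folds):
        keep = [i for i in range(len(X)) if i % n_folds != fold]
        model = fit_centroids([X[i] for i in keep], [y[i] for i in keep])
        blobs.append(pickle.dumps(model))
    return blobs


def predict_aggregated(blobs, X):
    """Reload every stored model and combine predictions by majority vote."""
    all_preds = [predict(pickle.loads(b), X) for b in blobs]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]
```

Training happens once; afterwards only the stored models are needed to predict. With real sklearn estimators, VotingClassifier or BaggingClassifier cover much of the same ground.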

Best for resume, document matching

I have used three different ways to calculate the match between a resume and a job description. Can anyone tell me which method is best, and why?
I used NLTK for keyword extraction and then RAKE for keyword/keyphrase scoring, then I applied cosine similarity.
Scikit for keyword extraction, tf-idf and cosine similarity.
Gensim library with an LSA/LSI model to extract keywords and calculate cosine similarity between documents and query.
Nobody here can give you the answer. The only way to decide which method works better is to have one or more humans independently match lots and lots of resumes and job descriptions, and compare what they do to what your algorithms do. Ideally you'd have a dataset of already matched resumes and job descriptions (companies must do this kind of thing when people apply), because it takes a lot of work to create a sufficiently large dataset.
Next time you take on this kind of project, start by considering how you are going to evaluate the performance of the solution you'll put together.
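A minimal sketch of such an evaluation, assuming you already have human yes/no judgments for a set of (resume, job description) pairs and one similarity score per pair from each method. The function names, the method names in the usage, and the choice of a pairwise ranking score (equivalent to ROC AUC) are assumptions of this sketch, not something the answer prescribes.

```python
def ranking_accuracy(scores, human_labels):
    """Fraction of (match, non-match) pairs that a method orders correctly.

    Ties count as half. 1.0 means perfect agreement with the human labels,
    0.5 is what random scores would give.
    """
    pos = [s for s, y in zip(scores, human_labels) if y == 1]
    neg = [s for s, y in zip(scores, human_labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one match and one non-match")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compare_methods(method_scores, human_labels):
    """Rank methods (name -> list of scores) from best to worst agreement."""
    results = {name: ranking_accuracy(s, human_labels)
               for name, s in method_scores.items()}
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

For example, compare_methods({"rake": [...], "tfidf": [...], "lsi": [...]}, labels) would give one comparable number per method on the same labeled pairs.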
As already mentioned in other answers, try to use Doc2Vec.
Using Doc2Vec from Gensim on both corpora (CVs and job descriptions) separately, and then taking the cosine similarity between the two vectors, seems to be the easiest flow to get working. It works better than the others on documents that are not similar in form and wording but are similar in context and semantics, so keywords alone would not help much here.
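A minimal sketch of the comparison step only, assuming the document vectors have already been produced (for instance with Gensim's Doc2Vec and its infer_vector method). The function names and the dictionary layout of resume vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_resumes(job_vector, resume_vectors):
    """Return (resume_id, similarity) pairs, most similar first."""
    scored = [(rid, cosine_similarity(job_vector, vec))
              for rid, vec in resume_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Cosine similarity ignores vector length, so a long resume and a short job description can still score highly if their vectors point in the same direction.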
Then you can try to train a CNN on a corpus of matched CV and JD pairs, with yes/no labels if available, and use it to qualify CVs/resumes against job descriptions.
Basically, I'm going to try these approaches on pretty much the same task; please see
Since it's highly likely that the job description and the resume content differ, you should think from a semantic point of view. One thing you can do is use some domain knowledge, but it's pretty difficult to gain domain knowledge for a variety of job types. Researchers sometimes use a dictionary to augment the similarity matching between documents.
Researchers are using deep neural networks to capture both the syntactic and the semantic structure of documents. You can use Doc2Vec to compare two documents; Gensim can produce the Doc2Vec representation for you. I believe that will give better results than keyword extraction and similarity computation. You can also build your own neural network model and train it on job descriptions and resumes. I guess neural networks will be effective for your work.

how to determine the number of topics for LDA?

I am new to LDA and I want to use it in my work. However, I have run into some problems.
In order to get the best performance, I want to estimate the best number of topics. After reading "Finding scientific topics", I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) to estimate P(w|T).
My question is what does the "a series of" mean?
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. But sometimes the model with the highest coherence may not fit the bill.
See this topic modeling example.
First, some people use the harmonic mean to find the optimal number of topics. I also tried it, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" will be useful. It has four metrics for calculating the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for topic modeling, although V-fold cross-validation is a bit time-consuming for large datasets. You can see "A heuristic approach to determine an appropriate number of topics in topic modeling".
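A minimal sketch of the harmonic mean estimate, assuming the "series" in the question is a set of log P(w|z) values, one per sample of z drawn by the Gibbs sampler for a fixed number of topics T. Working in log space with the log-sum-exp trick avoids underflow, since these log-likelihoods are typically large negative numbers. The function names and the input layout are hypothetical.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of P(w|z) values, given their logarithms.

    log HM = log M - log(sum_m exp(-ll_m)), evaluated stably.
    """
    m = len(log_likelihoods)
    neg = [-ll for ll in log_likelihoods]
    shift = max(neg)
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in neg))
    return math.log(m) - log_sum


def best_num_topics(samples_by_k):
    """Pick the k whose estimated log P(w|T=k) is largest.

    `samples_by_k` maps each candidate k to its list of log P(w|z) samples.
    """
    return max(samples_by_k, key=lambda k: log_harmonic_mean(samples_by_k[k]))
```

The harmonic mean is dominated by the smallest likelihoods among the samples, which is one reason the estimate can be unstable in practice.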
Let k = number of topics
There is no single best way, and I am not even sure there are any standard practices for this.
Method 1:
Try out different values of k, select the one that has the largest likelihood.
Method 2:
Instead of LDA, see if you can use HDP-LDA.
Method 3:
If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.
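A minimal sketch of how Method 3 feeds into Method 1: take the k suggested by HDP-LDA on a sample, build a small interval around it, and keep the k with the best score. The function names, the default radius, and the `score` callable (which would train LDA with k topics and return its likelihood) are placeholders, not part of any library.

```python
def candidate_ks(k_hint, radius=3, k_min=2):
    """Values of k in a small interval around the HDP-LDA estimate."""
    return [k for k in range(k_hint - radius, k_hint + radius + 1) if k >= k_min]


def best_k(candidates, score):
    """Method 1: evaluate every candidate and keep the highest-scoring k.

    `score` is a hypothetical callable that trains a model with k topics and
    returns a number where larger is better (e.g. held-out log-likelihood).
    """
    return max(candidates, key=score)
```

Only 2 * radius + 1 models need to be trained on the full corpus, instead of one per value of k over a wide range.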
Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method. First you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and then use the number of found clusters as number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.
