How to make randomSplit generate ordered items across all splits - apache-spark

I am developing a prediction model on time-series data. I am using TrainValidationSplit in Spark to train and validate my model before testing it on unseen data.
In the validation phase, I need the data to be ordered by the timestamps in my input RDD (think of it as RDD[(timestamp, text)]). I know TrainValidationSplit uses the randomSplit method, which shuffles my training data. I need a way to hand my training data to TrainValidationSplit so that, when it divides the data into training and validation sets, both sets are ordered by timestamp.
I would like to know whether there is any way to make randomSplit generate elements across the resulting RDDs that are ordered in the following way. For example, if my RDD is (1,3,4,5,7,8), then rdd.randomSplit([0.5, 0.5]) should generate the first RDD as (1,4,3) and the second one as (7,5,8). The order within each split is not important, but overall the IDs (timestamps) in the first split should be less than those in the second split.
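randomSplit itself has no ordering option (it samples each row independently), but the split described here can be produced by cutting the RDD at a timestamp boundary instead. A minimal PySpark sketch, with illustrative data matching the example above:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stand-in for the question's RDD[(timestamp, text)] -- values are illustrative.
rdd = sc.parallelize([(1, "a"), (3, "b"), (4, "c"), (5, "d"), (7, "e"), (8, "f")])

train_fraction = 0.5
cutoff = int(rdd.count() * train_fraction)

# Sort by timestamp, attach a position index, and cut at the desired fraction,
# so every timestamp in the first split is <= every timestamp in the second.
indexed = rdd.sortByKey().zipWithIndex()   # ((timestamp, text), position)
first = indexed.filter(lambda x: x[1] < cutoff).keys()
second = indexed.filter(lambda x: x[1] >= cutoff).keys()

print(first.collect())   # [(1, 'a'), (3, 'b'), (4, 'c')]
print(second.collect())  # [(5, 'd'), (7, 'e'), (8, 'f')]

Compared with randomSplit this costs an extra count() and a sortByKey(), but it guarantees that every timestamp in the first split precedes those in the second.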

Related

Spark - The purpose of saving an ALS model

I'm trying to understand what the purpose of storing an ALS model would be, and what a use case for the stored model might look like.
I have a dataset with over 300M rows, and I'm using a Hadoop cluster and Spark to calculate recommendations based on the ALS algorithm.
The whole computation takes around 5h, and I'm wondering what the case for storing my model and reusing it, for example, the next day would be... I don't see any. So either I'm doing something wrong (which is possible, given that I'm a beginner in the ML world), or the ALS algorithm in Spark and the ability to save it to disk are not very helpful.
Right now, I use it as follows:
from pyspark.ml.recommendation import ALS

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
df_recommendations = model.recommendForAllUsers(10)
As I mentioned, df_input is a DataFrame which contains over 300M rows. The total calculation time is around 5h, and after that I receive 10 recommended items for each user in the dataset.
In many tutorials and books there is an example of training the model and validating it with test data, something like:
from pyspark.ml.recommendation import ALS, ALSModel

df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
(training, test) = df_input.randomSplit(weights = [0.7, 0.3])
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
model.write().save("saved_model")
...
model = ALSModel.load('saved_model')
predictions = model.transform(test)  # or df_input to get predictions for each user
I don't see any pros of using it in such a way. However, I do see one big con: you don't use 30% of the data to train the model.
As far as I know, there isn't a way to use the ALS model online (in real time), at least not without an external package/library.
You can't incrementally update this model.
You can't use it for newly registered users, because they don't exist in the stored matrix factorization, so there won't be any recommendations for them.
All you can do is check what the prediction would be for a given user-item pair, which is basically the same thing that would be returned in the first code example (the one using the fit() method).
What would be a reason to store this model on disk and load it when needed? Or when (under what conditions) should I consider storing the model and reusing it? Could you provide a use case?
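For reference, a minimal sketch of the store-and-reload mechanics being weighed here (not an endorsement of either approach), assuming the df_input defined above; recommendForUserSubset is the PySpark ALSModel method for scoring only a chosen set of users instead of all of them:

from pyspark.ml.recommendation import ALS, ALSModel

# Day 1: the expensive (~5h) fit, persisted to disk.
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user",
          itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
model.write().overwrite().save("saved_model")

# Later: reload the fitted factors instead of refitting, and score only the
# users needed right now (the subset chosen here is purely illustrative).
model = ALSModel.load("saved_model")
some_users = df_input.select("user").distinct().limit(1000)
df_recommendations = model.recommendForUserSubset(some_users, 10)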

Using a different test/train split for each target

I plan on using a data set which contains 3 target values of interest. Ultimately I will be trying classification methods on a binary target and also plan on using regression methods for two separate continuous targets.
Is it bad practice to do a different train/test split for each target variable?
Otherwise, I am not sure how to split the data in a way that will allow me to predict each target separately.
If they're effectively 3 different models, trained and evaluated separately, then for the purposes of scientifically evaluating each model's performance it doesn't matter if you use different train/test splits for each model, as no information will leak from model to model. But if you plan on comparing the results of the 3 models, or combining all 3 scores into some aggregate metric, then you would want to use the same train/test split so that all 3 models work from the same training data. Otherwise the performance of each model will likely depend to some extent on the test data of the other models (rows held out for one model can end up in another model's training set), and therefore your combined score will to some extent be a function of your test data.
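A minimal sketch of the shared-split case, assuming a pandas DataFrame with hypothetical feature columns and three target columns; the rows are split once and reused for all three models:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: shared features plus one binary and two continuous targets.
df = pd.DataFrame({
    "x1": range(100),
    "x2": range(100, 200),
    "y_binary": [i % 2 for i in range(100)],
    "y_cont_a": [i * 0.5 for i in range(100)],
    "y_cont_b": [i * 1.5 for i in range(100)],
})

# Split the rows once, then slice out each target from the same train/test rows,
# so all three models are trained and evaluated on identical observations.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

features = ["x1", "x2"]
X_train, X_test = train_df[features], test_df[features]
targets = {t: (train_df[t], test_df[t]) for t in ["y_binary", "y_cont_a", "y_cont_b"]}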

How to find out what a cluster represents on a PCA biplot?

I am building a k-means model and have multiple variables to feed into it. Because of this, I am using PCA to reduce the data to two dimensions. When I display the PCA biplot, I don't understand what similarities the data points share that cause them to be grouped into a specific cluster. I am using a customer segmentation dataset. I.e., I want to be able to say that a specific cluster is a cluster because its customers have a low income but spend a lot of money on products.
Since you are using k-means:
Compute the mean of each cluster on the original data. Now you can compare these attributes.
Alternatively: don't use PCA in the first place if it hinders your analysis... k-means is as good as PCA at coping with several dozen variables.
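A minimal sketch of that suggestion, with hypothetical column names and values standing in for the customer-segmentation data; the clusters are characterized by their per-cluster means on the original variables, not on the PCA components:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data with the two traits mentioned in the question.
df = pd.DataFrame({
    "income":   [15, 18, 20, 90, 95, 100, 50, 55, 60],
    "spending": [80, 85, 90, 20, 15, 10, 50, 45, 55],
})

# Cluster on the (scaled) original variables; PCA is only needed for the 2-D plot.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-cluster means on the original columns make each cluster readable,
# e.g. "low income, high spending".
print(df.groupby("cluster").mean())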

Decomposing Spark RDDs

In Spark, it is possible to compose multiple RDDs into one using zip, union, join, etc.
Is it possible to decompose an RDD efficiently, that is, without performing multiple passes over the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
GroupBy: groupBy returns an RDD[(K, Iterable[T])] which does not give you the RDD benefits on the group itself (the iterable).
Aggregations: reduceByKey, foldByKey, etc. perform only one "iteration" over the data, and do not have the expressive power for implementing iterative algorithms.
Creating separate RDDs using the filter method, with multiple passes over the data (where the number of passes equals the number of keys), which is not feasible when the number of keys is not very small.
Some of the use cases I am considering are, given a very large (tabular) dataset:
We wish to execute some iterative algorithm on each of the columns separately, for example some automated feature extraction. A natural way to do so would be to decompose the dataset such that each column is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example a dataset per day) and execute some machine learning modeling on each of them.
I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job), then load the per-key files into one RDD each.
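A rough PySpark sketch of that write-then-reload pattern, using the DataFrame writer's partitionBy as one way to get one output directory per key in a single pass; paths, keys, and column names are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative keyed records standing in for the RDD[T] in the question.
rdd = spark.sparkContext.parallelize([
    ("2020-01-01", 1.0), ("2020-01-01", 2.0), ("2020-01-02", 3.0),
])

# Single pass over the data: one output directory per key value.
df = rdd.toDF(["day", "value"])
df.write.mode("overwrite").partitionBy("day").parquet("/tmp/by_day")

# Later, each key's data can be loaded as its own dataset (or RDD) and
# iterated over independently of the others.
day1 = spark.read.parquet("/tmp/by_day/day=2020-01-01")
day1_rdd = day1.rdd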

Advanced feature extraction for cross-validation using sklearn

Given a dataset with 1000 samples, suppose I would like to preprocess the data in order to obtain 10000 rows, so that each original row leads to 10 new samples. In addition, when training my model I would like to be able to perform cross-validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition going on right now on Kaggle.
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
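A toy sketch of that idea with made-up data, showing only the stratified-split part; the custom score function that ignores the non-original samples would still need to be supplied separately:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up stand-ins: 100 "original" samples, each expanded into 10 derived rows.
n_original, expansion = 100, 10
rng = np.random.RandomState(0)
X = rng.rand(n_original * expansion, 5)                        # expanded feature rows
y = np.repeat(rng.randint(0, 2, size=n_original), expansion)   # label copied to each derived row
original_idx = np.repeat(np.arange(n_original), expansion)     # which original row each derived row came from

# Stratify on the original-sample index so every fold contains derived rows
# drawn from every original sample.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = cv.split(X, original_idx)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, cv=splits))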
