Feature importance of each variable in GBT Classifier PySpark [duplicate]

This question already has answers here:
PySpark & MLLib: Random Forest Feature Importances
(5 answers)
Closed 5 years ago.
How do I get the corresponding feature importance of every variable in a GBT Classifier model in PySpark?

From Spark 2.0+ you have the attribute:
model.featureImportances
This will give a sparse vector of feature importances, with one entry per column/attribute.
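For example, a minimal sketch (the DataFrame train_df and its column names are assumptions, not from the question):

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=10)
model = gbt.fit(train_df)  # train_df is a hypothetical training DataFrame

importances = model.featureImportances  # SparseVector, one entry per feature
for idx, score in zip(importances.indices, importances.values):
    print(idx, score)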

Related

Understanding Spark data for MLlib [duplicate]

This question already has answers here:
How to understand the format type of libsvm of Spark MLlib?
(1 answer)
How can I read LIBSVM models (saved using LIBSVM) into PySpark?
(1 answer)
Closed 4 years ago.
I am reading about binary classification with SparkML data. I have read the Java code of Spark and I am familiar with binary classification, but I cannot understand how these data are generated. For example: https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
This link is a sample for binary classification. If I want to generate this type of data, how can I do that?
Usually, the first column is the class label (in this case 0/1) and the remaining columns are the feature values; in this particular file the features are stored as sparse index:value pairs in LIBSVM format.
To generate the data yourself you can use a random generator, for instance (see the rough sketch at the end of this answer), but it depends on the problem you are working on.
If you need to download datasets to apply classification algorithms, you can use repositories such as the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
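For instance, a rough sketch that writes random binary-classification data in LIBSVM format; the file name, row count and feature count below are arbitrary assumptions:

import random

n_rows, n_features = 100, 692
with open("random_binary_classification_data.txt", "w") as f:
    for _ in range(n_rows):
        label = random.randint(0, 1)
        # pick a few non-zero features; LIBSVM indices are 1-based
        indices = sorted(random.sample(range(1, n_features + 1), 10))
        pairs = " ".join("%d:%.4f" % (i, random.random()) for i in indices)
        f.write("%d %s\n" % (label, pairs))

Spark can then load such a file with spark.read.format("libsvm").load(...).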

How can I operationalize a SparkML model as a real-time webservice? [duplicate]

This question already has answers here:
How to serve a Spark MLlib model?
(4 answers)
Closed 5 years ago.
Once a SparkML model has been trained on a Spark cluster, how can I take the trained model and make it available for scoring through a RESTful API?
The problem is that the model requires a SparkContext in order to be loaded. Is there a way to 'fake it', since it does not really seem necessary, or what is the minimum required to create a SparkContext?
In some cases, yes, it can be done.
Many Spark models can be exported to PMML, a standardized format for ML models, and then used from plain Java with the JPMML libraries, e.g. https://github.com/jpmml/jpmml-sparkml
How to export is described in this question: Spark ml and PMML export (a rough sketch also follows below).
You can also use Spark Streaming to calculate values, but it will have higher latency until Continuous Processing Mode becomes available.
For very time-consuming calculations, such as recommendation algorithms, I think it is quite normal to pre-calculate values and save them in a database like Cassandra.
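As a rough sketch, exporting a fitted pipeline with the pyspark2pmml wrapper around jpmml-sparkml could look like the following; the names df and pipeline_model are assumptions, and the package must be installed with its JARs on the Spark classpath first:

from pyspark2pmml import PMMLBuilder

# df is the training DataFrame the pipeline was fitted on,
# pipeline_model is the fitted pyspark.ml PipelineModel (both hypothetical names)
pmml_builder = PMMLBuilder(sc, df, pipeline_model)
pmml_builder.buildFile("model.pmml")

The resulting .pmml file can then be scored from a plain Java web service (for example with jpmml-evaluator) without any SparkContext.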

Keras LSTM with varying timesteps [duplicate]

This question already has an answer here:
Classifying sequences of different lengths [duplicate]
(1 answer)
Closed 5 years ago.
Suppose the input to the model is a series of vectors, each of equal length; however, the number of vectors in each input can change. I want to make an LSTM model using Keras, but if I were to write
inputs = keras.layers.Input(dims)
img_out = keras.layers.LSTM(16)(inputs)
Then what would I put for "dims"? Thanks so much.
You can fix an upper bound for the number of timesteps in dims. When an input has fewer vectors than that, you can pad the rest with zero vectors, as in the sketch below.
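A minimal sketch of that idea; the values of MAX_TIMESTEPS and FEATURE_DIM are arbitrary assumptions, and the Masking layer makes the LSTM skip the all-zero padding steps:

import numpy as np
from tensorflow import keras

MAX_TIMESTEPS = 50  # chosen upper bound on the number of vectors per input
FEATURE_DIM = 8     # length of each vector

inputs = keras.layers.Input(shape=(MAX_TIMESTEPS, FEATURE_DIM))
masked = keras.layers.Masking(mask_value=0.0)(inputs)  # ignore zero-padded steps
img_out = keras.layers.LSTM(16)(masked)
model = keras.Model(inputs, img_out)

# pad variable-length sequences with zero vectors up to MAX_TIMESTEPS
sequences = [np.random.rand(t, FEATURE_DIM) for t in (12, 30, 50)]
padded = np.zeros((len(sequences), MAX_TIMESTEPS, FEATURE_DIM), dtype="float32")
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq
print(model.predict(padded).shape)  # (3, 16)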

Spark MLlib ALS trainImplicit value more than 1 [duplicate]

This question already has an answer here:
Spark ALS recommendation system have value prediction greater than 1
(1 answer)
Closed 3 years ago.
I have been experimenting with Spark MLlib ALS ("trainImplicit") for a while now.
I would like to understand:
1. Why am I getting rating values greater than 1 in the predictions?
2. Is there any need to normalize the user-product input?
sample result:
[Rating(user=316017, product=114019, rating=3.1923),
Rating(user=316017, product=41930, rating=2.0146997092620897)
]
In the documentation, it is mentioned that the predicted rating values will be somewhere around 0-1.
I know that the ratings values can still be used in recommendations but it would be great if I know the reason.
The cost function in ALS trainImplicit() doesn't impose any condition on the predicted rating values, since it only takes the magnitude of the difference from 0/1, so you may also find some negative values there. That is why it says the predicted values are around [0, 1], not necessarily within it.
There is an option to use non-negative factorization only, so that you never get a negative value in the predicted ratings or the feature matrices (see the sketch below), but that seemed to hurt performance in our case.
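As a rough sketch with the RDD-based API the question uses; ratings_rdd is a hypothetical RDD of Rating(user, product, rating) tuples, and the hyperparameters are arbitrary:

from pyspark.mllib.recommendation import ALS, Rating

# nonnegative=True forces non-negative user and product factors,
# so the predicted ratings can no longer be negative
model = ALS.trainImplicit(ratings_rdd, rank=10, iterations=10,
                          alpha=0.01, nonnegative=True)
predictions = model.predictAll(ratings_rdd.map(lambda r: (r.user, r.product)))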

Difference between keras.backend.max and keras.backend.argmax [duplicate]

This question already has answers here:
numpy: what is the logic of the argmin() and argmax() functions?
(5 answers)
Closed 2 years ago.
I am a beginner in deep learning, and while doing a practical assignment I came across the Keras documentation on keras.backend.
I went through the explanation a number of times; however, I cannot exactly understand the difference between the max and argmax functions.
argmax returns the index of the maximum value in an array, and max returns the maximum value itself. Please check the example given below:
import tensorflow as tf

x = tf.constant([1, 10, 2, 4, 15])
print(tf.keras.backend.argmax(x, axis=-1).numpy())  # 4, the index of the max value 15
print(tf.keras.backend.max(x, axis=-1).numpy())     # 15, the max value itself
