Thanks for reading.
I need some help here on something which should be simple but doesn't seem to be.
I'm running PCA in Pyspark and I'm looking to find out the names of the features it is selecting as part of the dimension reduction.
I get to the dense matrix but I'm not sure how to get to something interpretable. Ideally I'd like a spark dataframe with feature names included.
I'm having the same problems with the feature selectors within spark it seems theres no though gone into interpretability once these techniques are applied.
Who knew you might want to know which features were being selected....
Any way thanks very much for your help!
Related
When training the model the results depend on the sampling. In order to obtain something better you could repeat the training (in another randomly create training sample, using Ffolds, StratifiedKFold ... ), somehow aggregate the results and have this way a result that will be more robust that one create in a particular case alone. Question: is it already implemented in sklearn or similar?. Apologies is this is a straighforward question, I haven't see a simple solution.
I see that there is a function called cross_val_predict however my first impresion having a quick look to the source code is that it predecits as many times as trains and I would like to predicts only ones, so I can piclke the, somehow aggregate results, and predict later, instead of repeat the whole training thing again.
So far I think the best option are the ensemblers in sklearn.
I left here the solution I was using before. I am pretty sure could be improved (as mentioned before the Ensemblers in sklearn) are better. I have placed here https://github.com/rafaelvalero/aggreating_predictions_sklearn, where I have left a notebook with and example (using iris database), in case anyone can play around and see in details how could be done.
That solution will train models (in parallel, using joblib), pickle the trained model (a model from SKlearn), store the results (using joblib dump) and later would recover them to create predictions (in parallel, using joblib) that later are aggregated.
I have seen samples where the input data for the features are just any double values.
I am wondering if I need to normalize the input features for the MultilayerPerceptronClassifier to the range [-1,1] or [0,1].
I could not find that information in the Spark Documentations.
https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier
Maybe it is a thing I have to decide depending of the results..
.. then I might want to use one of these:
Normalizer
StandardScaler
MinMaxScaler
MaxAbsScaler
Yes, you should normalize them. This is not specific to any framework, but a general good practice for neural networks. If you do not normalize inputs and outputs, you might run into learning issues.
Whatever [0,1 ] or [-1,1], both work equally well. There is probably little difference.
I have a task to compare two images and check whether they are of the same class (using Siamese CNN). Because I have a really small data set, I want to use keras imageDataGenerate.
I have read through the documentation and have understood the basic idea. However, I am not quite sure how to apply it to my use case, i.e. how to generate two images and a label that they are in the same class or not.
Any help would be greatly appreciated?
P.S. I can think of a much more convoluted process using sklearn's extract_patches_2d but I feel there is an elegant solution to this.
Edit: It looks like creating my own data generator may be the way to go. I will try this approach.
I want to predict my input price based on a list of questions/answers using azure machine learning.
I built one using the "bayesian linear regression" but it seems that it is predicting the price based on the prices i have in my dataset and not based on the Q/A.
Am i in the wrong path or am i missing something?
Any suggestion would be helpful.
Check the Q/A s that you using is not having missing values. If there's any missing values follow data preprocessing techniques to fill those.
What kind of answers do you have as inputs? (yes/no, numeric values, different textual answers, etc...) In my opinion numerical values and yes/no inputs makes your model more accurate.
Try different regression algorithms (https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/) and check their accuracy.
you need to set features and label properly. if you publish your experiment in Gallery using unlisted mode and paste the link here, we can take a look.
Applying spark's logistic regression on a specific dataset requires to define a number of iterations. So far I've learned that outputting the result of the cost function on each iteration might be useful information to plot. It can be used to visualize how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in spark? Looping over a train() function with different iteration numbers, sounds like a solution that requires a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.