Extracting fp tree from Pyspark FPGrowth MLlib model - apache-spark

Has anybody tried doing this? It is possible to extract frequent itemsets and association rules, but what about the tree itself? Or, if the tree is not kept internally, how could it be reconstructed?
Link to the docs:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.fpm.FPGrowth.html
Any tips, links etc. will be greatly appreciated. Thanks!
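
As far as I can tell, the fitted PySpark model exposes only the frequent itemsets and association rules, not a tree, so reconstruction seems to be the practical route. Below is a minimal sketch, in plain Python, of the textbook FP-tree construction from raw transactions. The names FPNode, build_fp_tree and dump are made up for illustration and are not part of Spark; the tree Spark builds internally may differ (for example in item ordering or because it is built per partition).

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, a count, and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First pass: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second pass: insert each transaction as a path of frequent items,
    # ordered by descending support (ties broken by item name).
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root


def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for item in sorted(node.children):
        child = node.children[item]
        lines.append("  " * depth + f"{item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
tree = build_fp_tree(transactions, min_count=2)
print("\n".join(dump(tree)))
```

With a Spark DataFrame, the transactions would first have to be collected to the driver (or sampled), which is an assumption that only holds for data small enough to fit in memory.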

Related

Get PCA matrix with feature names in pyspark

Thanks for reading.
I need some help here on something which should be simple but doesn't seem to be.
I'm running PCA in Pyspark and I'm looking to find out the names of the features it is selecting as part of the dimension reduction.
I get to the dense matrix but I'm not sure how to get to something interpretable. Ideally I'd like a spark dataframe with feature names included.
I'm having the same problem with the feature selectors in Spark; it seems there's no thought gone into interpretability once these techniques are applied.
Who knew you might want to know which features were being selected....
Anyway, thanks very much for your help!
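
One thing worth noting: PCA does not select features. Each principal component is a weighted combination of all input features, so the interpretable object is the table of weights (loadings) per feature. Here is a small sketch of attaching names to such a matrix. It assumes the matrix has one row per input feature and one column per component, which I believe is the layout of the fitted PySpark model's pc matrix (something like model.pc.toArray().tolist()), and that the names are listed in the same order as the VectorAssembler inputCols. The helper names label_loadings and top_features are hypothetical.

```python
def label_loadings(pc_rows, feature_names):
    """Pair each row of the loadings matrix with its feature name."""
    if len(pc_rows) != len(feature_names):
        raise ValueError("expected one row of loadings per feature")
    table = []
    for name, row in zip(feature_names, pc_rows):
        entry = {"feature": name}
        for k, weight in enumerate(row, start=1):
            entry[f"PC{k}"] = weight
        table.append(entry)
    return table


def top_features(table, component, n=3):
    """Features with the largest absolute weight on one component."""
    ranked = sorted(table, key=lambda r: abs(r[component]), reverse=True)
    return [r["feature"] for r in ranked[:n]]


# Made-up loadings for three features and two components.
pc_rows = [[0.8, 0.1], [-0.6, 0.2], [0.0, -0.97]]
names = ["age", "income", "height"]

table = label_loadings(pc_rows, names)
print(top_features(table, "PC1", n=2))
```

The resulting list of dicts is plain row data, so it should be straightforward to turn into a Spark or pandas DataFrame if that is the preferred format.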

Join decision trees models into one decision tree

I have five decision trees for five datasets. I want to combine them all into one decision tree.
I believe it is something similar to the bagging technique. It would be great if experts could post a few helpful links. I am not looking to retrain on the datasets, just to combine the trees. Help is appreciated. TIA :)
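
Merging several trained trees into one literal tree is not straightforward. A simpler alternative that needs no retraining is to keep all five trees and combine their predictions by majority vote, which is what bagging does at prediction time. A minimal sketch follows; note that it builds an ensemble, not a single tree, and the lambdas are stand-ins for real fitted trees.

```python
from collections import Counter


def majority_vote(predictors, x):
    """Return the label predicted by most of the predictors for sample x."""
    votes = Counter(predict(x) for predict in predictors)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for five trained trees; each maps a sample to a label.
trees = [
    lambda x: "yes" if x["age"] > 30 else "no",
    lambda x: "yes" if x["income"] > 50 else "no",
    lambda x: "yes" if x["age"] > 40 else "no",
    lambda x: "no",
    lambda x: "yes",
]

print(majority_vote(trees, {"age": 35, "income": 60}))
```

With an odd number of trees and two classes there are no ties; for other setups a tie-breaking rule would have to be chosen.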

Image Augmentation of Siamese CNN

I have a task to compare two images and check whether they are of the same class (using a Siamese CNN). Because I have a really small data set, I want to use Keras's ImageDataGenerator.
I have read through the documentation and have understood the basic idea. However, I am not quite sure how to apply it to my use case, i.e. how to generate two images and a label that they are in the same class or not.
Any help would be greatly appreciated.
P.S. I can think of a much more convoluted process using sklearn's extract_patches_2d but I feel there is an elegant solution to this.
Edit: It looks like creating my own data generator may be the way to go. I will try this approach.
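
A rough sketch of what such a custom generator could look like, written in plain Python so the pairing logic is visible. The function pair_generator and its augment hook are hypothetical; augment is where a random image transform would go (with Keras, ImageDataGenerator's random_transform method might serve, if I recall its name correctly). Strings stand in for image arrays here.

```python
import random


def pair_generator(images, labels, batch_size, augment=lambda img: img, seed=None):
    """Yield ((left, right), same) batches; same[i] is 1 for a same-class pair."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    classes = list(by_class)  # needs at least two classes for negative pairs

    while True:
        left, right, same = [], [], []
        for n in range(batch_size):
            a = rng.randrange(len(images))
            if n % 2 == 0:
                # Positive pair: second image drawn from the same class.
                b = rng.choice(by_class[labels[a]])
                target = 1
            else:
                # Negative pair: second image drawn from a different class.
                other = rng.choice([c for c in classes if c != labels[a]])
                b = rng.choice(by_class[other])
                target = 0
            left.append(augment(images[a]))
            right.append(augment(images[b]))
            same.append(target)
        yield (left, right), same


images = ["cat0", "cat1", "dog0", "dog1"]
labels = [0, 0, 1, 1]
gen = pair_generator(images, labels, batch_size=4, seed=0)
(left, right), same = next(gen)
```

A positive pair may contain the same source image twice; with random augmentation applied to each side independently that is usually acceptable, but it can be excluded if needed.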

Extract the “path” of a data point through a decision tree in sklearn

I'm working with decision trees in Python's scikit-learn. Unlike many use cases for this, I'm less interested in the accuracy of the classifier at this point than in extracting the specific path a data point takes through the tree when I call .predict() on it. Has anyone done this before?
I know this can be done in R using rpart, but I am trying hard to do the same in Python. Any pointers would be helpful.
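
For reference, fitted scikit-learn trees have a decision_path(X) method that returns the visited nodes as a sparse indicator matrix, and the underlying arrays are available on clf.tree_ (children_left, children_right, feature, threshold). The sketch below walks such arrays by hand to show what the path means. The arrays are a hand-made toy tree, not output from a real model, and the function name walk_tree is made up.

```python
def walk_tree(children_left, children_right, feature, threshold, x):
    """Return the list of node ids a sample x visits, from root to leaf."""
    node = 0  # the root is node 0
    path = [node]
    # In scikit-learn's layout a leaf has children_left == -1.
    while children_left[node] != -1:
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
        path.append(node)
    return path


# Toy tree: node 0 splits on feature 0 at 0.5, node 2 splits on feature 1 at 2.0.
children_left = [1, -1, 3, -1, -1]
children_right = [2, -1, 4, -1, -1]
feature = [0, -2, 1, -2, -2]
threshold = [0.5, -2.0, 2.0, -2.0, -2.0]

print(walk_tree(children_left, children_right, feature, threshold, [0.9, 1.0]))
```

Each node id on the path can then be turned into a readable rule by looking up feature[node] and threshold[node].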

How to reuse the classifier in the pickled pipeline in sklearn?

I have read the answer in another post: https://stackoverflow.com/a/25794131/4566048
The classifier is pickled, but what about the TfidfVectorizer? How can I use it from the pickled pipeline? Since I need it to transform my feature vector, I still need to use it, right?
After some digging around, I seem to have solved the problem. I will answer my own question here in case it can help anyone with same doubt in the future.
I found that saving only the classifier is not enough; the CountVectorizer and TfidfTransformer used for feature vector extraction need to be saved as well for it to work.
Hope that helps!
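
To illustrate the point without depending on sklearn, here is a toy sketch in which the fitted state of the feature extraction step (a vocabulary) is pickled together with the classifier state (weights and bias). All names and numbers are made up. With a real sklearn Pipeline, pickling the fitted pipeline object as a whole should achieve the same thing, since it contains both the vectorizer and the classifier.

```python
import pickle


def vectorize(text, vocabulary):
    """Bag-of-words counts using a fixed, previously fitted vocabulary."""
    vec = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in vocabulary:
            vec[vocabulary[token]] += 1
    return vec


def predict(bundle, text):
    """Apply the saved vocabulary first, then the saved linear classifier."""
    vec = vectorize(text, bundle["vocabulary"])
    score = sum(w * v for w, v in zip(bundle["weights"], vec)) + bundle["bias"]
    return "spam" if score > 0 else "ham"


# Feature extraction state and classifier state are saved as one object.
bundle = {
    "vocabulary": {"free": 0, "meeting": 1, "win": 2},
    "weights": [1.5, -2.0, 1.0],
    "bias": -0.5,
}

blob = pickle.dumps(bundle)
restored = pickle.loads(blob)
print(predict(restored, "Win a free prize"))
```

If only the weights were restored, new text could not be mapped to the same feature positions, which is the failure described above.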
