Manually create an ML model in PySpark

Is there a way to manually create a OneHotEncoderModel without learning it?
This is quite a simple model, and the only learned parameter (as far as I understand) is "categorySizes", which can be accessed through the _java_obj. But I cannot find a way to set it without calling OneHotEncoder.fit(...) on a real dataset!
Sample code for what I want to achieve:
model=OneHotEncoderModel(input='dayOfWeek',output='dayOfWeek_1hot',categorySizes=[7])
model.transform(data)
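One workaround (a sketch, not a confirmed API for setting categorySizes directly): assuming Spark 3.x, where OneHotEncoder takes inputCols/outputCols and the fitted model exposes categorySizes, you can fit the encoder on a tiny dummy DataFrame that enumerates the categories, so the real dataset is never touched:

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# One row per category (0..6), so the encoder learns categorySizes == [7]
# without ever seeing the real data.
dummy = spark.createDataFrame([(float(i),) for i in range(7)], ["dayOfWeek"])
encoder = OneHotEncoder(inputCols=["dayOfWeek"], outputCols=["dayOfWeek_1hot"])
model = encoder.fit(dummy)

# model.categorySizes -> [7]; note dropLast=True by default, so the
# encoded vectors have length 6. The model can now transform real data:
# encoded = model.transform(data)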

Related

Django Wagtail dynamically create form without new model

How would I allow my primary users to dynamically create forms they can issue to their end clients? Each of my primary users has their own unique information they would like to collect that I do not know beforehand. I would like to avoid creating new models in code for their dynamic needs and then having to migrate those models.
I came across this, which had an interesting response, but it starts with a disclaimer:
The flexibility of Python and Django allow developers to dynamically create models to store and access data using Django’s ORM. But you need to be careful if you go down this road, especially if your models are set to change at runtime. This documentation will cover a number of things to consider when making use of runtime dynamic models.
This leads me to believe that a lot can go wrong.
However, because I'm using Wagtail, I believe there is probably a way to use StructBlocks & StreamFields to accomplish this.
Any guidance would be helpful.
Wagtail provides a form builder module for this purpose.
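A minimal sketch of that module (wagtail.contrib.forms); panel import paths vary across Wagtail versions, and the model names here are illustrative:

from django.db import models
from modelcluster.fields import ParentalKey
from wagtail.admin.panels import FieldPanel, InlinePanel  # wagtail.admin.edit_handlers on older versions
from wagtail.contrib.forms.models import AbstractEmailForm, AbstractFormField

class FormField(AbstractFormField):
    # Each row is one field an editor adds in the admin UI, so no new
    # Django models or migrations are needed per form.
    page = ParentalKey("FormPage", on_delete=models.CASCADE, related_name="form_fields")

class FormPage(AbstractEmailForm):
    thank_you_text = models.TextField(blank=True)

    content_panels = AbstractEmailForm.content_panels + [
        InlinePanel("form_fields", label="Form fields"),
        FieldPanel("thank_you_text"),
    ]

Editors can then add, remove, and reorder fields per page in the Wagtail admin, and submissions are stored by the module without any schema changes.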
I have two possible solutions for you, although it should be said that there is probably some Django library I don't know about that already does this. That being said:
Prompt your user for which fields they want and the field type.
Pass this as a dictionary to some function that would generate the HTML code for the form.
When this form is used, instead of worrying about storing the fields separately, store a dictionary on the model (see the sketch below).
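A minimal sketch of that dictionary approach, assuming Django 3.1+ for models.JSONField; the model and field names are illustrative:

from django.db import models

class DynamicFormDefinition(models.Model):
    # Field names mapped to field types,
    # e.g. {"favorite_color": "text", "age": "number"}
    fields = models.JSONField(default=dict)

class DynamicFormResponse(models.Model):
    definition = models.ForeignKey(DynamicFormDefinition, on_delete=models.CASCADE)
    # Answers stored as a dict keyed by field name, so no per-form migrations.
    answers = models.JSONField(default=dict)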
Another way you could do this, albeit more convoluted but perhaps more suited to your needs, is to use MongoDB as the database for Django instead. Because it is schemaless, it might be better suited to your use case. Instructions on using MongoDB with Django are here.

When using HuggingFace's Transformers library to run the GLUE benchmark, is it possible to load my own model from a PVC without using ModelHub?

So HuggingFace's Transformers library has a nice script here that one can use to test a model from their ModelHub against the GLUE benchmark. However, the model I wish to test has its weights stored in a PVC on my university's cluster, and I am wondering whether it is possible to load it directly from there, and if so, how.
Otherwise, could anyone point me in the direction of something which could do what I wish to do? Many thanks in advance!
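For what it's worth, the from_pretrained methods accept a local directory path, and the script's --model_name_or_path argument should accept one as well, so something like this sketch may work (the PVC mount path is hypothetical):

# Load directly from a checkpoint directory mounted from the PVC;
# /mnt/pvc/my-model is a hypothetical mount point.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "/mnt/pvc/my-model"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)

# The same path can be passed to the example script, e.g.:
#   python run_glue.py --model_name_or_path /mnt/pvc/my-model \
#       --task_name mrpc --do_eval --output_dir /tmp/glue_out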

LOESS in Spark/PySpark

I was wondering whether LOESS (locally estimated scatterplot smoothing) regression is a function built into Spark/PySpark (I'm more interested in the PySpark answer, but both would be interesting).
I did some research and couldn't find one, so I decided to try to code it myself using pandas UDFs. While doing so, when I displayed a scatter plot of the synthetic data I had created to start testing my algorithm, Azure Databricks (on which I'm coding) offered to automatically compute/display the LOESS of my dataset.
So maybe there is indeed a built-in LOESS that I just couldn't find? If not (and this is purely a Databricks feature), is there any way to access the result of Databricks's LOESS computation, or the function Databricks is using to do it?
Thank you in advance :)
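As far as I can tell there is no built-in LOESS in Spark ML itself, so the pandas-UDF route seems reasonable. A minimal sketch using a grouped-map UDF that calls statsmodels' lowess (column names and frac are illustrative, and statsmodels must be installed on the cluster):

import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# df is assumed to be a Spark DataFrame with columns group, x, y.
def loess_smooth(pdf: pd.DataFrame) -> pd.DataFrame:
    # Fit LOESS on one group's points and append the smoothed values.
    pdf["y_loess"] = lowess(pdf["y"], pdf["x"], frac=0.3, return_sorted=False)
    return pdf

result = df.groupBy("group").applyInPandas(
    loess_smooth,
    schema="group string, x double, y double, y_loess double",
)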

Building a collaborative filtering recommendation engine using Spark mlLib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data with quite good results (MSE ~ 0.9). Some of the specific questions I have are:
How to make recommendations for users who have not done any activity on the site? Isn't there some API call for popular items that would give me the most popular items based on user actions? One way to do it is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How to reload the model after some data has been added to the input file? I am trying to reload the model using another function, which tries to save the model, but it fails with org.apache.hadoop.mapred.FileAlreadyExistsException. One way to do it is to listen for incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>"), and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful, if I could get some direction here.
For the first part, you can count, for each item_id, the number of times that item appeared; you can use Spark's map and reduceByKey functions for that (a sketch follows below). After that, take the top 10/20 items with the highest counts. You can also weight items according to their recency.
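A minimal sketch of that popularity fallback, assuming ratings is an RDD of (user_id, item_id, rating) tuples as in the MLlib example; names are illustrative:

item_counts = (
    ratings.map(lambda r: (r[1], 1))         # (item_id, 1)
           .reduceByKey(lambda a, b: a + b)  # (item_id, count)
)
top_items = item_counts.takeOrdered(20, key=lambda kv: -kv[1])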
For the second part, you can save the model under a new name every time. I generally create a folder name on the fly using the current date and time, and use that same name to reload the model from the saved folder (see the sketch below). You will always have to train the model again on the past data plus the newly received data, and then use that model to predict.
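A sketch of that timestamped save/reload for an MLlib ALS model; the path is illustrative:

import time
from pyspark.mllib.recommendation import MatrixFactorizationModel

# A unique folder per save avoids the FileAlreadyExistsException.
path = "target/tmp/myCollaborativeFilter-%d" % int(time.time())
model.save(sc, path)
same_model = MatrixFactorizationModel.load(sc, path)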
Independent of platforms like Spark, there are some very good link-prediction techniques (for example, non-negative matrix factorization) that predict links between two sets.
Other very effective techniques for recommendation are:
1. Thompson sampling, 2. MAB (multi-armed bandits). A lot depends on the raw dataset: how is your raw dataset distributed? I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward from there.
Again, all these techniques are independent of the platform. I would also recommend starting from scratch instead of using platforms like Spark, which are only useful for large datasets. You can always move to those platforms later for scalability.
Hope it helps!

dynamic ORM in node.js+mongodb

Is it possible to create a model where the relationships are dynamically generated by the application?
I saw the KeystoneJS project that does a nice job of defining the model (see: http://keystonejs.com/docs/database/#relationship-definitions)
But these need to be defined in node; I'm interested in creating them within the application. Are there any ORMs or framework projects that already do that? I've seen frameworks like MODxCMS that allow users to create additional fields by putting everything from the custom (templatevar) values into one table. I think MongoDB would be great for setting this up without this single-table approach.
Any idea how to go about setting this kind of system up? I'm not sure where to start.
I guess mongoose might help you here. And you may want to have a look at mongo-relation too.
