Is MLeap actually a serialization "format"? - apache-spark

I began working with MLeap as a serialization tool that lets me save a model trained in Spark or scikit-learn and load it for inference with MLeap Runtime. It works well.
Now my goal is to load a model saved with MLeap into my own Java code and my own structures, without MLeap Runtime. I investigated a bit and haven't found any "format definition" or "schema", only examples that show what some serialized models look like. From that perspective MLeap looks like just a serialization/deserialization tool, not a "format" as declared on the main page of the documentation.
So, is MLeap a "format" or just a serialization tool? Can I find a format definition or schema somewhere?
Put differently, I want to understand whether it's possible to write a custom serialization/deserialization tool for the MLeap format, or whether the only option is to use the MLeap tools.

I would say that MLeap is a framework for putting models into production without the overhead of the frameworks in which you trained them, which gives you the desired low latency. De-/serialization is definitely an important part of that, and you do in fact have some freedom in how your pipelines are stored.
I recommend having a look at the bundles (zip files) you create with MLeap, which contain the exported pipelines. Most of the serializations are easy to comprehend: a logistic regression, for example, is contained in a JSON file that has the identifier of the pipeline element and the coefficients, basically what defines the logistic regression model.
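As a quick way to peek inside a bundle, here is a minimal sketch using only the Python standard library. It assumes the bundle was exported as a zip archive and that the stage metadata is stored as JSON files; the exact file layout varies by MLeap version, and the path my_pipeline.zip is made up for illustration:

```python
import json
import zipfile

# Hypothetical path to a bundle exported from Spark/scikit-learn via MLeap.
BUNDLE_PATH = "my_pipeline.zip"

with zipfile.ZipFile(BUNDLE_PATH) as bundle:
    # List every file in the bundle to see how the pipeline is laid out.
    for name in bundle.namelist():
        print(name)

    # Print the JSON entries (e.g. a logistic regression's model file) to see
    # the identifiers and coefficients that define each pipeline stage.
    for name in bundle.namelist():
        if name.endswith(".json"):
            with bundle.open(name) as f:
                print(name, json.dumps(json.load(f), indent=2)[:500])
```

Reading the bundle this way is also a reasonable starting point if you decide to write your own deserializer: the JSON entries you see there are, in practice, the closest thing to a format definition.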

Related

How can I get hyperparameters about FashionMNIST from dataset?

I am new to deep learning and I want to do something with Fashion-MNIST.
I found that the "transform" parameter is optional and accepts a callable, and that ToTensor() is one such callable.
What else can I pass as the transform argument, and where do I find the options?
I am looking at:
https://pytorch.org/vision/stable/datasets.html#fashion-mnist
but I couldn't find an answer there. Help me please. Thank you.
As you noted, transform accepts any callable. As there are a lot of transformations that are commonly used by the broader community, many of them are already implemented by libs such as torchvision, torchtext, and others. As you intend to work on FashionMNIST, you can see a list of vision-related transformations in torchvision.transforms:
Transforms are common image transformations. They can be chained together using Compose. Most transform classes have a function equivalent: functional transforms give fine-grained control over the transformations. This is useful if you have to build a more complex transformation pipeline (e.g. in the case of segmentation tasks).
You can check more transformations in other vision libs, such as Albumentations.
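For example, here is a minimal snippet that composes two torchvision transforms and passes them to the FashionMNIST dataset; the normalization values are illustrative, not the dataset's official statistics:

```python
from torchvision import datasets, transforms

# Chain two transforms: convert the PIL image to a tensor, then normalize it.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),  # illustrative mean/std for 1 channel
])

train_set = datasets.FashionMNIST(
    root="data", train=True, download=True, transform=transform
)

image, label = train_set[0]
print(image.shape, label)  # torch.Size([1, 28, 28]) and an integer class label
```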

What can I do with DSL languages generated inside JetBrains MPS?

A couple of hours ago I started reading about DSL modeling.
Right now I'm tied to using the JetBrains MPS IDE or its plugin for JetBrains IntelliJ IDEA, and I'd like to know how I can export those DSL models to something usable by, for example, console applications (in case that is possible or even makes sense).
You can do several things already in MPS without exporting the models:
Analyze the models to check for errors, business rule violations or inconsistencies.
Interpret the models, then display the result of the interpretation directly in MPS. This is useful if you implement a specification plus an example/test of that specification: you can then run the tests in MPS and show the results as a green/red highlight, for example.
Define a generator to translate the model into text (executable code or input for a tool such as Liquibase to create database schemas for example).
If you're looking to export your data from MPS for use in a different application, there are two approaches I would recommend:
The simplest way: NodeSerializer from MPS-extensions. I have more details on how to use it in a blog post. This lets you quickly export your data in a rather nice XML structure.
The most flexible approach: writing a custom exporter that uses the MPS Open API to recursively traverse a node tree. You can output any format you want (XML, JSON, YAML, etc.) and customize the output as you like; a sketch of that traversal is shown below.
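To illustrate the shape of such an exporter, here is a minimal sketch in Python rather than the Java/Kotlin you would actually write against the MPS Open API. The Node class, its fields, and the example tree are hypothetical stand-ins for MPS node objects (SNode in the Open API):

```python
import json

# Hypothetical, minimal stand-in for an MPS node: in the real Open API you
# would walk SNode instances (concept, properties, children) from Java/Kotlin.
class Node:
    def __init__(self, concept, properties=None, children=None):
        self.concept = concept
        self.properties = properties or {}
        self.children = children or []

def to_dict(node):
    # Recursively convert the node tree into plain dicts/lists,
    # which can then be dumped as JSON, YAML, XML, etc.
    return {
        "concept": node.concept,
        "properties": node.properties,
        "children": [to_dict(child) for child in node.children],
    }

# Made-up example tree standing in for a model authored in your DSL.
root = Node("StateMachine", {"name": "Door"}, [
    Node("State", {"name": "Open"}),
    Node("State", {"name": "Closed"}),
])

print(json.dumps(to_dict(root), indent=2))
```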
Here are two more approaches that you could be considering but that I would NOT recommend:
Accessing the model (*.mps) files directly. While they are already in XML format, their structure is adapted to MPS' needs. It is normalized, meaning that a given piece of information is generally only stored once, and it also encodes node IDs in a particular way to save space. The format is also undocumented and could change in the future (although it hasn't changed for the past several years).
Using the MPS generator to convert your DSL to MPS' built-in XML language, jetbrains.mps.core.xml. I don’t recommend using the MPS generator because the generator’s sweet spot is translating between two different MPS languages, e.g. from your custom DSL to Java. If you try writing generator rules to convert anything to XML you would hit a few problems that are possible to overcome but totally unnecessary.
You can define a generator which transforms a sentence (file, AST) of your language into another MPS language. The target language must exist in MPS first.
Alternatively, you could generate text with the TextGen aspect, but that is more suitable for just printing the textual representation of your language. If you would like something more sophisticated (like generating text code in another language), you can use the plaintextgen language from MPS-extensions or the mbeddr.platform.
If you want to input (import) a textual program into MPS, you can code a paste handler where you plug in your parser, or you can change the format in which the AST is stored (from XML to, perhaps, your language directly, though reading that back would again require a parser) with custom persistence.
I am currently working on a solution that enables importing an MPS language from a YAJCo model (a model-based parser generator where the input is not a grammar but Java classes representing the semantic model). You can then import a sentence (file), which creates and populates a model (AST). From the program in MPS you can generate Java source code that fills the original Java classes. So if you want a textual MPS language, want to use the IDE, and then want to export the AST into Java objects you can use, maybe YtM is for you.

What are the differences between torch.jit.trace and torch.jit.script in torchscript?

TorchScript provides torch.jit.trace and torch.jit.script to convert PyTorch code from eager mode to script mode. From the documentation I understand that torch.jit.trace cannot handle control flow and other Python data structures, and that torch.jit.script was developed to overcome those limitations of torch.jit.trace.
But it looks like torch.jit.script works for all cases, so why do we need torch.jit.trace?
Please help me understand the difference between these two methods.
If torch.jit.script works for your code, then that's all you should need. Code that uses dynamic behavior such as polymorphism isn't supported by the compiler torch.jit.script uses, so for cases like that, you would need to use torch.jit.trace.
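To make the difference concrete, here is a small example using the standard PyTorch API. Tracing bakes in whichever branch the example input takes, while scripting preserves the control flow:

```python
import torch

class MyModule(torch.nn.Module):
    def forward(self, x):
        # Data-dependent control flow.
        if x.sum() > 0:
            return x + 1
        return x - 1

m = MyModule()
example = torch.ones(3)               # takes the "x + 1" branch

scripted = torch.jit.script(m)        # compiles both branches
traced = torch.jit.trace(m, example)  # records only the branch taken for `example`

negative = -torch.ones(3)
print(scripted(negative))  # tensor([-2., -2., -2.])  correct "x - 1" branch
print(traced(negative))    # tensor([0., 0., 0.])     still applies "x + 1"
```

Tracing remains useful when a model has no data-dependent control flow, because it records ordinary Python execution and therefore tolerates dynamic behavior (such as the polymorphism mentioned above) that the script compiler cannot handle.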

Distributed Rules Engine

We have been using Drools engine for a few years now, but our data has grown, and we need to find a new distributed solution that can handle a large amount of data.
We have complex rules that look over a few days of data, and that's why Drools was a great fit for us: we simply kept all the data in memory.
Do you have any suggestions for something similar to drools but distributed/scalable?
I did some research on the matter, and I couldn't find anything that meets our requirements.
Thanks.
Spark provides a faster application of Drools rules to the data than traditional single-node applications. The reference architecture for the Drools-Spark integration could be along the following lines. In addition, HACEP is a scalable and highly available architecture for Drools Complex Event Processing; it combines Infinispan, Camel, and ActiveMQ. Please refer to the following article for more on HACEP with Drools.
You can find a reference implementation of Drools - Spark integration in the following GitHub repository.
In my experience, Drools can be applied efficiently even to huge volumes of data (some tuning may be needed depending on your requirements), and it is easily integrated with Apache Spark. Loading your rule file into memory for Spark processing takes very little memory, and Drools can be used with both Spark Streaming and Spark batch jobs.
See my complete article for reference and give it a try.
An alternative to it might be JESS:
JESS implements the Rete Engine and accepts rules in multiple formats including CLIPS and XML.
Jess uses an enhanced version of the Rete algorithm to process rules. Rete is a very efficient mechanism for solving the difficult many-to-many matching problem.
Jess has many unique features including backwards chaining and working memory queries, and of course Jess can directly manipulate and reason about Java objects. Jess is also a powerful Java scripting environment, from which you can create Java objects, call Java methods, and implement Java interfaces without compiling any Java code.
Try it yourself.
Maybe this could be helpful to you. It is a new project developed as part of the Drools ecosystem. https://github.com/kiegroup/openshift-drools-hacep
It seems like Databricks is also working on a rules engine, so if you are using the Databricks version of Spark, it is something to look into.
https://github.com/databrickslabs/dataframe-rules-engine
Take a look at https://www.elastic.co/blog/percolator
What you can do is convert your rules into Elasticsearch queries. You can then percolate your data against the percolator, which will return the rules that match the provided data.
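As a rough sketch of that idea, the snippet below uses the standard percolate query against a local Elasticsearch instance via its REST API. The index name, rule id, and the temperature field are made up for illustration:

```python
import json
import requests

ES = "http://localhost:9200"          # assumed local Elasticsearch instance
HEADERS = {"Content-Type": "application/json"}

# 1. Create an index whose mapping has a "percolator" field for storing rules,
#    plus the document fields the rules refer to (here a numeric "temperature").
requests.put(f"{ES}/rules", headers=HEADERS, data=json.dumps({
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "temperature": {"type": "integer"},
        }
    }
}))

# 2. Store a rule as a query document, e.g. "temperature above 40".
requests.put(f"{ES}/rules/_doc/high-temp?refresh", headers=HEADERS, data=json.dumps({
    "query": {"range": {"temperature": {"gt": 40}}}
}))

# 3. Percolate an incoming piece of data: the hits are the rules it matches.
resp = requests.post(f"{ES}/rules/_search", headers=HEADERS, data=json.dumps({
    "query": {"percolate": {"field": "query", "document": {"temperature": 45}}}
}))
print([hit["_id"] for hit in resp.json()["hits"]["hits"]])  # ['high-temp']
```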

Mahout recommender, Flink, Spark MLLib, 'gray box'

I'm new to Mahout-Samsara and I'm trying to understand the "domain" of the different projects and how they relate to each other.
I understand that Apache Mahout-Samsara deprecates many MapReduce algorithms, and that things will be based on Apache Flink, Spark, or other engines like H2O (based on the introduction of the "Apache Mahout: Beyond MapReduce" book).
I want to try some recommender algorithms but I'm not so sure about what's new and what's 'deprecated'. I see the following links,
Mahout Recommender overview
Mahout Coocurrence intro
referring to spark-rowsimilarity and spark-itemsimilarity. (I don't understand whether these links are talking about an off-the-shelf algorithm or a design... it's probably a design, because they are not listed at mahout.apache.org/users/basics/algorithms.html... anyway.)
And at the same time, Apache Flink (or is it Spark MLLib?) implements the ALS algorithm for recommendation (Machine Learning for Flink and Spark MLlib).
General questions:
Is it that these algorithms from mahout.apache.org are deprecated and they are being migrated to Flink / Spark MLLib, so that the ML library and support at Flink / Spark MLLib will grow?
Is Flink / Spark MLLib intended to be more an engine or engine + algorithm library with good support for the algorithms?
Other links to help the conversation:
Flink Vision and Roadmap
Mahout Algorithms
Specific question:
I want to try a recommender algorithm as a 'gray box' (part 'black box' because I don't want to get too deep into the math, part 'white box' because I want to tweak the model and the math to the extent that I need to improve results).
I'm not interested in other ML algorithms yet. I thought about starting with what's off-the-shelf and then changing the ALS implementation of MLLib. Would that be a good approach? Any other suggestions?
I've been working on ML on Flink for a while now, I'm doing my fair share of scouting, and I'm monitoring what is going on in this ecosystem. What you're asking implies a rational coordination between projects that simply doesn't exist. Algorithms get reimplemented over and over, and from what I see it's easier to do that than to integrate with different frameworks. Samsara is actually one of the most portable solutions, but it's good for only a few applications.
Is it that these algorithms from mahout.apache.org are deprecated and they are being migrated to Flink / Spark MLLib, so that the ML library and support at Flink / Spark MLLib will grow?
This, as I said, would require a level of coordination between projects that simply isn't there.
Is Flink / Spark MLLib intended to be more an engine or engine + algorithm library with good support for the algorithms?
In an ideal ecosystem they would be just engines, but they will keep building their own ML libraries for commercial purposes: computing engines with ML libraries out of the box sell really well. Actually, I'm working full time on Flink ML not because I believe it's necessarily the best way to do ML on Flink, but because, right now, it's something Flink needs in order to be sold in many environments.
#pferrel suggested PredictionIO, which is an excellent piece of software, but there are many alternatives under development: for example, Beam is designing a machine learning API that generalizes over different runners' implementations (Flink, Spark, H2O, and so on). Another alternative is data analysis platforms like KNIME, RapidMiner and others, which can build pipelines over Spark or other big data tools.
spark-itemsimilarity and spark-rowsimilarity are command-line-accessible drivers. They are based on classes in Mahout-Samsara. The description of these is for running code supported since v0.10.0.
The link https://mahout.apache.org/users/basics/algorithms.html shows which algos are supported on which "compute-engine". Anything in the "Mapreduce" column is in line for deprecation.
That said, Mahout-Samsara is less a collection of algorithms than pre-0.10.0 Mahout was. It now has an R-like DSL, which includes generalized tensor math, from which most of the Mahout-Samsara algorithms have been built. So think of Mahout as a "roll-your-own math and algorithm" tool, where every product is scalable on your choice of compute engine. The engines themselves are also available natively, so you don't have to use only the abstracted DSL.
Regarding how Mahout-Samsara relates to MLlib or any algo lib, there will be overlap and either can be used in your code interchangeably.
Regarding recommenders, the new SimilarityAnalysis.cooccurrence implements a major innovation, called cross-occurrence that allows a recommender to ingest almost anything known about a user or user's context and even accounts for item-content similarity. The Mahout-Samsara part is the engine for Correlated Cross-Occurrence. See some slides here describing the algorithm: http://www.slideshare.net/pferrel/unified-recommender-39986309
There is a full, end-to-end implementation of this using the PredictionIO framework (PIO itself is now a proposed Apache incubator project) that is mature and can be installed using these instructions: https://github.com/actionml/cluster-setup/blob/master/install.md
