Design application using Apache Spark

It's a somewhat architectural question. I need to design an application using Spark and Scala as the primary tools, and I want to minimise manual intervention as much as possible.
I will receive a zip containing multiple files with different structures as input at a regular interval, say daily. I need to process it using Spark, and after the transformations move the data to a back-end database.
I wanted to understand the best way to design the application.
What would be the best way to process the zip?
Can Spark Streaming be considered an option, given the frequency of the files?
What other options should I take into consideration?
Any guidance would be really appreciated.

It's a broad question: there are batch options and streaming options, and I'm not sure of your exact requirements. You may start your research here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
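Given a daily drop, a scheduled batch job is usually simpler than a streaming job, although the file source linked above also works. Spark has no built-in codec for zip archives, so a common pattern is to read the archive as binary and unpack it with java.util.zip. Here is a minimal sketch of that batch route in Scala; the paths, staging table, and JDBC target are assumptions for illustration only.

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object DailyZipJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-zip-ingest").getOrCreate()
    import spark.implicits._

    // Hypothetical landing directory where the daily zip is dropped.
    val inputPath = "hdfs:///landing/daily/*.zip"

    // Spark cannot split a zip, so each archive is read as one binary file
    // and unpacked entry by entry with java.util.zip.
    val lines = spark.sparkContext.binaryFiles(inputPath).flatMap {
      case (_, stream: PortableDataStream) =>
        val zis = new ZipInputStream(stream.open())
        Iterator.continually(zis.getNextEntry).takeWhile(_ != null).flatMap { entry =>
          // Materialise each entry before moving to the next one so the lazy
          // iterator does not race with getNextEntry advancing the stream.
          Source.fromInputStream(zis).getLines()
            .map(line => (entry.getName, line)).toList
        }.toList
    }

    // Tagging lines with the entry name lets files with different structures
    // be routed to different parsers/transformations later on.
    val df = lines.toDF("fileName", "rawLine")

    // After the per-file parsing and transformations, push to the back-end
    // database over JDBC (URL, table, and credentials are placeholders).
    df.write.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
      .option("dbtable", "staging.raw_lines")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()
  }
}

A job like this can simply be triggered once a day by an external scheduler (cron, Airflow, etc.), which keeps the design simple and avoids keeping a streaming job alive for a file that arrives once a day.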


Why doesn't spark add performance configurations by default?

I was reading about some Spark optimization techniques and found some configurations that we need to enable, such as
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.adaptive.enabled",true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled",true)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled",true)
Can I enable these for all my Spark jobs, even if I don't need them? What are the downsides of including them? And why doesn't Spark provide this performance by default? When should I use what?
Spark does not turn these features on because they carry a little more risk than not using them. To keep the platform as stable as possible, they are not enabled by default.
One thing that is called out, notably by Databricks, is that the CBO relies heavily on table statistics, so you need to refresh those statistics regularly when your data changes significantly. I have hit edge cases where I had to disable the CBO for my queries to complete. (I believe this was related to a badly calculated map-side join.)
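For example, the statistics the CBO depends on can be refreshed with ANALYZE TABLE before the flags are turned on. A minimal sketch (the table and column names are made up):

// Refresh table-level and column-level statistics so the cost-based
// optimizer has up-to-date row counts and column NDVs to plan with.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, order_date")

// Only with fresh statistics in place do the CBO flags pay off:
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)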
The same is true of spark.sql.adaptive.skewJoin.enabled. It only helps if the table stats are up to date and you actually have skew; with out-of-date stats it could make your query take longer.
spark.sql.adaptive.coalescePartitions.enabled also looks great, but it should be used for specific types of performance tuning. There are knobs and levers here that can be used to drive better performance.
These settings are generally helpful, but they can also cover up a problem that you might want to be aware of. Yes, they are useful, and yes, you should use them, but perhaps you should leave them off until you need them. Often you get better performance by tuning the algorithm of your Spark job, by understanding it and what it is doing. If you turn all of this on by default, you may not develop as in-depth an understanding of the implications of your choices.
(Java and Python do not force you to manage memory. The implications of what you use and their effect on performance are frequently learned the hard way, with a performance issue that sneaks up on new developers.) This is a similar lesson, but slightly more sinister: now that there are switches to auto-fix your bad queries, will you really learn to be an expert without understanding their value?
TLDR: Don't turn these on until you need them, or turn them on when you need to do something quick and dirty.
I hope this helps your understanding.

Distributed Rules Engine

We have been using the Drools engine for a few years now, but our data has grown and we need to find a new distributed solution that can handle a large amount of data.
We have complex rules that look over a few days of data, and that is why Drools was a great fit for us: we simply kept our data in memory.
Do you have any suggestions for something similar to Drools but distributed/scalable?
I did some research on the matter, but I couldn't find anything that meets our requirement.
Thanks.
Spark can apply Drools rules to the data faster than a traditional single-node application. The reference architecture for a Drools-Spark integration could be along the following lines. In addition, HACEP is a scalable and highly available architecture for Drools Complex Event Processing; it combines Infinispan, Camel, and ActiveMQ. Please refer to the article on HACEP using Drools.
You can find a reference implementation of the Drools-Spark integration in the corresponding GitHub repository.
In my experience, Drools can be applied efficiently even to huge volumes of data (maybe with some tuning depending on your requirements), and it integrates easily with Apache Spark. Loading your rule file into memory for Spark processing takes very little memory, and Drools can be used with Spark Streaming as well as with Spark batch jobs.
See my complete article for reference and give it a try.
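As a rough sketch of that kind of Drools-Spark integration, the rule base can be built once per partition inside mapPartitions, so only the data is shuffled and the rules are loaded from the classpath on each executor. The fact class, paths, and KIE setup (a kmodule.xml with a default session) are assumptions for illustration:

import scala.beans.BeanProperty
import org.apache.spark.sql.SparkSession
import org.kie.api.KieServices

// Hypothetical fact class; Drools rules would read and modify it,
// hence the JavaBean-style accessor on the mutable field.
case class Transaction(id: String, amount: Double, @BeanProperty var flagged: Boolean = false)

object DroolsOnSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("drools-on-spark").getOrCreate()
    import spark.implicits._

    val transactions = spark.read.parquet("hdfs:///data/transactions").as[Transaction]

    val evaluated = transactions.mapPartitions { rows =>
      // One rule session per partition: the .drl files come from the
      // classpath (kmodule.xml), so no rule objects are serialized with the job.
      val kieContainer = KieServices.Factory.get().getKieClasspathContainer
      val session = kieContainer.newKieSession() // default session from kmodule.xml
      val facts = rows.toList
      facts.foreach(session.insert(_))
      session.fireAllRules() // rules may set flagged = true on matching facts
      session.dispose()
      facts.iterator
    }

    evaluated.filter(_.flagged).write.mode("overwrite").parquet("hdfs:///data/flagged")
  }
}

For rules that look over several days of data, the partitioning key matters: the facts a rule needs to see together (for example, all events of one customer for the window) have to end up in the same partition before fireAllRules is called.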
An alternative to it might be JESS.
JESS implements the Rete Engine and accepts rules in multiple formats including CLIPS and XML.
Jess uses an enhanced version of the Rete algorithm to process rules. Rete is a very efficient mechanism for solving the difficult many-to-many matching problem.
Jess has many unique features including backwards chaining and working memory queries, and of course Jess can directly manipulate and reason about Java objects. Jess is also a powerful Java scripting environment, from which you can create Java objects, call Java methods, and implement Java interfaces without compiling any Java code.
Try it yourself.
Maybe this could be helpful to you. It is a new project developed as part of the Drools ecosystem. https://github.com/kiegroup/openshift-drools-hacep
It seems Databricks is also working on a rules engine, so if you are using the Databricks distribution of Spark, it is something to look into.
https://github.com/databrickslabs/dataframe-rules-engine
Take a look at https://www.elastic.co/blog/percolator
What you can do is convert your rules into Elasticsearch queries. You can then percolate your data against the percolator, which will return the rules that match the provided data.
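As a sketch of what that looks like on the wire (assuming a local Elasticsearch node at http://localhost:9200; the index and rule names are made up), each rule is stored as a query in a percolator-typed field and incoming data is then percolated against it:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpRequest.BodyPublishers
import java.net.http.HttpResponse.BodyHandlers

object PercolatorSketch extends App {
  private val client = HttpClient.newHttpClient()

  // Tiny helper: send a JSON body to the local Elasticsearch node.
  private def call(method: String, path: String, json: String): String = {
    val request = HttpRequest.newBuilder(URI.create(s"http://localhost:9200$path"))
      .header("Content-Type", "application/json")
      .method(method, BodyPublishers.ofString(json))
      .build()
    client.send(request, BodyHandlers.ofString()).body()
  }

  // 1. An index whose "query" field has the percolator mapping type.
  call("PUT", "/rules", """{ "mappings": { "properties": {
    "query":  { "type": "percolator" },
    "amount": { "type": "double" } } } }""")

  // 2. Each business rule is stored as a query document.
  call("PUT", "/rules/_doc/high-amount",
    """{ "query": { "range": { "amount": { "gt": 10000 } } } }""")

  // 3. Percolate incoming data: the hits are the rules that match it.
  val matches = call("POST", "/rules/_search",
    """{ "query": { "percolate": { "field": "query", "document": { "amount": 15000 } } } }""")
  println(matches)
}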

How can I know if Apache Spark is the right tool?

Just wondering: is there a list of questions to ask ourselves in order to know whether Spark is the right tool or not?
Once again I spent part of the week implementing a POC with Apache Spark in order to compare its performance against pure Python code, and I was baffled when I saw a 1/100 ratio (in favor of Python).
I know that Spark is a "big data" tool and everyone keeps saying "Spark is the right tool to process TB/PB of data", but I think that is not the only thing to take into account.
In brief, my question is: given small data as input, how can I know whether the computation will be heavy enough that Spark can actually improve things?
I'm not sure if there is such a list, but if there were, the first question would probably be:
Does your data fit on a single machine?
And if the answer is 'Yes', you do not need Spark.
Spark is designed to process amounts of data that cannot be handled by a single machine, in a fault-tolerant manner, as an alternative to Hadoop.
There are lots of overheads associated with operating in a distributed manner, such as fault tolerance and the network, and they cause the apparent slow-down compared to traditional tools on a single machine.
Just because Spark can be used as a parallel processing framework on a small dataset does not mean it should be used that way. You will get faster results and less complexity by using, say, Python and parallelizing your processing with threads.
Spark excels when you have to process a dataset that does not fit onto a single machine, when the processing is complex and time-consuming, and when the probability of encountering an infrastructure issue is high enough that a failure would mean starting again from scratch.
Comparing Spark to native Python is like comparing a locomotive to a bicycle. A bicycle is fast and agile, until you need to transport several tonnes of steel from one end of the country to the other: then - not so fun.

Statistics with HDFS data

In our company we use HDFS. So far everything has worked out, and we can extract data using queries.
In the past I worked a lot with Project R; it was always great for my analyses. So I checked Project R and its support for HDFS (rbase, rhdfs, ...).
Nevertheless, I am a little bit confused, since I found tons of tutorials in which the analyses are done on simple data saved in CSV files. Don't get me wrong, that's fine, but I want to ask whether it is possible to write queries, extract the data, and do some statistics in one run.
Or in other words: When we talk about statistics for data stored in HDFS, how do you handle this?
Thanks a lot, and hopefully some of you can help me see the pros and cons here.
All the best -
Peter
You might like to check out Apache Hive and Apache Spark. There are many other options, but I am not sure whether you are asking how to work on data from HDFS when the data is not handed to you in a file.
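For instance, with Spark you can read directly from HDFS, filter with a query, and compute the statistics in the same run. Spark also ships an R front end (SparkR), but to stay consistent with the rest of this page the sketch below is Scala; the path, format, and column names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HdfsStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-stats").getOrCreate()

    // Read structured data straight from HDFS (format and path assumed).
    val events = spark.read.parquet("hdfs:///warehouse/events")

    // "Query" step: register a view and filter with SQL.
    events.createOrReplaceTempView("events")
    val recent = spark.sql(
      "SELECT user_id, duration FROM events WHERE event_date >= '2023-01-01'")

    // "Statistics" step in the same run: count, mean, stddev, min, max.
    recent.describe("duration").show()

    // Or grouped statistics, closer to what one would do in R.
    recent.groupBy("user_id")
      .agg(avg("duration"), stddev("duration"), count(lit(1)).as("n"))
      .show()
  }
}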

Examples of simple stats calculation with hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good examples of how to calculate mean and variance using Hadoop, and/or provide some sample code?
Thanks
Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately it didn't have variance last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to compute in a numerically stable way over huge data sets, so take care!
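To illustrate the stability point: the usual trick is to keep per-partition partial results (count, mean, second central moment) and merge them with the Chan et al. formula, instead of accumulating a naive sum of squares that cancels catastrophically. A sketch in Scala, which could back a combiner/reducer; the names are mine:

// Partial statistics for one chunk/partition of the data.
final case class Moments(n: Long, mean: Double, m2: Double) {
  def variance: Double = if (n > 1) m2 / (n - 1) else 0.0 // sample variance
}

object Moments {
  val empty: Moments = Moments(0L, 0.0, 0.0)

  // Welford's update: fold a single value into a running summary.
  def add(s: Moments, x: Double): Moments = {
    val n = s.n + 1
    val delta = x - s.mean
    val mean = s.mean + delta / n
    Moments(n, mean, s.m2 + delta * (x - mean))
  }

  // Chan et al. merge: combine two partial summaries (e.g. the outputs of
  // two mappers) without ever forming a naive sum of squares.
  def merge(a: Moments, b: Moments): Moments =
    if (a.n == 0) b
    else if (b.n == 0) a
    else {
      val n = a.n + b.n
      val delta = b.mean - a.mean
      Moments(n,
        a.mean + delta * b.n / n,
        a.m2 + b.m2 + delta * delta * a.n * b.n / n)
    }
}

// Example: per-partition folds followed by a merge, mimicking map + reduce.
object StableVarianceExample extends App {
  val part1 = Seq(1.0, 2.0, 3.0).foldLeft(Moments.empty)(Moments.add)
  val part2 = Seq(4.0, 5.0).foldLeft(Moments.empty)(Moments.add)
  val total = Moments.merge(part1, part2)
  println(s"n=${total.n} mean=${total.mean} variance=${total.variance}") // 5, 3.0, 2.5
}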
You might double-check whether your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these github projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).
