Spark, Storm or Flink - Big Data analysis - apache-spark

Can anyone recommend which technology to explore if I have a large data set in a Cassandra table (3-node cluster) and I need to perform a sum operation on records received on a daily basis? The value so calculated needs to be updated in a MySQL table.
Steps to perform -
1. Fetch ids from the MySQL table
2. Run the sum operation on the Cassandra table
3. Insert/update the calculated sum in the MySQL table
Currently I am using plain Java to perform these tasks with SQL and CQL queries, but it's very slow and the data will grow exponentially in the future.
Can anyone suggest technologies that can be explored to accomplish this task in the fastest possible way and with the lowest development time?

There's not much to recommend; it depends only on the task you have and your own preferences.
Apache Storm is a streaming engine; it would be a good fit if you wanted to process a stream of entries, not a batch of data as in your case.
Both Apache Spark and Apache Flink will let you run a batch job once a day or build a streaming application that calculates the results for one day.
I prefer Apache Spark, as it has a unified API for batch and streaming jobs (so you can easily move code from batch to streaming) and strong community support. Apache Flink supports real-time streaming, but that isn't necessary in your case.
However, you should look at these two frameworks on your own and choose the one that looks better to you. In my opinion, both of them will be fine.
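To make that concrete, here is a minimal sketch of the Spark batch variant, not a definitive implementation: it assumes the spark-cassandra-connector and a MySQL JDBC driver are on the classpath, and every host, keyspace, table and column name below is made up for illustration.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.sum

object DailySumJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-sum")
      .config("spark.cassandra.connection.host", "10.0.0.1") // assumed Cassandra contact point
      .getOrCreate()

    val jdbcUrl = "jdbc:mysql://mysql-host:3306/reports" // assumed MySQL connection details
    val jdbcProps = new java.util.Properties()
    jdbcProps.setProperty("user", "report_user")
    jdbcProps.setProperty("password", "secret")

    // 1. Fetch the ids from the MySQL table
    val ids = spark.read.jdbc(jdbcUrl, "tracked_ids", jdbcProps).select("id")

    // 2. Run the sum over the matching Cassandra records
    val records = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "events_ks", "table" -> "daily_records")) // assumed names
      .load()
    val sums = records.join(ids, "id").groupBy("id").agg(sum("amount").as("daily_sum"))

    // 3. Write the calculated sums back to MySQL (into a staging table that can be
    //    merged into the target table afterwards)
    sums.write.mode(SaveMode.Overwrite).jdbc(jdbcUrl, "daily_sums_staging", jdbcProps)

    spark.stop()
  }
}

Because the DataFrame code stays largely the same, this kind of job can later be turned into a streaming application without much rework.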

Related

What's the best way to rate limit a Spark application

I have an application that does the following:
Reads URLs from a Hive table
Creates HTTP requests from those URLs, hits a server with them and parses the responses
Writes the parsed responses to another Hive table
I would like to rate-limit the URLs sent to the server. Currently, to solve the problem I have added a sleep time after every request is sent to the server. The sleep time is calculated as: (no. of executors) * (no. of cores available for each executor) / (RPS intended)
This for some reason does not do any rate limiting, so I am looking for alternatives. From what I have found from this post, it seems Spark Streaming could be a good alternative if I could use the input Hive table as a streaming source and rate limit the reading.
I have read the documents but cannot figure out if a Hive table can be a streaming source. A file can be a streaming source, so I could always read the data from the Hive table, store it in a file and then use that as the streaming source, but I was wondering if it is possible to avoid this indirect route.
You aren't really using the right tool for the job here. Yes, Spark reads from Hive, but so do a lot of other tools. Spark is made for bulk processing, whether streaming or batch; rate control would require custom code.
You might look at other open source tools, like Apache NiFi, which knows how to work with Hive and also understands flow control. Here's a good discussion on how to control flow rate with NiFi.
Or look at Nutch, which was made to scrape the internet into Hadoop.
If you wanted to abuse Spark to do this, you might be able to do something with foreachPartition, repartitioning the data into smaller chunks and reducing the number of cores/executors so that the entire job takes longer to process... but you're really anti-optimizing at that point. Possible, but not really a good use of Spark.
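For what it's worth, here is a rough sketch of that last route; the table name, column name and requests-per-second target are all assumptions, not anything from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("throttled-fetch").getOrCreate()

val targetRps = 5   // assumed requests-per-second budget
val partitions = 4  // keep this at or below the total number of cores in use

spark.table("url_table") // assumed Hive table holding the URLs
  .select("url")
  .repartition(partitions)
  .rdd
  .foreachPartition { rows =>
    // with `partitions` tasks running concurrently, sleeping this long between
    // requests keeps the aggregate rate near targetRps
    val delayMs = (1000L * partitions) / targetRps
    rows.foreach { row =>
      val url = row.getString(0)
      // issueHttpRequest(url) // hypothetical HTTP call and response handling
      Thread.sleep(delayMs)
    }
  }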

Best failsafe strategy to store results of Spark SQL for structured streaming and OLAP queries

I would like to store the results of continuous queries running against streaming data in such a manner that the results are persisted across distributed nodes, to ensure failover and scalability.
Can Spark SQL experts please shed some light on
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure data is available for querying even if one node is down
- (3) internally, how does Spark SQL store the result set?
Thanks
Kaniska
It depends what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using the persist() API. If your data is small, then cache() on each RDD should give you a good result.
Store the data where your Spark executors are co-located. For example:
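A minimal sketch of that idea: resultStream here is a hypothetical DStream whose elements are case classes matching the target table, the keyspace/table names are assumed, and saveToCassandra comes from the spark-cassandra-connector.

import org.apache.spark.storage.StorageLevel
import com.datastax.spark.connector._

// resultStream: hypothetical DStream produced earlier in the application

// keep two in-memory/on-disk copies of each RDD of the stream for fast re-querying
resultStream.persist(StorageLevel.MEMORY_AND_DISK_2)

// additionally write every batch out to a replicated store co-located with the executors
resultStream.foreachRDD { rdd =>
  rdd.saveToCassandra("results_ks", "daily_results") // assumed keyspace and table
}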
It is also possible to use memory-based storage like Tachyon to persist your stream (i.e. each RDD of your stream) and query against it.
If latency is not an issue, then persist(MEMORY_AND_DISK_2) should give you what you need. Mind you, performance is hit or miss in that scenario. It also stores the data on two executors.
In other cases, if your clients are more comfortable with an OLTP-like database where they just need to query the constantly updating result, you can use a conventional database like Postgres or MySQL. This is a preferred method for many, as query time is consistent and predictable. If the result is not update-heavy but is partitioned (say, by time), then Greenplum-like systems are also an option.
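A sketch of that last route with Structured Streaming, where aggregatedStream stands for a hypothetical streaming DataFrame and the JDBC details are assumed: each micro-batch of the continuously updating result is appended to a relational table that the clients then query.

import org.apache.spark.sql.{DataFrame, SaveMode}

val jdbcUrl = "jdbc:postgresql://db-host:5432/reports" // assumed connection details
val props = new java.util.Properties()
props.setProperty("user", "report_user")
props.setProperty("password", "secret")

// writer for each micro-batch; typed val avoids overload ambiguity in Scala 2.12
val writeBatchToJdbc: (DataFrame, Long) => Unit = (batch, _) =>
  batch.write.mode(SaveMode.Append).jdbc(jdbcUrl, "olap_results", props)

// aggregatedStream: hypothetical streaming DataFrame with the continuously updated result
aggregatedStream.writeStream
  .outputMode("update")
  .foreachBatch(writeBatchToJdbc)
  .start()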

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data into Cassandra.
My requirement is to read the data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SSTables
3. Copy Command
4. Java Batch
5. Dataflow (a Google product)
Are there any other options, and which one is best?
Thanks,
For flat files, the two most effective options are:
Use Spark - it will load the data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and is very effective. DataStax's Academy blog just started a series of blog posts on DSBulk, and the first post will give you enough information to get started with it. Also, if you have big files, consider splitting them into smaller ones, as that will allow DSBulk to perform a parallel load using all available threads.
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep it updated as it changes in the DB. For the first option you can use Spark with the JDBC source (though it has some limitations too), and then save the data into DSE. For the second, you may need to use something like Debezium, which supports streaming change data from some databases into Kafka. From Kafka you can then use the DataStax Kafka Connector to submit the data into DSE.
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL BATCH for data loading unless you know how it works - it's very different from the RDBMS world, and used incorrectly it will actually make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
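For illustration, here is a rough sketch of how the Spark route might look for both sources; the file location, keyspace/table names and connection details are assumptions, and the spark-cassandra-connector plus a JDBC driver are expected on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-to-dse")
  .config("spark.cassandra.connection.host", "10.0.0.1") // assumed DSE contact point
  .getOrCreate()

// Flat files: read the CSVs in parallel and append them to an existing Cassandra table
spark.read.option("header", "true").csv("gs://my-bucket/exports/*.csv") // assumed location
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "load_ks", "table" -> "customers")) // assumed names
  .mode("append")
  .save()

// One-off RDBMS load: pull the source table over JDBC, then write it the same way
val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/source_db") // assumed connection details
  .option("dbtable", "customers")
  .option("user", "loader")
  .option("password", "secret")
  .load()

jdbcDf.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "load_ks", "table" -> "customers"))
  .mode("append")
  .save()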

Baseline for measuring Apache Spark jobs execution times

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data, stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use DataFrame .unionAll(), .join(), .count(), .map(), etc.
I am running a Spark 1.4.1 cluster on my local machine with the following setup:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=1g
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, I have set the spark.io.compression.codec to lzf, I have optimized the jobs as much as I can and as much as my knowledge allows me to.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that because this is my first Spark project, I have no baseline to compare the jobs times, so I have no idea if the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
You have to use Hive; on top of Hive you can put Spark. After doing this, create a temp table in Hive for the Cassandra table; on it you can perform all types of aggregation and filtering. After that, you can use a Hive JDBC connection to get the result. It will give fast results.
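Expressed in Spark SQL terms (a sketch under the assumption that the spark-cassandra-connector is available; the keyspace, table and column names are invented): registering the Cassandra table as a temp view lets you run the aggregations and filtering, and the Spark Thrift server can then expose them over a HiveServer2-compatible JDBC connection.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-reports")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
  .enableHiveSupport()
  .getOrCreate()

// expose the Cassandra table to SQL
spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "reports_ks", "table" -> "events")) // assumed names
  .load()
  .createOrReplaceTempView("events")

// aggregation and filtering over the view
spark.sql("SELECT event_type, count(*) AS cnt FROM events WHERE day = '2015-09-01' GROUP BY event_type").show()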

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts on and experiences with the use of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. Also, DataStax has a DSE version of Cassandra which allows deploying Hadoop/Hive. The question is: in which use cases would we pick one specific solution instead of the others?
I will share a few thoughts based on my experience. But, if possible, please let us know about your use case; it'll help us answer your queries better.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from a SQL background and planning to use Cassandra, then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, CQL is not the answer, even though it covers primitive GROUP BY use cases through write-time and compaction-time sorts and implements one-to-many relationships.
2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles larger-than-memory data very well, and it is up to 10x faster thanks to the planned pipelines. The balance shifts in Spark SQL's favor when multiple data pipelines like filter and groupBy are present. Go for it when you need ad-hoc real-time querying. It is not suitable when you need long-running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides you a SQL-like interface to handle your data. But Hive is not suitable for real-time needs; it is best suited for offline batch processing. It doesn't need any additional infrastructure, as it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets, and for OLAP.
Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features, but potentially faster. It supports the existing Hive query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
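A tiny illustration of that note, with the visits table and the UDF made up for the example: HiveQL-style SQL and a user-defined function running on Spark SQL.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hiveql-on-spark")
  .enableHiveSupport()
  .getOrCreate()

// register a UDF and use it from plain SQL, Hive-style
spark.udf.register("normalize", (s: String) => s.trim.toLowerCase)

spark.sql(
  """SELECT normalize(city) AS city, count(*) AS cnt
    |FROM visits
    |GROUP BY normalize(city)
    |ORDER BY cnt DESC""".stripMargin).show()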
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S. : The above answer is based on solely my experience. Comments/corrections are welcome.
There is a very good benchmarking effort documented here - https://amplab.cs.berkeley.edu/benchmark/
