Google Dataflow vs Apache Spark - apache-spark

I am surveying Google Dataflow and Apache Spark to decide which one is more suitable solution for our bigdata analysis business needs.
I found there are Spark SQL and MLlib in the spark platform to do structured data query and machine learning.
I wonder is there any corresponding solution in the Google Dataflow platform?

It would help if you could expand a bit on your specific use case(s). What are you trying to accomplish in relation to "Bigdata analysis"? The short answer... it depends :-)
Here are some key architectural points to consider in relation to Google Cloud Dataflow v. Spark and Hadoop MR.
Resource Mgmt: Cloud Dataflow is a completely on demand execution environment. Specifically - when you execute a job in Dataflow the resources are allocated on demand for that job only. There is no sharing/contention of resources across jobs. In comparison to a Spark or MapReduce cluster you would typically deploy a cluster of X nodes and then submit jobs and then tune the node resources across jobs. Of course you can build up and tear down these clusters, but the Dataflow model is geared towards hands free dev ops in relation to resource management. If want to optimize resource usage to job demands Dataflow is a solid model to control cost and nearly forget about resource tuning. If you prefer a multi-tenant style cluster I'd suggest you look at Google Cloud Dataproc as it provides the on demand cluster management aspects like Dataflow, but focused on class Hadoop workloads like MR, Spark, Pig, ...
Interactivity: Currently Cloud Dataflow does not provide an interactive mode. Meaning once you submit a job the work resources are bound to the graph that was submitted AND the majority of the data is loaded into resources as needed. Spark can be a better model if you want to load data into the cluster via in memory RDD's and then dynamically execute queries. The challenge is that as your data sizes and query complexity increases you will have to handle the devOps. Now if most of your queries can be expressed in SQL syntax you may want to look at BigQuery. BigQuery provides the "on demand" aspects of Dataflow and enables you to interactively execute queries over massive amounts of data e.g petabytes. The biggest advantage in my opinion of BigQuery is that you do not have think/worry about hardware allocation to deal with your data sizes. Meaning as your data sizes grow you don't have to think about hardware (memory and disk size) configuration.
Programming Model: Dataflow's programming model is functionally biased vs. a classic MapReduce model. There are many similarities between Spark and Dataflow in terms of API primitives. Things to consider: 1) Dataflow's primary programming language is Java. There is a Python SDK in the works. The Dataflow Java SDK in open sourced and has been ported to Scala. Today, Spark has more SDK surface choice with GraphX, Streaming, Spark SQL, and ML. 2) Dataflow is a unified programming model for batch and streaming based DAG development. The goal was to remove the complexity and cost switching when moving between batch and streaming models. The same graph can seamlessly run in either mode. 3) Today, Cloud Dataflow does not support converging/iterative based graph execution. If you need the power of something like MLib then Spark is the way to go. Keep in mind this is the state of things today.
Streaming & Windowing: Dataflow (building on top of the unified programming model) was architected to be a highly reliable, durable, and scalable execution environment for streaming. One of the key differences between Dataflow and Spark is that Dataflow enables you to easily process data in terms of its true event time vs. solely processing it at it's arrival time into the graph. You can window data into fixed, sliding, session or custom windows based on event time or arrival time. Dataflow also provides Triggers (applied to Windows) that enable you to control how you want to handle late arriving data. Net-net you dial in the level of correctness control to meet the needs of your analysis. For example, lets say you have a mobile game that interacts with a 100 edge nodes. These nodes create 10000's events second related to game play. Let's say a group of nodes can't communicate with your back end streaming analysis system. In the case of Dataflow - once that data does arrive - you can control how you'd like to handle the data in relation to your query correctness needs. Dataflow also provides the ability to upgrade your streaming jobs while they are in flight. For example, let's say you discover a logical bug in a transform. You can upgrade your in flight job without losing your existing Windowed state. Net-net you can keep you business running.
Net-net:
- if you are really primarily doing ETL style work (filtering, shaping, joining, ...) or batch style MapReduce Dataflow is a great path if you want minimal devOps.
- if you need to implement ML style graphs, go the Spark path and give Dataproc a try
- if you are doing ML and you first need to do ETL to clean up your training data implement a hybrid with Dataflow and Dataproc
- if you need interactivity Spark is a solid choice, but so is BigQuery if you are/can express your queries in SQL
- if you need to process your ETL and or MR jobs over streams, Dataflow is a solid choice.
So... what are you scenarios?

I've tried both :
Dataflow is still very young, the is no "out-of-the-box" solution for doing ML with it (even though you could implement algorithms in transforms), you could output the processes data to cloud storage and read it later with another tool.
Spark would be recommended but you would have to manage your cluster yourself.
However there is a good alternative: Google Dataproc
You can develop analysis tools with spark and deploy them with one command on your cluster, dataproc will manage the cluster itself without having to tweak the configuration.

I have built code using spark,DataFlow .Let me put my thoughts.
Spark/DataProc: I have used spark (Pyspark) a lot for ETL. You can use SQL and any programming language of your choice. Lot of functions are available (Including Window functions). Build your dataframe and write your transformation and it can be super fast. Once data is cached , any operation on the Dataframe will quick.
You can simply build hive external table on the GCS. Then you can use Spark for ETL and Load data into Big Query. This is for Batch processing.
For streaming you can use spark Streaming and load data into Big query.
Now if you have cluster allready then you have think whether to move to Google cloud or not. I found Data proc (Google Cloud Hadoop/Spark) offering is better as you don't have to worry many cluster managements..
DataFlow : It's know as apache beam. Here you can write your code in Java/Python or any other language. You can execute the code in any framework (Spark/MR/Flink).This is a unified model. Here you can do both batch processing and Stream Data processing.

google now offers both programming models- mapreduce and spark.
Cloud DataFlow and Cloud DataProc they are respectively

Related

Kubernetes Vs Spark Vs Spark on kubernetes

So I have a use case where I will stream about 1000 records per minute from kafka. I just need to dump these records in raw form in a no sql db or something like a data lake for that matter
I ran this through two approaches
Approach 1
——————————
Create kafka consumers in java and run them as three different containers in kubernetes. Since all the containers are in the same kafka consumer group, they would all contribute towards reading from same kafka topic and dump data into data lake. This works pretty quick for the volume of work load I have
Approach 2
——————————-
I then created a spark cluster and the same java logic to read from kafka and dump data in data lake
Observations
———————————-
Performance of kubernetes if not bad was equal to that of a spark job running in clustered mode.
So my question is, what is the real use case for using spark over kubernetes the way I am using it or even spark on kubernetes?
Is spark only going to rise and shine much much heavier work loads let’s say something of the order of 50,000 records per minute or cases where some real time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated to it so I need to make sure I use it only if it would scale better than kuberbetes solution
If your case is only to archive/snapshot/dump records I would recommend you to look into the Kafka Connect.
If you need to process the records you stream, eg. aggregate or join streams, then Spark comes into the game. Also for this case you may look into the Kafka Streams.
Each of these frameworks have its own tradeoffs and performance overheads, but in any case you save much development efforts using the tools made for that rather than developing your own consumers. Also these frameworks already support most of the failures handling, scaling, and configurable semantics. Also they have enough config options to tune the behaviour to most of the cases you can imagine. Just choose the available integration and you're good to go! And of course beware the open source bugs ;) .
Hope it helps.
Running kafka inside Kubernetes is only recommended when you have a lot of expertise doing it, as Kubernetes doesn't know it's hosting Spark, and Spark doesn't know its running inside Kubernetes you will need to double check for every feature you decide to run.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools and scheduling features plus the huge community support adds well on the long run.
Spark is a open source, scalable, massively parallel, in-memory execution engine for analytics applications so it will really spark when your load become more processing demand. It simply doesn't have much room to rise and shine if you are only dumping data, so keep It simple.

Flink or Spark? when streaming is not important

I recently have been comparing Spark and Flink for a brand new project. In this project, streaming feature is not so important. Batch analysis of ~90TB data is most important. Later I will apply ML and data mining in data analysis.
When searching around, I find lots of articles, presentations and videos claim Flink is the next generation analysis solution. Don't see much articles defending Spark. On the other hand, Spark is(or was?) very popular and widely deployed in very large production system.
My question is: For my use case, i.e. streaming is not important, shall I embrace Flink or start with Spark 2?
BTW, I've read through this thread. It doesn't give me a good answer.
Update, April 2018: Eventually we choose Spark. Apparently there are more questions to address other than performance. Cloudera, Hortonworks and HDInsight give a good level of confidence/proof in security, stability, scale, roadmap etc. to enterprise architects and security reviewers.
As per your requirement, Apache Spark is best. Both Spark and Flink are advanced big data processing technologies. In terms of features, stability, ecosystem, community, integrations with other systems and adaptability Spark is much ahead of Flink.
The major difference between Spark and Flink is: Spark is a batch processing system and it has streaming abstraction whereas Flink is stream data processing system for processing unbounded datasets and it has batch processing abstraction to process bounded datasets in batch style.
Spark is best for ETL, Machine Learning, Streaming, data warehousing and Graph processing on large volumes of datasets. Flink is best for stream processing on large and unbounded datasets.
[Apache-Flink]
[Apache-Spark]

Apache Spark & Machine Learning - Using in production

Im having some difficulties figuring out how to use spark's machine learning capabilities in a real life production environment.
What i want to do is the following:
Develop a new ml model using notebooks
Serve the learned model using REST api (something like POST - /api/v1/mymodel/predict)
Let say the ml training process is handled by a notebook, and once the model requirements are fulfilled it's saved into an hdfs file, to be later loaded by a spark application
I know i could write a long running spark application that exposes the api and run it on my spark cluster, but i don't think this is really a scalable approach, because even if the data transformations and the ml functions would run on the workers node, the http/api related code would still run on one node, the one on wich spark-submit is invoked (correct me if i'm wrong).
One other approach is to use the same long running application, but in a local-standalone cluster. I could deploy the same application as many times as i want, and put a load balancer in front of it. With this approach the http/api part is handled fine, but the spark part is not using the cluster capabilities at all (this could not be a problem, due to fact that it should only perform a single prediction per request)
There is a third approach wich uses SparkLauncher, wich wraps the spark job in a separate jar, but i don't really like flying jars, and it is difficult to retrieve the result of the prediction (a queue maybe, or hdfs)
So basically the question is: what is the best approach to consume spark's ml models through rest api?
Thank You
you have three options
trigger batch ML job via spark api spark-jobserver, upon client request
trigger batch ML job via scheduler airflow , write output to DB, expose DB via rest to client
keep structured-streaming / recursive functionon to scan input data source, update / append DB continuously, expose DB via rest to client
If you have single prediction per request, and your data input is constantly updated, I would suggest option 3, which would transform data in near-real-time at all times, and client would have constant access to output, you can notify client when new data is completed by sending notification via rest or sns, you could keep pretty small spark cluster that would handle data ingest, and scale rest service and DB upon request / data volume (load balancer)
If you anticipate rare requests where data source is updated periodically lets say once a day, option 1 or 2 will be suitable as you can launch bigger cluster and shut it down when completed.
Hope it helps.
The problem is you don't want to keep your spark cluster running and deploy your REST API inside it for the prediction as it's slow.
So to achieve real-time prediction with low latency, Here are a couple of solutions.
What we are doing is Training the model, exporting the model and use the model outside Spark to do the Prediction.
You can export the model as a PMML file if the ML Algorithm you used is supported by the PMML standards. Spark ML Models can be exported as JPMML file using the jpmml library. And then you can create your REST API and use JPMML Evaluator to predict using your Spark ML Models.
MLEAP MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring or the MLeap runtime to power realtime API services. It supports multiple platforms, though I have just used it for Spark ML models and it works really well.

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially in the need of time series based database..
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is, as long as I can use Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? any useful use-cases are appreciated!
The answer is broad but summarizing ... Cassandra is highly scalable and there are lot of scenarios where it fits but CQL sintax has some limitations if you don't have your schema ready for some queries.
If you want to make use of your data without restrictions and doing analytical workloads with your cassandra data or join with other tables Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend you to check this slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data where as Spark is for performing some computation on top of it. Analogy with Hadoop: Cassandra is like HDFS where as Spark is like Map Reduce.
Especially with computations, when using DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), then that operation is optimized to run locally on each machine in cluster without any data movement in network.
Same goes with a lot of other Spark workload, the actions(some function which modifies the data) are done locally and only result is sent to client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is well supported and popular choice. If you don't need to do any operations on the data, still you can use Spark for other purposes like I mentioned below.
Spark streaming can be used to ingest or export data from Cassandra ( I used it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents but Spark streaming code I wrote for ingesting 10GB data from Cassandra contains less than 20 lines of code with multi machine-multi threading built-in and an admin UI where I can see the job progress.
With Spark+Zeppelin, we can visualize Cassandra data using Spark, we can build beautiful UIs with little Spark code where users can even enter input and see the result as graph/table etc.
Note: Actually, visualization can be better with Kibana/ElasticSearch or Solr/Banana when used with Cassandra but they are very hard to setup and indexing has it's own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache cassandra is have feature like fast read and write so you can use it with the apache spark streaming to write your data directly into cassandra without legacy.
For use case you can consider any video application to upload video with the help of streaming and directly store it into cassandra blob.

Big Data Analytics using Redshift vs Spark, Oozie Workflow Scheduler with Redshift Analytics

We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time).
Currently, it seems that all our Analytics can be done through Redshift queries (and hence, no distributed processing might be required at our end) but we are not sure if that will remain to be the case in future.
In order to build a generic system that should be able to cater our future needs as well, we are looking to use Apache Spark for data analytics.
I know that data can be read into Spark RDDs from HDFS, HBase and S3, but does it support data reading from Redshift directly?
If not, we can look to transfer our data to S3 and then read it in Spark RDDs.
My question is if we should carry out our Data Analytics through Redshift's queries directly or should we look to go with the approach above and do analytics through Apache Spark (Problem here is that Data Locality optimization might not be available)?
In case we do analytics through Redshift queries directly, can anyone please suggest a good Workflow Scheduler to write our Analytics jobs with. Our requirement is to be able to execute jobs as a DAG (Job2 should execute only if Job1 succeeds, etc) and be able to schedule our workflows through the proposed Workflow Engine.
Oozie seems like a good fit for our requirements but it turns out that Oozie cannot be used without Hadoop. Does it make sense to set up Hadoop on our machines and then use Oozie Workflow Scheduler to schedule our Data Analysis jobs through Redshift queries?
You cannot access data stored on Redshift nodes directly (each via Spark), only via SQL queries submitted the cluster as a whole.
My suggestion would be to use Redshift as long as possible and only take on the complexity of Spark/Hadoop when you absolutely need it.
If, in the future, you move to Hadoop then Cascading Lingual gives you the option of running your existing Redshift analytics more or less unchanged.
Regarding workflow, Oozie is not a good fit for Redshift. I would suggest you look at Azkaban (true DAG) or Luigi (uses a Python DSL).

Resources