Is a Spark application equivalent to user code? - apache-spark

I want to know whether the concept of a Spark application is equivalent to "user code". In other words, is a Spark application = the user code or script that uses the Spark framework (like PySpark in Python)?

If I understand your question correctly:
In general, your Spark scripts are the same as regular code.
But there are some differences. When you run Spark, most of your code is evaluated lazily and executed only on actions (like collect, show, count, etc.). Before execution these operations are optimized under the hood and might not run in the same order as they appear in the script. For example, filters are pushed up the stream, closer to the data source (predicate pushdown).
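For illustration, a minimal PySpark sketch of this behaviour (the file path and column names are made up):

    # A minimal sketch of lazy evaluation; the path and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    df = spark.read.parquet("/data/events.parquet")         # transformation: nothing is read yet
    projected = df.select("user_id", "country")             # transformation: still lazy
    filtered = projected.filter(F.col("country") == "us")   # transformation: still lazy

    # Only an action triggers execution. Before running, Catalyst optimizes the plan
    # and may reorder work, e.g. push the filter down into the file scan.
    filtered.count()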
This course is good for general understanding: https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/course/ (of course there are other and newer resources).
As for PySpark: it is just an API to the Spark framework, and you might have code that runs in plain Python and only calls PySpark for the data processing.

Related

What is the difference between MapReduce and Spark as the execution engine in Hive?

It looks like there are two ways to use Spark as the backend engine for Hive.
The first one is to use Spark directly as the engine, like this tutorial.
Another way is to use Spark as the backend engine for MapReduce, like this tutorial.
In the first tutorial, hive.execution.engine is spark, and I cannot see HDFS involved.
In the second tutorial, hive.execution.engine is still mr, but as there is no Hadoop process, it looks like the backend of mr is Spark.
Honestly, I'm a little bit confused about this. I guess the first one is recommended, as mr has been deprecated. But where is HDFS involved?
I understood it differently.
Normally Hive uses MR as the execution engine, unless you use Impala, but not all distros have this.
But for a while now Spark can also be used as the execution engine for Hive.
https://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/ discusses this in more detail.
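For context, which engine Hive uses is a single configuration setting; switching between MapReduce, Tez, and Spark (where the distro ships Hive on Spark) looks roughly like this:

    -- Set Hive's execution engine for the current session.
    SET hive.execution.engine=mr;     -- classic MapReduce (deprecated in newer Hive)
    SET hive.execution.engine=tez;    -- Tez
    SET hive.execution.engine=spark;  -- Hive on Spark, if the build/distro supports it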
Apache Spark builds a DAG (directed acyclic graph), whereas MapReduce sticks to its native map and reduce phases. During execution in Spark, the logical dependencies are turned into physical dependencies.
Now, what is a DAG?
A DAG captures the logical dependencies between operations before execution (think of it as a visual graph).
When there are multiple map and reduce steps, or the output of one reduce is the input to another map, the DAG helps speed up the jobs.
A DAG is built in Tez, but not in MapReduce.
NOTE:
Apache Spark works on a DAG but has stages in place of map/reduce. Tez has a DAG and works on map/reduce. To keep it simple I used the map/reduce terminology, but remember that Apache Spark has stages; the concept of the DAG remains the same.
Reason 2:
A map task persists its output to disk (it uses a buffer too, but once ~90% of it is filled the output spills to disk). From there the data goes to the merge phase.
But in Apache Spark, intermediate data is persisted to memory, which makes it faster.
Check this link for details
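To make the DAG/stage idea concrete, here is a minimal PySpark sketch (the computation itself is arbitrary); explain() shows the physical plan Spark derives before an action runs it:

    # A minimal sketch; the computation is arbitrary and only meant to show stages.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dag-demo").getOrCreate()

    df = spark.range(1000000)                # transformation: lazy
    even = df.filter(F.col("id") % 2 == 0)   # transformation: lazy
    counts = even.groupBy((F.col("id") % 10).alias("bucket")).count()

    # explain() prints the physical plan; the Exchange (shuffle) marks a stage boundary.
    counts.explain()

    # Only an action executes the whole DAG of stages.
    counts.show()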

How to use Flink and Spark together, with Spark used just for transformation?

Let's say there is a collection "goods" in MongoDB like this:
{"name":"A","attr":{"location":"us"},"eventTime":"2018-01-01"}
{"name":"B","attr":{"brand":"nike"},"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name", explode($"attr"))
But now we need to handle incremental data.
For example, a new good shows up as a third line the next day:
{"name":"A","attr":{"location":"us"},"eventTime":"2018-01-01"}
{"name":"B","attr":{"brand":"nike"},"eventTime":"2018-01-01"}
{"name":"C","attr":{"location":"uk"},"eventTime":"2018-02-01"}
Some of our team think Flink is better for streaming, because Flink offers event-driven applications, streaming pipelines and batch, whereas Spark is just micro-batch.
So we want to switch to Flink, but a lot of code has already been written with Spark, for example the "explode" above. So my question is:
Is it possible to use Flink to fetch the source and save to the sink, but in the middle use Spark to transform the dataset?
If that is not possible, how about saving to a temporary sink, let's say some JSON files, and then having Spark read the files, transform them and save to Hive? But I am afraid this makes no sense, because for Spark it is also incremental data. Using Flink and then Spark would be the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changing the code. They are two different processing frameworks with different APIs, and their syntaxes differ from each other. The choice of framework should really be driven by the use case, not by generic statements like "Flink is better than Spark". A framework may work great for your use case and perform poorly for another. By the way, Spark is not just micro-batch: it has batch, streaming, graph, ML and other components. Since the complete use case is not described in the question, it is hard to suggest which one is better for this scenario. But if your use case can afford sub-second latency, I would not waste my time moving to another framework.
Also, if things are dynamic and you anticipate that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most of the processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying processing engine at any time. Here is the link to read more about Beam: https://beam.apache.org/.
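For illustration, a minimal Apache Beam sketch in Python (the flattening step and field names mirror the question and are otherwise hypothetical); the same pipeline can be executed on the Flink or Spark runner by changing the pipeline's --runner option:

    # A minimal Apache Beam sketch; field names and the inline test data are hypothetical.
    import apache_beam as beam

    def flatten_attrs(good):
        # Emit one record per attribute key/value pair, similar to Spark's explode().
        for key, value in good.get("attr", {}).items():
            yield {"name": good["name"], "attr_key": key, "attr_value": value}

    # The runner is chosen via pipeline options, e.g. --runner=FlinkRunner or --runner=SparkRunner.
    with beam.Pipeline() as p:
        (p
         | "Create" >> beam.Create([
               {"name": "A", "attr": {"location": "us"}, "eventTime": "2018-01-01"},
               {"name": "C", "attr": {"location": "uk"}, "eventTime": "2018-02-01"},
           ])
         | "FlattenAttrs" >> beam.FlatMap(flatten_attrs)
         | "Print" >> beam.Map(print))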

Spark-java multithreading vs running individual spark jobs

I am new to Spark and trying to understand the performance difference between the approaches below (Spark on Hadoop).
Scenario: as part of batch processing I have 50 Hive queries to run. Some can run in parallel and some sequentially.
- First approach
All of the queries can be stored in a Hive table, and I can write a Spark driver that reads all the queries at once and runs them in parallel (with HiveContext) using Java multithreading.
Pros: easy to maintain
Cons: all resources may get occupied, and performance tuning can be tough for each query.
- Second approach
Run each query individually using Oozie Spark actions.
Pros: optimization can be done at the query level
Cons: tough to maintain.
I couldn't find any documentation about how Spark processes the queries internally in the first approach. From a performance point of view, which approach is better?
The only thing I could find on Spark multithreading is:
"within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
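For concreteness, a minimal sketch of the first approach (the queries are hypothetical), shown in PySpark for brevity; the same idea applies to a Java driver sharing one HiveContext across threads:

    # A minimal sketch of running several Hive queries from one driver via threads.
    # The queries are hypothetical; jobs submitted from different threads run concurrently,
    # sharing the application's executors (consider fair scheduler pools for tuning).
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().appName("parallel-hive").getOrCreate()

    queries = [
        "SELECT COUNT(*) FROM db.table_a",
        "SELECT COUNT(*) FROM db.table_b",
    ]

    def run(sql):
        return spark.sql(sql).collect()   # collect() is the action that submits the jobs

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run, queries))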
Thanks in advance
Since your requirement is to run Hive queries in parallel with the condition
"Some can run parallel and some sequential",
this kind of workflow is best handled by a DAG processor, which Apache Oozie is. This approach will be cleaner than managing your queries in code, i.e. building your own DAG processor instead of using the one provided by Oozie.

Which PySpark API calls require the same version of Python on the workers in yarn-client mode?

Usually I run my code with a different version of Python in the driver than in the worker nodes, using yarn-client mode.
For instance, I usually use Python 3.5 in the driver and the default Python 2.6 in the workers, and this works pretty well.
I am currently on a project where we need to call
sqlContext.createDataFrame
But this seems to try to execute the statement in Python on the workers, and then I hit the requirement of installing the same version of Python on the workers, which is what I am trying to avoid.
So, in order to use "sqlContext.createDataFrame", is it a requirement to have the same Python version in the driver and the workers?
And if so, which other "pure" pyspark.sql API calls would also have this requirement?
Thanks,
Jose
Yes, the same Python version is the requirement in general. Some API calls may not fail because no Python executor is used, but it is not a valid configuration.
Every call that interacts with Python code, like udf or DataFrame.rdd.*, will trigger the same exception.
If you want to avoid upgrading the cluster's Python, then use Python 2 on the driver.
In general, many PySpark operations are just wrappers around Spark operations on the JVM. For these operations it doesn't matter what version of Python is used on the worker, because no Python is executed on the worker, only JVM operations.
Examples of such operations include reading a DataFrame from a file and all built-in functions that do not require Python objects/functions as input.
Once a function requires an actual Python object or function, things become a little trickier.
Let's say, for example, that you want to use a UDF with lambda x: x + 1 as the function.
Spark doesn't really know what the function is. Instead it serializes it and sends it to the workers, which deserialize it in turn.
For this serialization/deserialization process to work, the Python versions on both sides need to be compatible, and that is often not the case (especially between major versions).
All of this leads us to createDataFrame. If you pass an RDD as one of the parameters, for example, the RDD contains Python objects as its records, and these need to be serialized and deserialized; therefore both sides must have the same version.
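To illustrate the distinction, a minimal sketch (the path, columns and UDF are hypothetical): the first part stays entirely on the JVM, while the second ships Python code to the workers and therefore needs a compatible Python there:

    # A minimal sketch; the path, column names and the UDF are hypothetical.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()

    # JVM-only: reading files and built-in functions run in the executors' JVMs,
    # so the workers' Python version does not matter here.
    df = spark.read.parquet("/data/events.parquet")
    df.groupBy("country").count().show()

    # Python on the workers: the lambda is pickled on the driver and unpickled by a
    # Python worker process on each executor, so the versions must be compatible.
    plus_one = F.udf(lambda x: x + 1, IntegerType())
    df.select(plus_one(F.col("id"))).show()

    # The same applies to df.rdd.map(...) and to createDataFrame when it is given an
    # RDD of Python objects.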

Storm and Spark

I want to check whether it is a good idea to invoke Spark code from a Storm bolt. We have a stream-based system in Storm, so per message we would like to do some ML, and we are thinking of using Spark for that. So I wanted to check if it is a good idea to do so. Are there any runtime issues we might encounter?
Thanks
ap
If you already have a system in place with Storm, then why do you want to use Spark?
IMHO Spark and Storm are different beasts; you may want to run them in parallel for the same or different use cases, but do not tightly integrate them with each other.
What do you mean by ML per message? ML on a single message doesn't make much sense. Do you mean ML on a stream? Sure, you can do it with Spark, but then you need to either use Spark Streaming (and you have two streaming architectures...) or save the data somewhere and do batch ML with Spark.
Why not use trident-ml instead?
