Structuring spark jobs - apache-spark

Is it possible to perform different sets of transformations in parallel on a single DStream source?
For example:
I might read a file and get a single DStream. Is it possible to perform
1. reduceByKey and other operations
2. reduceByKeyAndWindow and many more
3. Some other aggregate and more
The above three sets of operations are independent and must be performed separately on the same source.
The question is, what are the effects of a reshuffle?
Assume all three don't require a reshuffle and are evaluated in parallel, as opposed to the case where one of the steps does require a reshuffle. Would those two cases behave differently?
I'm trying to understand the flow of a Spark job. Is it possible to run multiple different transformations in parallel when they require a reshuffle?
In such a case, is it better to run multiple separate Spark jobs instead?
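A minimal sketch of the scenario in question (Scala; the source path, batch interval, and window sizes are assumptions): several independent transformations are declared against the same DStream, and each output operation becomes its own job per batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("multi-transform-sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical file-based source producing one DStream.
val lines = ssc.textFileStream("hdfs:///data/incoming")
val pairs = lines.map(line => (line.split(",")(0), 1L))

// 1. Plain per-batch aggregation
val counts = pairs.reduceByKey(_ + _)

// 2. Windowed aggregation (window and slide durations are arbitrary examples)
val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(20))

// 3. Another independent aggregate
val distinctKeys = pairs.transform(rdd => rdd.keys.distinct())

// Each output operation below triggers its own job for every batch.
counts.print()
windowed.print()
distinctKeys.print()

ssc.start()
ssc.awaitTermination()
```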

Related

How to parallelize work in spark when working with multiple dataframes?

Is parallelism among multiple dataframes supported in Spark?
I have a task that nets me hundreds of dataframes, and I want to perform transformations on each of them and write each one out.
However, unioning them together does not seem performant.
Are there any other concurrency primitives that work with multiple dataframes, so that the work scales horizontally with the number of servers?
No, there isn't a primitive to operate on multiple dataframes. But there is nothing stopping you from launching a batch script to saturate your cluster, if that's what you want to do.
Write the script for one dataframe, then launch hundreds of jobs to do the work.
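An alternative sketch within a single application, assuming you already hold the dataframes in a collection (the names, output paths, pool size, and the filter expression are placeholders): Spark accepts jobs from multiple driver threads, so a bounded thread pool can run many write jobs concurrently.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("parallel-writes-sketch").getOrCreate()

// Placeholder: pairs of (outputName, dataframe) produced by the upstream task.
val frames: Seq[(String, DataFrame)] = Seq.empty

// Bound the number of concurrent jobs; 8 is an arbitrary example value.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val writes = frames.map { case (name, df) =>
  Future {
    df.filter("value IS NOT NULL")           // hypothetical transformation
      .write.mode("overwrite")
      .parquet(s"hdfs:///output/$name")      // hypothetical output path
  }
}

// Block until every write job has finished.
Await.result(Future.sequence(writes), Duration.Inf)
```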

Using Spark, how do I read multiple files in parallel from different folders in HDFS?

I have 3 folders containing CSV files in 3 different schemas in HDFS. All three are huge (several GBs). I want to read the files in parallel and process the rows in them in parallel. How do I accomplish this on a YARN cluster using Spark?
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action -- voilà, if you have enough resources in the cluster, you'll have all files being read in parallel. At worst, Spark's job scheduler will reorder the execution of certain tasks to minimize wait times.
If you don't have enough workers/executors you won't gain much, but if you do, you can fully exploit those resources without having to wait for each job to finish before you send out the next.
Due to lazy evaluation this may happen anyway, depending on how you work with the data -- but you can force parallel execution of several actions/jobs by using parallel collections or Futures.
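A sketch of this approach (the folder paths are made up; written against Scala 2.12, where .par is available on standard collections -- Scala 2.13+ needs the separate scala-parallel-collections module):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-read-sketch").getOrCreate()

// Hypothetical folders, one per schema.
val folders = Seq(
  "hdfs:///data/schema_a",
  "hdfs:///data/schema_b",
  "hdfs:///data/schema_c"
)

// Each iteration of the parallel collection runs on its own driver thread,
// so the read job for one folder does not wait for the previous one to finish.
folders.par.foreach { path =>
  val df = spark.read.option("header", "true").csv(path)
  // Hypothetical per-schema processing; count() is the action that triggers the job.
  println(s"$path -> ${df.count()} rows")
}
```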
If you want to process all the data separately, you can always write 3 Spark jobs to process them separately and execute them on the cluster in parallel. There are several ways to run all 3 jobs in parallel. The most straightforward is to have an Oozie workflow with 3 parallel sub-workflows.
Now if you want to process the 3 datasets in the same job, you need to read them sequentially. After that you can process the datasets. When you process multiple datasets using Spark operations, Spark parallelizes the work for you. The closures of the operations will be shipped to the executors and all will work in parallel.
What do you mean by "read the files in parallel and process the rows in them in parallel"? Spark deals with your data in parallel by itself, according to your application configuration (num-executors, executor-cores, ...).
If you mean 'start reading the files at the same time and process them simultaneously', I'm pretty sure you can't get that explicitly. It would require some way to influence the DAG of your application, but as far as I know the only way to do that is implicitly, by building your data processing as a sequence of transformations/actions.
Spark is also designed in such a way that it can execute several stages simultaneously "out of the box", if your resource allocation allows it.
I encountered a similar situation recently.
You can pass a list of CSV paths to the Spark read API, e.g. spark.read.csv(input_file_paths). This will load all the files into a single dataframe, and all the transformations you eventually perform will be done in parallel by multiple executors, depending on your Spark config.
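A minimal sketch of that multi-path variant (the paths are hypothetical): one reader call produces a single dataframe whose partitions span all of the files, and downstream work is distributed across executors as usual.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-path-read-sketch").getOrCreate()

// Hypothetical input paths; spark.read.csv accepts a varargs list of paths.
val inputFilePaths = Seq(
  "hdfs:///data/2024/01/part-0.csv",
  "hdfs:///data/2024/02/part-0.csv"
)

val df = spark.read.option("header", "true").csv(inputFilePaths: _*)
println(df.count())   // a single action over one dataframe covering all files
```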

Same set of Tasks are repeated in multiple stages in a Spark Job

A group of tasks consisting of filters & maps appears in the DAG visualization of multiple stages. Does this mean the same transformations are recomputed in all the stages? If so, how do I resolve this?
For every action performed on a dataframe, all transformations will be recomputed. This is because transformations are not computed until an action is performed.
If you only have a single action then there is nothing you can do. However, in the case of multiple actions one after another, cache() can be used after the last transformation. With this method Spark will keep the dataframe in memory after the first computation, making subsequent actions much faster.
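A small sketch of where the cache() call goes (the input path and column names are made up): the filter/select lineage is computed once, kept in memory, and reused by both actions instead of being recomputed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

val events = spark.read.parquet("hdfs:///warehouse/events")   // hypothetical input
val cleaned = events
  .filter("status = 'OK'")
  .select("userId", "amount")
  .cache()   // persist after the last shared transformation

// The first action materializes and caches `cleaned`; the second reuses it.
println(cleaned.count())
cleaned.write.mode("overwrite").parquet("hdfs:///warehouse/events_clean")
```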

How does Spark's RDD achieve fault tolerance?

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. But I did not find the internal mechanism by which the RDD achieves fault tolerance. Could somebody describe this mechanism? Thanks.
Let me explain in very simple terms, as I understand it.
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each node (called an executor) is operating on a partition at any point in time. (Theoretically, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions present in the RDD.)
By operation, what is really happening is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph (DAG) of operations.
Now suppose a particular node crashes in the middle of operation Z, which depends on operation Y, which in turn depends on operation X. The cluster manager (YARN/Mesos) finds out the node is dead and assigns another node to continue processing. This node will be told to operate on that particular partition of the RDD and on the series of operations X -> Y -> Z (called the lineage) that it has to execute, by passing in the Scala closures created from the application code. Now the new node can happily continue processing, and there is effectively no data loss.
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation that you do, like calling a database from a Spark action block, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
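A small illustration of the lineage idea (Scala; the RDD contents here are arbitrary): toDebugString prints the chain of operations X -> Y -> Z that Spark would replay to rebuild a lost partition.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lineage-sketch").getOrCreate()
val sc = spark.sparkContext

val x = sc.parallelize(1 to 1000, numSlices = 4)   // operation X: source RDD
val y = x.map(_ * 2)                               // operation Y
val z = y.filter(_ % 3 == 0)                       // operation Z

println(z.toDebugString)   // shows the dependency chain (lineage) Spark keeps per RDD
println(z.count())
```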
The domain of fault-tolerance in Spark is very vast and it needs much bigger explanation. I am hoping to see others coming up with technical details on how this is implemented, etc. Thanks for the great topic though.

Is it possible to run multiple aggregation jobs on a single dataframe in parallel in spark?

Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? First preference is Python, then Scala and Java.
The possible courses of action, in order of preference, are:
Using a thread pool - run different functions doing different aggregations on different threads. I have not seen an example that does this.
Using cluster mode on YARN, submitting different jars. Is this possible? If yes, is it possible in PySpark?
Using Kafka - run different spark-submits on the dataframe streaming through Kafka.
I am quite new to Spark, and my experience is limited to running Spark on YARN for ETL, doing multiple aggregations serially. I was wondering whether it is possible to run these aggregations in parallel, as they are mostly independent.
Considering your broad question, here is a broad answer:
Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
For the rest, it doesn't seem to be clear what you are asking.
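A sketch of the thread-pool option (Scala here, although the asker prefers Python; the input path and column names are assumptions): independent aggregations on the same cached dataframe are submitted from separate driver threads, so their jobs can be scheduled concurrently. Setting spark.scheduler.mode to FAIR is optional but helps concurrent jobs share executors.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("parallel-aggregations-sketch")
  .config("spark.scheduler.mode", "FAIR")   // optional: fairer sharing between concurrent jobs
  .getOrCreate()

// Hypothetical input; cache so each aggregation reuses the same in-memory data.
val df = spark.read.parquet("hdfs:///data/transactions").cache()

// One driver thread per concurrent aggregation.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))

val bySum   = Future { df.groupBy("country").agg(sum("amount")).collect() }
val byAvg   = Future { df.groupBy("product").agg(avg("amount")).collect() }
val byCount = Future { df.groupBy("userId").count().collect() }

val results = Await.result(Future.sequence(Seq(bySum, byAvg, byCount)), Duration.Inf)
results.foreach(rows => println(rows.length))
```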
