Reading multiple directories into multiple spark dataframes - apache-spark

I'm trying to read a list of directories each into its own dataframe. E.g.
dir_list = ['dir1', 'dir2', ...]
df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
...
Each directory has data of varying schemas.
I want to do this in parallel, so just a simple for loop won't work. Is there any way of doing this?

Yes there is :))
"How" will depend on what kind of processing you do after read, because by itself the spark.read.csv(...) won't execute until you call an action (due to Spark's lazy evaluation) and putting multiple reads in a for loop will work just fine.
So, if the results of evaluating the dataframes have the same schema, parallelism can be achieved simply by UNION'ing them. For example,
from pyspark.sql.functions import lit

df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
(df1.withColumn("dfid", lit("df1")).groupBy("dfid").count()
    .union(df2.withColumn("dfid", lit("df2")).groupBy("dfid").count())
    .show(truncate=False))
... will cause both dir_list[0] and dir_list[1] to be read in parallel.
If this is not feasible, then there is always a Spark Scheduling route:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.
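For example, here is a minimal sketch of that route in PySpark, assuming an existing SparkSession named spark; the count() action is just a placeholder for whatever processing you actually need:

from concurrent.futures import ThreadPoolExecutor

def read_and_count(path):
    # Each call runs in its own thread, so each submits its own Spark job
    # and the scheduler can run them concurrently.
    df = spark.read.csv(path)
    return path, df.count()  # count() is the action that actually triggers the read

dir_list = ['dir1', 'dir2']
with ThreadPoolExecutor(max_workers=len(dir_list)) as pool:
    results = list(pool.map(read_and_count, dir_list))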

Related

Executing a function in parallel to process huge XML files in PySpark

I have a Spark dataframe filedf which has only 1 column (filename) and many rows. These are the filenames of XML files, each >= 1 GB in size. There is another function as below.
def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    # <do process 1>
    # <do process 2>
I want to call the function transformfiles on all the rows of the dataframe filedf concurrently.
Currently I am using a for loop to loop through all the rows in the dataframe and call this function which only runs sequentially.
filenames = filedf.select("filename").collect()
filelist = [r["filename"] for r in filenames]
for fname in filelist:
    transformfiles(fname)
I have also tried the udf approach of wrapping the function in a udf and then using it in withColumn like below.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    # <do process 1>
    # <do process 2>
    return "Success"

transform_udf = udf(lambda x: transformfiles(x), StringType())
df2 = filedf.withColumn("process_status", transform_udf("filename"))
Both these approaches run in roughly the same time.
I am running a 140 GB memory, 20 core cluster with 17 workers.
Please let me know if there is an approach to bring about parallelism while doing this. I am not sure if the approach I am using utilizes the cluster resources efficiently.
Depending on how you built that dataframe with the filenames, it may consist of just one partition. And since Spark parallelizes at the partition level, the udf version would then indeed still go over all the data essentially sequentially.
You must make sure that you partition your dataframe into multiple partitions, and then those partitions can be handled in parallel by multiple executors/workers.
Use something like filedf.repartition(numPartitions, "filename") to ensure your data is distributed over multiple partitions. As for the number of partitions, that depends on various things, like how many resources an executor needs to parse such an XML file (and thus how many concurrent parsing jobs you can have running on your cluster), possible data skew, etc. You could always start out with e.g. the default value of 200 to see the effect and tune from there.
An additional remark: Your dataframe contains just the filename, and you do not return actual parsed data from the UDF. So, apparently your transformfile function takes care of actually getting the file content, and writing/handling the parsed data somewhere (so you want to use Spark mainly for easy parallelization, not really for data processing?). Ensure that you don't have any bottlenecks in those parts either (for example if 200 Spark executors would start concurrently writing to a single external destination and overload it).
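A minimal sketch of that, reusing filedf and the transformfiles function from the question (the partition count of 200 and the output path are just illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

transform_udf = udf(transformfiles, StringType())

result = (filedf
          .repartition(200, "filename")  # spread the filenames over multiple partitions
          .withColumn("process_status", transform_udf("filename")))

# An action is still required to actually run the UDF on the executors.
result.write.mode("overwrite").csv("/tmp/process_status")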

Is Spark DAG execution order in parallel or sequential?

I have two sources; they can be different types of sources (database or files) or the same type.
Dataset1 = source1.load;
Dataset2 = source2.load;
Will Spark load the data into the different datasets in parallel, or will it load them in sequence?
Actions occur sequentially. So the answer to your question, whether Spark will load the data into the different datasets in parallel, is: sequentially, as these are Actions.
The data pipelines required for an Action, including its Transformations, do occur in parallel where possible. E.g. creating a Data Frame with 4 loads that are subject to a Union, say, will cause those loads to occur in parallel, provided enough Executors (Slots) can be allocated.
So, as the comment also states, you need an Action and the DAG path will determine flow and any parallelism that can be applied. You can see that in the Spark UI.
To demonstrate:
rdd1 = get some data
rdd2 = get some other data
rdd3 = get some other other data
rddA = rdd1 union rdd2 union rdd3
rddA.toDF.write ...
// followed by
rdd1' = get some data
rdd2' = get some other data
rdd3' = get some other other data
rddA' = rdd1' union rdd2' union rdd3'
rddA'.toDF.write ...
rddA'.toDF.write ... will occur after rddA.toDF.write ... has completed. None of the rdd1', rdd2', rdd3' Transformations occur in parallel with the Transformations / Action of rddA.toDF.write; that cannot be the case. This means that if you want write parallelism you need two separate Spark apps running concurrently, provided resources allow that, of course.
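In concrete PySpark terms, with hypothetical source and output paths, the same pattern looks roughly like this:

df_a = (spark.read.parquet("src1")
        .union(spark.read.parquet("src2"))
        .union(spark.read.parquet("src3")))
df_a.write.parquet("out_a")  # the three source reads feeding this one action can run in parallel

df_b = (spark.read.parquet("src4")
        .union(spark.read.parquet("src5"))
        .union(spark.read.parquet("src6")))
df_b.write.parquet("out_b")  # this second action starts only after the first write has finished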

Spark: "SparkException: Task not serializable" when joining a to-be transformed Dataset

I have the following code. It performs a join operation on two Datasets, one of which is being filtered inside the join transformation.
activeUserProfileDataset.join(
  allJobModelsDataset.filter(jobModel => jobIdRecCandidatesBroadcasted.value.contains(jobModel.JobId)),
  $"notVisitedJobId" === col(JobModelFieldNames.jobId),
  "left_outer")
This caused a problem:
SparkException: Task not serializable
However, when I take out the filter transformation and create the second Dataset outside of the join, it works this time:
val jobIdRecCandidatesJobModels = allJobModelDataset.filter(jobModel => jobIdRecCandidatesBroadcasted.value.contains(jobModel.JobId))
val userJobPredictionsDataset3 = userJobPredictionsDataset2.join(
jobIdRecCandidatesJobModels,
$"notVisitedJobId" === col(JobModelFieldNames.jobId),
"left_outer")
Could you explain why this is? Could you tell me how these transformation operations (like join and filter) work internally?
Thanks!
This is because in Spark you can't specify transformations inside another transformation.
The main idea:
Driver node processes DAGs and creates tasks
Workers execute transformations (in form of tasks)
In the first example you try to process the DAG and create tasks INSIDE a transformation (i.e. on a worker node): essentially, you create a task that needs to create tasks on another DF. But remember - workers can't create new tasks. They only execute them.
In the second example you do everything correctly on the driver node: you first create the transformations on the DF, and then you just use the resulting DF in the new task.
Hope that helps :)

Structuring spark jobs

Is it possible to perform different sets of transformations in parallel on a single DStream source?
For example:
I might read a file and get a single DStream. Is it possible to perform
1. reduceByKey and other operations
2. reduceByKeyAndWindow and many more
3. Some other aggregate and more
The above 3 sets of operations are independent and must be performed separately from a single source.
The question is, what are the effects of a reshuffle?
Assume all 3 don't require reshuffling and are evaluated in parallel, as opposed to the case where one of the steps requires a reshuffle. Will that lead to different results?
I am trying to understand the flow of a Spark job. Is it possible to run multiple different transformations in parallel that would require a reshuffle?
In such a case, is it better to run multiple Spark jobs altogether?
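For concreteness, the scenario described above might look roughly like the following sketch (the source directory, batch interval, and aggregation functions are hypothetical, and an existing SparkSession named spark is assumed):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, 10)           # 10 second batches
lines = ssc.textFileStream("/data/incoming")             # the single DStream source
pairs = lines.map(lambda line: (line.split(",")[0], 1))

counts = pairs.reduceByKey(lambda a, b: a + b)                           # 1. per-batch aggregate
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 20)  # 2. windowed aggregate
key_counts = pairs.map(lambda kv: kv[0]).countByValue()                  # 3. some other aggregate

counts.pprint()
windowed.pprint()
key_counts.pprint()

ssc.start()
ssc.awaitTermination()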

Parallel writing from same DataFrame in Spark

Let's say I have a DataFrame in Spark and I need to write the results of it to two databases, where one stores the original data frame but the other stores a slightly modified version (e.g. drops some columns). Since both operations can take a few moments, is it possible/advisable to run these operations in parallel or will that cause problems because Spark is working on the same object in parallel?
import java.util.concurrent.Executors
import scala.concurrent._

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def write1(): Unit = {
  // your save statement for the first dataframe
}

def write2(): Unit = {
  // your save statement for the second dataframe
}

def writeAllTables(): Unit = {
  // Each write is submitted from its own thread, so the two Spark jobs can run concurrently.
  // In practice, keep the returned Futures and await them before the application exits.
  Future { write1() }
  Future { write2() }
}
Let me ask you, do you really need to do it? If you are not sure, then you most probably don't.
So, let's assume a scenario similar to the one you explained:
val df1 = spark.read.csv("someFile.csv") // Original Dataframe
val df2 = df1.withColumn("newColumn", concat(col("oldColumn"), lit(" is blah!"))) // Modified Dataframe; FYI, this df2 is a different object
df1.write.save("db_loc1") // Write to DB1, already parallelised & uses spark resources optimally
df2.write.save("db_loc2") // Write to DB2, already parallelised & uses spark resources optimally
The Spark scheduler divides the first DataFrame df1 into partitions and writes them in parallel to db_loc1.
It then picks up the second DataFrame df2, again breaks it into partitions, and writes those partitions in parallel to db_loc2.
By default, the degree of parallelism per write is chosen so as to use the available cluster resources optimally.
Small writes might not be repartitioned, as mostly the write time is low and repartitioning would only add overhead. In the extraordinary case where you have a lot of small writes, it might make a good case for trying to parallelise these writes. But the best way to do so is to redesign your code to run one Spark job per DataFrame instead of trying to parallelise the DataFrame.write() calls in the same driver program.
Large writes will probably use all available resources in parallel during the single DataFrame write itself. Hence, if Spark allowed issuing another write operation for a different DataFrame at the same time, it would only delay both operations, as they would now be racing with each other for resources. Not to mention, there may be some performance slowdown due to the increased overhead from the sheer increase in the number of tasks that Spark now needs to manage and track.
Also, you can read this answer to learn more about this.
