Executing a function in parallel to process huge XML files in PySpark

I have a Spark dataframe filedf which has only one column (filename) and many rows. The rows are the filenames of XML files, each at least 1 GB in size. I also have a function like the one below.
def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    <do process 1>
    <do process 2>
I want to call the function transformfiles on all the rows of the dataframe filedf concurrently.
Currently I use a for loop over all the rows of the dataframe and call the function for each one, which runs sequentially.
filenames = filedf.select("filename").collect()
filelist = [r["filename"] for r in filenames]
for fname in filelist:
    transformfiles(fname)
I have also tried wrapping the function in a udf and using it in withColumn, as below.
import xmltodict
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    <do process 1>
    <do process 2>
    return "Success"

transform_udf = udf(lambda x: transformfiles(x), StringType())
df2 = filedf.withColumn("process_status", transform_udf("filename"))
Both approaches take the same amount of time.
I am running on a cluster with 140 GB of memory, 20 cores and 17 workers.
Please let me know if there is a way to run this in parallel. I am not sure the approach I am using makes efficient use of the cluster resources.

Depending on how you built the dataframe with the filenames, it may consist of just one partition. Since Spark parallelizes at the partition level, the udf version would then still go through all the data more or less sequentially.
You must make sure that the dataframe is split into multiple partitions, so that those partitions can be handled in parallel by multiple executors/workers.
Use something like filedf.repartition(numPartitions, "filename") to ensure your data is distributed over multiple partitions. The right number of partitions depends on various things, such as how many resources an executor needs to parse one of these XML files (and so how many parsing jobs can run concurrently on your cluster), possible data skew, etc. You could start with, e.g., the default value of 200, look at the effect, and tune from there.
An additional remark: your dataframe contains just the filename, and you do not return the actual parsed data from the UDF. So apparently your transformfiles function takes care of fetching the file content and of writing/handling the parsed data somewhere (so you want to use Spark mainly for easy parallelization, not really for data processing?). Make sure you don't have any bottlenecks in those parts either (for example, 200 Spark executors concurrently writing to a single external destination and overloading it).
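As a minimal sketch of the repartitioning advice above: the helper below spreads the filename rows over several partitions and then calls the processing function once per row on the executors. The name process_files_in_parallel and its parameters are made up for illustration, and it assumes that the function passed in fetches the file itself and can be pickled and shipped to the executors. It uses repartition(n) without a column, which distributes rows round-robin; with only a handful of rows, hashing on "filename" can leave some partitions empty.

```python
def process_files_in_parallel(filedf, transform, num_partitions):
    """Run transform(filename) for every row of filedf, one Spark task per partition."""

    def handle_partition(rows):
        # Runs on an executor; rows iterates over this partition's Row objects.
        for row in rows:
            transform(row["filename"])

    # repartition() shuffles the (tiny) filename rows so that they end up
    # spread over num_partitions partitions.
    # foreachPartition() is an action, so the work starts right away.
    filedf.repartition(num_partitions).foreachPartition(handle_partition)
```

It would be called as process_files_in_parallel(filedf, transformfiles, 20), where 20 is only a starting value to tune. Keep in mind that the withColumn version from the question does no work at all until an action such as count() or a write is run on df2.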

Related

Improving performance for Spark with a large number of small files?

I have millions of gzipped files to process and convert to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, and giving it a couple million files at a time to convert.
However, I've noticed that there is a big delay from when the job starts to when the files are listed and split up into a batch for the executors to do the conversion. From what I have read and understood, the scheduler has to get the metadata for those files, and schedule those tasks. However, I've noticed that this step is taking 15-20 minutes for a million files to split up into tasks for a batch. Even though the actual task of listing the files and doing the conversion only takes 15 minutes with my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files to split up into tasks. Is there any way to increase parallelism for this initial stage of indexing files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores thinking that it would increase parallelism, but it doesn't seem to have an effect.

You can try setting the config below:
spark.conf.set("spark.default.parallelism",x)
where x = total_nodes_in_cluster * (total_core_in_node - 1 ) * 5
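To make the formula concrete, here is a small worked example. The cluster size (10 nodes with 16 cores each) is a made-up assumption, and the function name suggested_parallelism is only for illustration.

```python
def suggested_parallelism(total_nodes, cores_per_node, multiplier=5):
    """x = total_nodes * (cores_per_node - 1) * 5, as in the formula above."""
    return total_nodes * (cores_per_node - 1) * multiplier


# Hypothetical cluster: 10 nodes, 16 cores per node.
x = suggested_parallelism(10, 16)
print(x)  # 10 * 15 * 5 = 750
```

As far as I know, spark.default.parallelism is read when the SparkContext starts, so it may need to be passed at submit time (--conf) or on the session builder rather than set on a running session; this is worth verifying for your Spark version.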

This is a common problem with Spark (and other big data tools), as it uses only one driver to list all the files from the source (S3) and their paths.
Some more info here
I have found this article really helpful to solve this issue.
Instead of using Spark to list the files and get their metadata, we can use PureTools to create a parallelized RDD of the files and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up tools as in the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over the files in parallel using RDDs.
Steps for S3 Manifest Solution
# Create RDD from list of files
pathRdd = sc.parallelize([file1,file2,file3,.......,file100])

# Create a function which reads the data of file
def s3_path_to_data(path):
    # Get data from s3
    # return the data in whichever format you like i.e. String, array of String etc.

# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
Details
Spark will create pathRdd with the default number of partitions and then call the s3_path_to_data function on the rows of each partition in parallel.
Partitions play an important role in Spark parallelism. For example:
If you have 4 executors and 2 partitions, then only 2 executors will do the work.
You can experiment with the number of partitions and the number of executors to get the best performance for your use case.
The following attributes are useful for getting insight into your df or rdd when fine-tuning Spark parameters (the forms below are Scala; in PySpark use rdd.getNumPartitions()).
rdd.getNumPartitions
rdd.partitions.length
rdd.partitions.size
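The steps above can be sketched in PySpark with an explicit partition count instead of the default. This is only a sketch: it assumes the file listing is available as plain text with one path per line (a hypothetical format, not the actual S3 manifest layout), and read_fn stands for a reader such as s3_path_to_data that you supply yourself.

```python
def load_paths(listing_text):
    """Turn a text listing (one path per line) into a list of paths."""
    return [line.strip() for line in listing_text.splitlines() if line.strip()]


def build_data_rdd(sc, paths, read_fn, num_slices):
    """Distribute the paths over num_slices partitions and read them in parallel."""
    # numSlices controls how many partitions (and so how many tasks) are created.
    path_rdd = sc.parallelize(paths, numSlices=num_slices)
    # read_fn must return an iterable of records for each path.
    return path_rdd.flatMap(read_fn)
```

Calling getNumPartitions() on the returned RDD shows how many tasks the read will be split into.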

Number of files generated by a Spark Job

I want to monitor the number of files that spark generates, and maybe raise an exception if it is generating a lot of files. Is there any way to see this?

Well, it depends on how you are doing the write operation. Assuming you are writing the content of a dataframe or rdd as output, the easiest way is to look at the number of partitions in your final dataframe/rdd. Basically, each partition is written as a separate file.
Assuming you are using Scala, this should give you the number of partitions.
df.rdd.getNumPartitions
Instead of raising an exception and causing the job to fail, I would suggest using the coalesce function to reduce the df to a number of partitions that suits your needs. For example, if the output is not too large (1 GB or less), I use coalesce(1) and write only one file.
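If you do want a hard check after the write, one option is to count the files Spark produced. The sketch below assumes the output directory is on a local or mounted filesystem (for HDFS or S3 you would need that store's own listing API) and that data files use Spark's usual part- prefix; the function names and the threshold are made up for illustration.

```python
import os


def count_part_files(output_dir):
    """Count data files (part-*) under output_dir, including partition subfolders."""
    count = 0
    for _root, _dirs, files in os.walk(output_dir):
        # Skips _SUCCESS markers and hidden .crc checksum files.
        count += sum(1 for name in files if name.startswith("part-"))
    return count


def check_file_count(output_dir, max_files):
    """Raise if the job wrote more than max_files data files."""
    n = count_part_files(output_dir)
    if n > max_files:
        raise RuntimeError(f"{n} files written to {output_dir}, limit is {max_files}")
    return n
```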

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up each combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guestimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dumped it to disk, it all dumps in parallel, and finishes in ~9 min with the whole thing. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?

We had the same problem (almost) and we ended up by working directly with RDD (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner)
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a) Read the messages from Kafka and transform them to a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b) Use a reduceByKeyLocally transformation and obtain a map of metrics (number of messages, message sizes, etc.) for each group, e.g. Map[GroupBy, GroupByMetrics].
c) Create a GroupPartitioner that uses the previously collected metrics (and some input parameters like the desired Parquet size etc.) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any).
d) Partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e) Invoke spark.sparkContext.runJob with two parameters: the first one is the RDD partitioned at d), the second one is a custom function (func: (TaskContext, Iterator[T]) => U) that will write the messages taken from Iterator[T] into Hadoop/Parquet.
Let's say that we have 100 mil messages, grouped like that
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
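The arithmetic behind those numbers can be sketched in a few lines. This is not the actual GroupPartitioner (which is Scala and extends org.apache.spark.Partitioner), only an illustration of the idea: the partition count per group is the ceiling of count / messages-per-partition, and each group gets its own contiguous block of partition indices. All function names here are hypothetical.

```python
def partitions_per_group(message_counts, messages_per_partition):
    """Number of partitions for each group: ceil(count / messages_per_partition)."""
    return {
        group: -(-count // messages_per_partition)  # integer ceiling division
        for group, count in message_counts.items()
    }


def partition_offsets(partition_counts):
    """Give each group a contiguous block of partition indices."""
    offsets, next_index = {}, 0
    for group, n in partition_counts.items():
        offsets[group] = next_index
        next_index += n
    return offsets, next_index


def get_partition(group, message_id, partition_counts, offsets):
    """Pick one partition inside the group's block, spread by message_id."""
    return offsets[group] + message_id % partition_counts[group]


counts = {"Group1": 2_000_000, "Group2": 80_000_000, "Group3": 18_000_000}
parts = partitions_per_group(counts, 1_500_000)
offsets, total = partition_offsets(parts)
print(parts)  # {'Group1': 2, 'Group2': 54, 'Group3': 12}
print(total)  # 68
```

Here total plays the role of numPartitions and get_partition the role of getPartition(key).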

This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to the driver is never a good approach: with large volumes you risk OOM issues, and the latency of your approach is high.
Use the below to simplify things and get the benefit of Spark's parallelism, saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
The shuffle happens in repartition; partitionBy on the writer never shuffles.
It is that simple, and it uses Spark's default parallelism.
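In PySpark, the one-liner above could look like the following sketch. The function name, the output path and the choice of ORC (taken from the question) are assumptions for illustration; df is the full DataFrame holding all combinations.

```python
def write_partitioned_orc(df, output_path, partition_cols):
    """Shuffle rows so each column combination sits in one Spark partition, then write."""
    (
        df.repartition(*partition_cols)   # one shuffle, keyed on the partition columns
        .write
        .partitionBy(*partition_cols)     # one output folder per combination, no shuffle
        .orc(output_path)
    )
```

It would be called as write_partitioned_orc(df, path, ["date", "hour", "anotherAttr"]). Because all rows of one combination land in the same Spark partition, each folder gets a single file; very large combinations would need an extra column in the repartition call to be split over several files.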

Does Spark distribute a dataframe across nodes internally?

I am trying to use Spark for processing csv file on cluster. I want to understand if I need to explicitly read the file on each of the worker nodes to do the processing in parallel or will the driver node read the file and distribute the data across cluster for processing internally? (I am working with Spark 2.3.2 and Python)
I know RDD's can be parallelized using SparkContext.parallelize() but what in case of Spark DataFrames?
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date>'2010-12-01' AND date<='2010-12-02' AND town=='Madrid'")
So if I am running the above code on cluster, will the entire operation be done by driver node or will it distribute df across cluster and each worker perform processing on its data partition?

To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish between transformations and actions. As your code consists of only one filter(...) transformation, nothing will happen in terms of reading or processing data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends -- in your code -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after you read the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways in which the processing of your single CSV file can be parallelized:
You can manually repartition your source DataFrame by calling df.repartition(n), with n the number of partitions you want to have. But -- and this is a significant but -- this means that all data is potentially sent over the network (aka shuffle)!
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
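The first option can be sketched as follows. The function name and parameters are made up for illustration, and the filter is the one from the question. Note that the partition count reported before repartitioning also depends on the file size and on settings such as spark.sql.files.maxPartitionBytes, so a large CSV file may already come back as more than one partition.

```python
def count_filtered_rows(spark, csv_path, num_partitions):
    """Filter the CSV, report the partition count, then count rows after repartitioning."""
    df = spark.read.csv(csv_path, header=True)
    # filter() is a transformation: it only extends the execution plan.
    filtered = df.filter(
        "date > '2010-12-01' AND date <= '2010-12-02' AND town == 'Madrid'"
    )
    partitions_before = filtered.rdd.getNumPartitions()
    # repartition() adds a shuffle; count() is the action that runs the job.
    row_count = filtered.repartition(num_partitions).count()
    return partitions_before, row_count
```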

How to process data in parallel but write results in a single file in Spark

I have a Spark job that:
Reads data from hdfs
Does some intensive transformation without shuffling and aggregation (only map operations)
Writes results back to hdfs
Let's say I have 10GB of raw data (40 blocks = 40 input partitions), which results in 100MB of processed data. To avoid generating many small files in hdfs I use "coalesce(1)" statement in order to write single file with results.
Doing so I get only 1 task running (because of "coalesce(1)" and absence of shuffling), which processes all 10GB in a single thread.
Is there a way to do actual intensive processing in 40 parallel tasks and reduce number of partitions right before writing to disk and avoid data shuffle?
I have an idea that might work - to cache dataframe in memory after all processing (do a count to force Spark to cache the data) and then put "coalesce(1)" and write dataframe to disk

The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead of
coalesce(1)
you can try
repartition(1)
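Applied to the job in the question, that could look like the sketch below. The function name, the output path and the format argument are assumptions for illustration; df is the DataFrame after all the map-only transformations.

```python
def write_single_file(df, output_path, file_format="parquet"):
    """Keep the upstream work parallel and merge to one partition only for the write."""
    # repartition(1) inserts a shuffle boundary: the 40 upstream tasks still run
    # in parallel, and only the small processed result goes to the single writer task.
    df.repartition(1).write.format(file_format).save(output_path)
```

The shuffle here only moves the processed output (about 100 MB in the example), not the 10 GB of raw input.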
