For loop keeps restarting in EMR (PySpark)

I have a nested for loop that performs operations on a data frame 10 times in the inner loop and combines the resulting 10 data frames into a single data frame once the inner loop finishes.
UPDATE: I use a dictionary to create a list of dataframes, store the result of each operation in it, and then union them at the end of the inner loop.
It then writes the combined data frame to a parquet file named with the iteration number of the outer loop.
The outer loop has 6 iterations and should therefore result in 6 parquet files.
It goes something like this:
train = 0
for i in range(0, 6):
    train = train + 30
    # For loop to aggregate input and create 10 output dataframes
    dfnames = {}
    for j in range(0, 10):
        ident = "_" + str(j)
        # Load dataframe of around 1M rows
        df = spark.read.parquet("s3://path")
        dfnames['df' + ident] =  # Perform aggregations and operations
    # Combine the 10 dataframes into a single df
    df_out = df_1.unionByName(df_2).unionByName(df_3)...unionByName(df_10)
    # Write to output parquet file
    df_out.write.mode('overwrite').parquet("s3://path/" + str(train) + ".parquet")
It seems to be working fine until it finishes the 3rd iteration of the outer loop. Then for some reason, it restarts the loop with another attempt id.
So I get the first 3 files, but instead of going on to the 4th iteration, it restarts and produces the first file all over again. I don't get any failed stages or jobs.
I have tried running the for loops alone with dummy variables and print statements (without loading the large data frames etc) and they work fine to completion.
I am thinking it has something to do with the way the memory is being flushed after a loop.
These are my EMR Spark running conditions:
I am running this on an EMR cluster with 5 executors, 5 driver nodes, and 10 instances with a total of 50 cores. The Spark executor and driver memory is 45G each, with a total of about 583G.
The typical shuffle read is 250G and shuffle write is 331G.
Is there something I am doing wrong with regard to the loop or memory management?
Any insight would be greatly appreciated!

Try not to mix Python data structures with Spark data structures.
You want to convert the for loops into a map-reduce / foreach style of design.
Along with this, you can cache or checkpoint the dataframe in each iteration to avoid rerunning the entire DAG from scratch.
To cache your data:
df.cache()
To checkpoint:
spark.sparkContext.setCheckpointDir('<some path>')
df.checkpoint()
These will show performance and scale improvements once you use Spark constructs instead of Python constructs. For example, replace your for loop with foreach, and replace the union of a list with a map-reduce style reduce, as in the sketch below.
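As a rough illustration of both suggestions, here is a minimal sketch that unions the inner-loop results with reduce and checkpoints before each write; aggregate() and the checkpoint path are hypothetical placeholders for the asker's own aggregation logic and storage location:
from functools import reduce
from pyspark.sql import DataFrame

spark.sparkContext.setCheckpointDir("s3://path/checkpoints")  # hypothetical checkpoint location

def aggregate(df):
    # hypothetical placeholder for the per-iteration aggregations
    return df

train = 0
for i in range(6):
    train += 30
    # Build the 10 transformed dataframes in a list instead of a dict of named variables
    dfs = [aggregate(spark.read.parquet("s3://path")) for j in range(10)]
    # Union them with reduce, then checkpoint to truncate the lineage before writing
    df_out = reduce(DataFrame.unionByName, dfs).checkpoint()
    df_out.write.mode('overwrite').parquet("s3://path/" + str(train) + ".parquet")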

How are you getting your df1, df2... before this line?
# Combine the 10 dataframes into a single df
df_out = df_1.unionByName(df_2).unionByName(df_3)...unionByName(df_10)
My guess is that your dataframe plan is growing large, and that might be causing issues.
I would suggest creating a list of dataframes in the inner loop and using the reduce method to union them.
Something like below:
from functools import reduce
from pyspark.sql import DataFrame

df_list = []
for j in range(0, 10):
    # Load dataframe of around 1M rows
    df = spark.read.parquet("s3://path")
    transformed_df =  # do your transforms
    df_list.append(transformed_df)

final_df = reduce(DataFrame.unionByName, df_list)

Related

Executing a function in parallel to process huge XML files in PySpark

I have a Spark dataframe filedf which has only 1 column (filename) and many rows. These are the filenames of XML files, each with size >= 1GB. There is another function as below:
def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    <do process 1>
    <do process 2>
I want to call the function transformfiles on all the rows of the dataframe filedf concurrently.
Currently I am using a for loop to loop through all the rows in the dataframe and call this function which only runs sequentially.
filenames = filedf.select('filename').collect()
filelist = [r['filename'] for r in filenames]
for fname in filelist:
    transformfiles(fname)
I have also tried the udf approach of wrapping the function in a udf and then using it in withColumn like below.
def transformfiles(filename):
    ordered_dict = xmltodict.parse(filename)
    <do process 1>
    <do process 2>
    return "Success"

transform_udf = udf(lambda x: transformfiles(x), StringType())
df2 = filedf.withColumn("process_status", transform_udf("filename"))
Both these approaches run in about the same time.
I am running a cluster with 140 GB memory, 20 cores, and 17 workers.
Please let me know if there is an approach to bring about parallelism while doing this. I am not sure if the approach I am using utilizes the cluster resources efficiently.
Depending on how you built that dataframe with the filenames, it may consist of just one partition. And since Spark parallelizes at the partition level, the UDF version would then indeed still go over all the data essentially sequentially.
You must make sure that you partition your dataframe into multiple partitions, and then those partitions can be handled in parallel by multiple executors/workers.
Use something like filedf.repartition(numPartitions, "filename") to ensure your data is distributed over multiple partitions. For the number of partitions, that depends on various things, like how much resources an executor would need to parse such an XML file (and so how many concurrent parsing jobs you can have running on your cluster), things like possible data skew, etc. You could always start out with e.g. the default value of 200 to see the effect and start tuning.
An additional remark: your dataframe contains just the filename, and you do not return actual parsed data from the UDF. So, apparently your transformfiles function takes care of actually getting the file content, and of writing/handling the parsed data somewhere (so you want to use Spark mainly for easy parallelization, not really for data processing?). Ensure that you don't have any bottlenecks in those parts either (for example, if 200 Spark executors would start concurrently writing to a single external destination and overload it).
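A minimal sketch of that suggestion, assuming the same transformfiles UDF as in the question; the partition count and output path here are only illustrative:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Spread the filenames over multiple partitions so several executors can parse in parallel
filedf_repart = filedf.repartition(200, "filename")   # 200 is just a starting point to tune

transform_udf = udf(transformfiles, StringType())
result = filedf_repart.withColumn("process_status", transform_udf("filename"))

# An action is still needed to actually trigger the UDF on every row
result.write.mode("overwrite").parquet("s3://some/status/path")   # hypothetical output location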

Why does Spark crossJoin take so long for a tiny dataframe?

I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and it took 30 seconds to achieve the task. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct before the join; it requires a shuffle, so it repartitions the data based on the spark.sql.shuffle.partitions property value (200 by default). Thus, df.select('a').distinct() and df.select('b').distinct() result in new DataFrames with 200 partitions each, and 200 x 200 = 40000 tasks for the cross join.
Two things: it looks like you cannot directly control the number of partitions a DataFrame is created with, so we can first create an RDD instead (where you can specify the number of partitions) and convert it to a DataFrame. You can also set the shuffle partitions to 1. Together these ensure you will have just 1 partition during the whole execution, which should speed things up.
Just note that this shouldn't be an issue at all for larger datasets, for which Spark is designed (it would be faster to achieve the same result on a dataset of this size without using Spark at all). So in the general case you won't really need to do stuff like this; instead, tune the number of partitions to your resources/data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = sc.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set sets the configuration for the current session only; if you want more permanent changes, make them in the actual Spark config file.
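For example, the equivalent permanent settings would go into spark-defaults.conf (the values here are only illustrative for this toy case and are far too low for real workloads):
# conf/spark-defaults.conf
spark.default.parallelism      1
spark.sql.shuffle.partitions   1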

spark coalesce(20) overwrite parallelism of repartition(1000).groupby(xxx).apply(func)

Note: This is not a question about the difference between coalesce and repartition; there are many questions about that, and mine is different.
I have a PySpark job:
df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())
df1.write.parquet(output_path, mode='overwrite')
The default of 200 partitions would require too much memory, so I changed the repartition to 1000.
As the output is only 44M, I tried to use coalesce to avoid too many little files slowing down HDFS.
What I did was just add .coalesce(20) before .write.parquet(output_path, mode='overwrite'):
df = spark.read.parquet(input_path)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    ...
    return pdf

df = df.repartition(1000, 'store_id', 'product_id')
df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
df1 = df1.withColumnRenamed('y', 'yhat')
print('Partition number: %s' % df.rdd.getNumPartitions())  # 1000 here
df1.coalesce(20).write.parquet(output_path, mode='overwrite')
Then the Spark web UI showed that only 20 tasks were running.
With repartition(1000), the parallelism was determined by my vcore count, 36 here, and I could trace the progress intuitively (the progress bar size was 1000).
After coalesce(20), the previous repartition(1000) lost its effect: parallelism went down to 20, and I lost that intuitive view too.
Adding coalesce(20) also caused the whole job to get stuck and fail without any notification.
Changing coalesce(20) to repartition(20) works, but according to the documentation, coalesce(20) is much more efficient and should not cause such a problem.
I want higher parallelism, with only the result coalesced to 20. What is the correct way?
coalesce is considered a narrow transformation by the Spark optimizer, so it will create a single WholeStageCodegen stage from your groupby to the output, thus limiting your parallelism to 20.
repartition is a wide transformation (i.e. it forces a shuffle); when you use it instead of coalesce, it adds a new output stage but preserves the groupby-train parallelism.
repartition(20) is a very reasonable option in your use case (the shuffle is small so the cost is pretty low).
Another option is to explicitly prevent Spark optimizer from merging your predict and output stages, for example by using cache or persist before your coalesce:
# Your groupby code here
from pyspark.storagelevel import StorageLevel

df1.persist(StorageLevel.MEMORY_ONLY)\
   .coalesce(20)\
   .write.parquet(output_path, mode='overwrite')
Given your small output size, a MEMORY_ONLY persist + coalesce should be faster than a repartition, but this doesn't hold as the output size grows.
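For reference, the repartition alternative mentioned above is a one-line change at the write step (a sketch, keeping the rest of the job unchanged):
# Wide transformation: adds an extra shuffle/output stage but keeps the groupby at 1000-way parallelism
df1.repartition(20).write.parquet(output_path, mode='overwrite')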

Memory efficient cartesian join in PySpark

I have a large dataset of string ids that can fit into memory on a single node in my Spark cluster. The issue is that it consumes most of the memory of a single node.
These ids are about 30 characters long. For example:
ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
I am looking to write to file a list of all of the pairs of ids. For example:
id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
HbDckDXCye20kwu0gfeGpLGWnJ2yif,O2LWk4MAbcrOCWo3IVM0GInelSXfcG
# etc...
So I need to cross join the dataset on itself. I was hoping to do this on PySpark using a 10 node cluster, but it needs to be memory efficient.
PySpark will handle your dataset easily and memory-efficiently, but it will take time to process 10^8 * 10^8 records (the estimated size of the cross join result). See the sample code:
from pyspark.sql.types import *
df = spark.read.csv('input.csv', header=True, schema=StructType([StructField('id', StringType())]))
df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show()
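To match the example output in the question (which lists only pairs of different ids) and to write the result to file rather than show it, a possible follow-up sketch (the output path is hypothetical):
from pyspark.sql.functions import col

pairs = (df.withColumnRenamed('id', 'id1')
           .crossJoin(df.withColumnRenamed('id', 'id2'))
           .filter(col('id1') != col('id2')))   # drop self-pairs, as in the example

pairs.write.csv('output_pairs', header=True)    # hypothetical output location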

collect RDD with buffer in pyspark

I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the name node, so running collect() would cause an error.
Is there a way to recreate the collect() operation but with a generator, so that rows from the RDD are passed into a buffer? Another option would be to take() 100000 rows at a time from a cached RDD, but I don't think take() allows you to specify a start position?
The best available option is to use RDD.toLocalIterator, which collects only a single partition at a time. It creates a standard Python generator:
rdd = sc.parallelize(range(100000))
iterator = rdd.toLocalIterator()
type(iterator)
## generator
even = (x for x in iterator if not x % 2)
You can adjust the amount of data collected in a single batch by using a specific partitioner and adjusting the number of partitions.
Unfortunately, it comes with a price. To collect small batches you have to start multiple Spark jobs, and that is quite expensive. So, generally speaking, collecting one element at a time is not an option.
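If you do want small local batches on top of that generator, one option (a sketch, with an arbitrary batch size and partition count) is to slice it with itertools:
from itertools import islice

rdd = sc.parallelize(range(100000), 100)   # more partitions -> smaller chunks fetched at once
it = rdd.toLocalIterator()

while True:
    batch = list(islice(it, 10000))   # pull at most 10000 rows into local memory at a time
    if not batch:
        break
    # process(batch) locally here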
