RDD of gzipped files to "uncompressed" DataFrame - apache-spark

I have a bunch of compressed text files, each line of which contains a JSON object. Simplified, my workflow looks like this:
string_json = sc.textFile('/folder/with/gzip/textfiles/')
json_objects = string_json.map(make_a_json)
DataRDD = json_objects.map(extract_data_from_json)
DataDF = sqlContext.createDataFrame(DataRDD, schema).collect()
# followed by some transformations to the DataFrame
Now, the code works fine. The problem arises as soon as the number of files cannot be evenly divided among the executors.
As far as I understand it, that is because Spark does not extract the files and then distribute the rows to the executors; rather, each executor gets one whole file to work with.
E.g. if I have 5 files and 4 executors, the first 4 files are processed in parallel and then the 5th file is processed on its own.
Because the 5th file is not processed in parallel with the other 4 and cannot be divided between the 4 executors, it takes about the same amount of time as the first 4 together.
This happens at every stage of the program.
Is there a way to transform this kind of compartmentalized RDD into an RDD or DataFrame that is not?
I'm using Python 3.5 and Spark 2.0.1.

Spark operations are divided into tasks, or units of work that can be done in parallel. There are a few things to know about sc.textFile:
If you're loading multiple files, you're going to get 1 task per file, at minimum.
If you're loading gzipped files, you're going to get 1 task per file, at maximum.
Based on these two premises, your use case is going to see one task per file. You're absolutely right about how the tasks/cores ratio affects wall-clock time: having 5 tasks running on 4 cores will take roughly the same time as 8 tasks on 4 cores (though not quite, because stragglers exist and the first core to finish will take on the 5th task).
A rule of thumb is that you should have roughly 2-5 tasks per core in your Spark cluster to see good performance. But if you only have 5 gzipped text files, you're not going to see this. You could try to repartition your RDD (which uses a somewhat expensive shuffle operation) if you're doing a lot of work downstream:
repartitioned_string_json = string_json.repartition(100)  # RDD.repartition() always shuffles; it takes no shuffle argument
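For context, a minimal sketch of where that repartition would slot into the pipeline from the question (make_a_json, extract_data_from_json and schema are the question's own names, not defined here):
string_json = sc.textFile('/folder/with/gzip/textfiles/')
# Spread the decompressed lines across many partitions before the heavier per-row work;
# 100 is an arbitrary starting point, aim for roughly 2-5 tasks per core overall.
repartitioned_string_json = string_json.repartition(100)
json_objects = repartitioned_string_json.map(make_a_json)
DataRDD = json_objects.map(extract_data_from_json)
DataDF = sqlContext.createDataFrame(DataRDD, schema)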

Related

Improving performance for Spark with a large number of small files?

I have millions of gzipped files to process and convert to Parquet. I'm running a simple Spark batch job on EMR to do the conversion, giving it a couple million files at a time to convert.
However, I've noticed that there is a big delay from when the job starts to when the files are listed and split up into a batch for the executors to do the conversion. From what I have read and understood, the scheduler has to get the metadata for those files, and schedule those tasks. However, I've noticed that this step is taking 15-20 minutes for a million files to split up into tasks for a batch. Even though the actual task of listing the files and doing the conversion only takes 15 minutes with my cluster of instances, the overall job takes over 30 minutes. It appears that it takes a lot of time for the driver to index all the files to split up into tasks. Is there any way to increase parallelism for this initial stage of indexing files and splitting up tasks for a batch?
I've tried tinkering with and increasing spark.driver.cores thinking that it would increase parallelism, but it doesn't seem to have an effect.
You can try setting the config below:
spark.conf.set("spark.default.parallelism", x)
where x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5
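As a hedged illustration of what that formula comes out to, for a hypothetical cluster of 10 nodes with 16 cores per node (these numbers are not from the question):
total_nodes_in_cluster = 10   # hypothetical
total_cores_in_node = 16      # hypothetical
x = total_nodes_in_cluster * (total_cores_in_node - 1) * 5
spark.conf.set("spark.default.parallelism", x)   # 750 in this example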
This is a common problem with Spark (and other big data tools), as it uses only the driver to list all the files from the source (S3) and their paths.
Some more info here.
I have found this article really helpful for solving this issue.
Instead of using Spark to list and get the metadata of the files, we can use PureTools to create a parallelized RDD of the files and pass that to Spark for processing.
S3 Specific Solution
If you do not want to install and set up tools as in the guide above, you can also use an S3 manifest file to list all the files present in a bucket and iterate over the files using RDDs in parallel.
Steps for S3 Manifest Solution
# Create an RDD from a list of files
pathRdd = sc.parallelize([file1, file2, file3, ......., file100])
# Create a function which reads the data of a file
def s3_path_to_data(path):
    # Get the data from S3
    # Return the data in whichever format you like, i.e. a string, a list of strings, etc.
    ...
# Call flatMap on the pathRdd
dataRdd = pathRdd.flatMap(s3_path_to_data)
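One possible way to fill in s3_path_to_data, assuming boto3 is available and the paths look like s3://bucket/key pointing at gzipped text (the path layout and file format are assumptions, not part of the original answer):
import boto3
import gzip

def s3_path_to_data(path):
    # path is assumed to look like 's3://my-bucket/some/prefix/part-0001.gz'
    bucket, _, key = path.replace('s3://', '').partition('/')
    raw = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body'].read()
    # The question's files are gzipped, so decompress before splitting into lines
    text = gzip.decompress(raw).decode('utf-8')
    # Return one record per line so that flatMap yields a flat RDD of strings
    return text.splitlines()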
Details
Spark will create pathRdd with the default number of partitions, and then call the s3_path_to_data function on each partition's rows in parallel.
Partitions play an important role in Spark parallelism, e.g.:
If you have 4 executors and 2 partitions, then only 2 executors will do the work.
You can play around with the number of partitions and the number of executors to achieve the best performance for your use case.
Following are some useful attributes you can use to get insights into your DataFrame or RDD specs and fine-tune the Spark parameters:
rdd.getNumPartitions() (PySpark)
rdd.partitions.length (Scala)
rdd.partitions.size (Scala)
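For example, from PySpark (the target of 200 partitions is only a placeholder to tune for your cluster):
print(dataRdd.getNumPartitions())   # how many partitions flatMap will run over
dataRdd = dataRdd.repartition(200)  # raise parallelism before the heavy processing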

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in multiple small files, each 800 KB-1.5 MB in size. Because of this the job is split into a very large number of tasks and takes a long time to complete.
We have tried Spark tuning configurations like using broadcast joins, changing the partition size, changing the max records per file, etc., but there is no performance improvement with these methods and the issue is not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please view this link for Spark UI metrics screenshot, https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get a file for every Spark partition, and you have 33,479 partitions in your final stage where you're writing the output. 33k partitions was probably the right number for your join, but not the right number for your write.
You need to add another stage to your job that comes after your join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that outputs 32 MB - ~128 MB files).
Something like a coalesce, or repartition. Maybe even a sort :(
You want to target ~350 partitions.
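A minimal sketch of that extra stage, where result_df stands in for the join output and the format and path are placeholders:
# Shrink the ~33k join partitions to ~350 before writing, so each output file
# lands roughly in the 32 MB - 128 MB range.
result_df.repartition(350).write.mode('overwrite').parquet('/path/to/output/')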
You can do this manually, or automatically with Spark on Databricks.
If you're using Databricks, then it's easy: with Delta Lake you can turn on Auto Optimize.

Huge Multiline Json file is being processed by single Executor

I have a huge JSON file, 35-40 GB in size. It's a MULTILINE JSON file on HDFS. I have made use of spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
with PySpark.
I have bumped up to 60 executors, 16 cores, 16 GB memory and set the memory overhead parameters.
On every run the executors were being lost.
It works perfectly for smaller files, but not for files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time the data is processed by a single executor while all the other executors are idle.
I have seen the stages (0/2) Tasks(0/51)
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: My goal is to apply a UDF to each of the columns, clean the data, and write it out in CSV format.
The size of the DataFrame is 8 million rows with 210 columns.
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU for processing the following code
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of file blocks when the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it has more than 300 blocks with a 128 MB block size, so Spark tasks should be able to run in parallel on a file located in HDFS. If you are stuck without parallelism, I think this is because the multiline option is specified: a multiline JSON file cannot be split.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
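One hedged workaround, if splitting the source file outside Spark is not practical: pay the single-task multiline read once and persist the data in a splittable, line-delimited form so that later jobs can parallelize (the output path and partition count are placeholders):
# The multiline read below still runs as a single task, but only once;
# the line-delimited output it produces can be split on subsequent reads.
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
df.repartition(200).write.mode('overwrite').json('hdfs:///tmp/json_lines/')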

Spark 12 GB data load with Window function Performance issue

I am using Spark SQL to transform 12 GB of data. My transformation is to apply a row_number function with partitionBy on one of the fields, then divide the data into two sets: the first set where row_number is 1, and a second set containing the rest of the data, and then write the data to the target location in 30 partitions.
My job is currently taking approximately 1 hour. I want to run it in less than 10 minutes.
I am running this job on a 3-node cluster with specs (16 cores & 32 GB RAM).
Node 1: YARN master node.
Node 2: two executors, 1 driver and 1 other.
Node 3: two executors, both for processing.
Each executor is assigned 5 cores and 10 GB memory.
Is my hardware enough or do I need more powerful hardware?
Is the executor configuration right?
If both the hardware and the configuration are good, then I definitely need to improve my code.
My code is as follow.
from pyspark.sql import SQLContext, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, lit, when

sqlContext = SQLContext(sc)
SalesDf = sqlContext.read.options(header='true').load(path, format='csv')
SalesDf.cache()
SalesDf_Version = SalesDf.withColumn('row_number', F.row_number().over(Window.partitionBy("id").orderBy(desc("recorddate"))))
initialLoad = SalesDf_Version.withColumn('partition', SalesDf_Version.year).withColumn('isActive', when(col('row_number') == 1, lit('Y')).when(col('row_number') != 1, lit('N')))
initialLoad = initialLoad.withColumn('version_flag', col('isActive')).withColumn('partition', col('city'))
initialLoad = initialLoad.drop('row_number')
initialLoad.coalesce(1).write.partitionBy('isActive', 'partition').option("header", "true").mode('overwrite').csv(path + 'Temp/target/')
sc.stop()
Thanks in advance for your help
You have a coalesce(1) before writing; what is the reason for that? Coalesce reduces the parallelization of that stage, which in your case will cause the windowing query to run on 1 core, so you're losing the benefit of the 16 cores per node.
Remove the coalesce and that should start improving things.
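A minimal sketch of the suggested change, keeping the question's own names (the partition count of 50 simply anticipates what the follow-up below settled on):
# Keep the window stage parallel and control the partition count only at write
# time, instead of collapsing everything with coalesce(1).
initialLoad.repartition(50).write.partitionBy('isActive', 'partition') \
    .option("header", "true").mode('overwrite').csv(path + 'Temp/target/')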
Following are the changes that we implemented to improve the performance of our code.
We removed coalesce and used repartition(50). We tried higher and lower numbers in the brackets, but 50 was the optimal number in our case.
We were using S3 as our target, but it was costing us a lot because of the file-rename step Spark performs when committing output, so we used HDFS instead and our job time was reduced to half of what it was before.
Overall, with the above changes our code ran in 12 minutes; previously it took 50 minutes.
Thanks
Ammar

Why is the processing time the same for 2 and 4 cores and different partitions in Spark Streaming?

I'm trying to run some tests regarding processing times for a Spark Streaming application, in local mode on my 4-core machine.
Here is my code:
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkstreaminggetjson");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER);
streamData1.print();
I am receiving 1 JSON message per second.
So, I test this for 4 different scenarios:
1) setMaster(...local[2]) and 1 partition
2) setMaster(...local[*]) and 1 partition
3) setMaster(...local[2]) and 4 partitions (using streamData1.repartition(4))
4) setMaster(...local[*]) and 4 partitions (using streamData1.repartition(4))
When I check the average processing times in the UI, this is what I get for each scenario:
1) 30 ms
2) 28 ms
3) 72 ms
4) 75 ms
My question is: why are the processing times pretty much the same for 1 and 2, and 3 and 4?
I realize that the increase from 2 to 4, for example, is normal, because repartition is a shuffle operation. What I don't get is, for example in 4), why is the processing time so similar to 3)? Shouldn't it be much smaller, since I am increasing the level of parallelization and I have more cores to distribute the tasks to?
Hope I wasn't confusing,
Thank you so much in advance.
Some of this depends on what your JSON message looks like; I'll assume each message is a single string without line breaks. In that case, with 1 message per second and a batch interval of 1 second, each batch will contain an RDD with just a single item. You can't split that up into multiple partitions, so when you repartition you still have the same situation data-wise, but with the overhead of the repartition step.
Even with larger amounts of data I would not expect too much of a difference when all you do to the data is print() it: this takes the first 10 items of your data, and if they can come from just one partition, I would expect Spark to optimize the job to compute only that one partition. In any case, you will get more representative numbers if you significantly increase the amount of data per batch and do some actual processing on the whole set, at minimum something like streamData1.count().print().
To get a better understanding of what happens, it is also useful to dig into the other parts of Spark's UI, like the Stages tab, which can tell you how much of the execution time is spent shuffling, serializing, etc. rather than in actual execution, and into things that affect performance, like DAGs that tell you which bits may be cached, and tasks that Spark was able to skip.
