How does Spark write compressed Parquet files? - apache-spark

Using Apache Spark 1.6.4 with the elasticsearch4hadoop plugin, I am exporting an Elasticsearch index (100M documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7.
I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB RAM).
The process runs as 5 tasks (one per ES shard) and takes about 1 hour. It works fine most of the time, but sometimes I see Spark tasks fail with an out-of-memory error.
During the process, I can see some temporary files in HDFS, but they are always 0-sized.
Q: Is Spark holding the data in memory before writing the gz.parquet file?
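For context, a minimal sketch of the kind of export described above (Scala, Spark 1.6-style SQLContext; the index name, output path, and use of the elasticsearch-hadoop DataFrame reader are assumptions, not the asker's actual code):
// The gzip codec for Parquet output is set through a Spark SQL property.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// elasticsearch-hadoop exposes the index as a DataFrame; 5 shards -> 5 read tasks.
val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .load("my_index/my_type")              // hypothetical index/type

df.write.parquet("hdfs:///exports/my_index.gz.parquet")   // hypothetical output path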

Related

Apache Beam AvroIO read large file OOM

Problem:
I am writing an Apache Beam pipeline to convert Avro files to Parquet files (with the Spark runner). Everything works well until I start converting large Avro files (15 GB).
The code used to read the Avro file and create the PCollection:
PCollection<GenericRecord> records =
    p.apply(FileIO.match().filepattern(s3BucketUrl + inputFilePattern))
     .apply(FileIO.readMatches())
     .apply(AvroIO.readFilesGenericRecords(inputSchema));
The error message from my entrypoint shell script is:
b'/app/entrypoint.sh: line 42: 8 Killed java -XX:MaxRAM=${MAX_RAM} -XX:MaxRAMFraction=1 -cp /usr/share/tink-analytics-avro-to-parquet/avro-to-parquet-deploy-task.jar
Hypothesis
After some investigation, I suspect that the AvroIO code above tries to load the whole Avro file as one partition, which causes the OOM issue.
One hypothesis I have is: if I can specify the number of partitions when reading the Avro file, say 100 partitions for example, then each partition will contain only 150 MB of data, which should avoid the OOM issue.
My questions are:
Does this hypothesis lead me in the right direction?
If so, how can I specify the number of partitions while reading the Avro file?
Instead of setting the number of partitions, the Spark session has a property called spark.sql.files.maxPartitionBytes, which is set to 128 MB by default, see the reference here.
Spark uses this number to partition input file(s) while reading them into memory.
I tested with a 50 GB Avro file and Spark partitioned it into 403 partitions. This Avro-to-Parquet conversion worked on a Spark cluster with 16 GB memory and 4 cores.
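As a rough sketch of how that property can be set on the Spark side (Scala; the value and paths are illustrative, and reading Avro this way assumes the spark-avro package is on the classpath):
// Lower maxPartitionBytes so each read partition stays small (default is 128 MB).
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("avro-to-parquet")
  .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)   // 64 MB per partition
  .getOrCreate()

val records = spark.read.format("avro").load("s3a://my-bucket/input/*.avro")   // hypothetical path
records.write.parquet("s3a://my-bucket/output/")                               // hypothetical path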

Huge Multiline Json file is being processed by single Executor

I have a huge JSON file, 35-40 GB in size; it is a MULTILINE JSON on HDFS. I have made use of spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
with PySpark.
I have bumped up to 60 executors, 16 cores, 16 GB memory each, and set the memory overhead parameters.
On every run the executors were being lost.
It works perfectly for smaller files, but not with files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time the data is being processed by a single executor while all other executors are idle.
I have seen the stages at (0/2) and tasks at (0/51).
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()   # number of rows per partition
df.write.save(HDFSLOCATION, format='csv')
Goal: My goal is to apply a UDF to each column, clean the data, and write the result to CSV format.
The dataframe has 8 million rows and 210 columns.
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU to process the following code
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of file blocks when the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it might have more than 300 blocks at a 128 MB block size, so Spark tasks should run in parallel when the file is located in HDFS. If you are stuck with parallelism, I think it is because the multiline option is specified.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
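A minimal sketch of the "split it into many files" idea (Scala; the paths and partition count are illustrative): the initial multiline read is still a single task, but writing the data back out as line-delimited JSON gives you many splittable files for every later job.
// Single-task read of the multiline file, then spread and rewrite as line-delimited JSON.
val df = spark.read.option("multiline", "true").json("hdfs:///data/MULTILINE_JSONFILE_.json")

df.repartition(50)
  .write
  .mode("overwrite")
  .json("hdfs:///data/jsonfile_split/")   // one line-delimited JSON file per partition

// Later reads of the split files parallelize across files/blocks as usual.
val splitDf = spark.read.json("hdfs:///data/jsonfile_split/")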

unable to launch more tasks in spark cluster

I have a 6-node cluster with 8 cores and 32 GB RAM each. I am reading a simple CSV file from Azure Blob Storage and writing to a Hive table.
When the job runs I see only a single task being launched and a single executor working, while all the other executors and instances sit idle/dead.
How can I increase the number of tasks so the job runs faster?
Any help appreciated.
I'm guessing that your CSV file is in one block. Therefore your data is in only one partition, and since Spark "only" creates one task per partition, you only have one.
You can call repartition(X) on your dataframe/RDD just after reading it to increase the number of partitions. Reading won't be faster, but all your transformations and the writing will be parallelized.
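A minimal sketch of that suggestion (Scala; the path, partition count, and table name are placeholders):
// Reading a single-block CSV is still one task, but repartitioning right after the read
// lets the transformations and the write use the whole cluster.
val df = spark.read
  .option("header", "true")
  .csv("wasbs://container@account.blob.core.windows.net/input.csv")   // hypothetical path
  .repartition(48)                                                    // e.g. 6 nodes x 8 cores

df.write.mode("overwrite").saveAsTable("my_hive_table")               // hypothetical Hive table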

Spark only running one executor for big gz files

I have input source files (compressed .gz) which I need to process using Spark. Each file is 5 GB (compressed gz) and there are around 11-12 files.
But when I give the source as input, Spark launches just one executor. I understand that this may be due to the non-splittable nature of the file, but even when I use a high-RAM box, e.g. c3.8xlarge, it still does not use more executors. The executor memory assigned is 45 GB and the executor cores 31.
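A rough sketch of the common workaround for non-splittable gzip input (Scala; the paths and partition count are illustrative): each .gz file is still decompressed by a single task, but repartitioning right after the read spreads the decoded records across the cluster for all subsequent stages.
// ~11-12 gzip files -> at most ~11-12 read tasks; repartition once the data is decoded.
val raw = spark.read.textFile("s3a://my-bucket/input/*.gz")   // hypothetical path
val spread = raw.repartition(96)
spread.write.text("s3a://my-bucket/output/")                  // hypothetical path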

Streaming from CSV files with Spark

I am trying to use Spark Streaming to collect data from CSV files located on NFS.
The code I have is very simple, and so far I have been running it only in spark-shell, but even there I am running into some issues.
I am running spark-shell with a standalone Spark master with 6 workers, and passing the following arguments to spark-shell:
--master spark://master.host:7077 --num-executors 3 --conf spark.cores.max=10
This is the code:
val schema = spark.read.option("header", true).option("mode", "PERMISSIVE").csv("/nfs/files_to_collect/schema/schema.csv").schema
val data = spark.readStream.option("header", true).schema(schema).csv("/nfs/files_to_collect/jobs/jobs*")
val query = data.writeStream.format("console").start()
There are 2 files in that NFS path, each about 200MB in size.
When I call writeStream, I get the following warning:
"17/11/13 22:56:31 WARN TaskSetManager: Stage 2 contains a task of very large size (106402 KB). The maximum recommended task size is 100 KB."
Looking in the Spark master UI, I see that only one executor was used - four tasks were created, each reading ~50% of each CSV file.
My questions are:
1) The more files there are in the NFS path, the more memory the driver seems to need - with 2 files, it would crash until I increased its memory to 2g. With 4 files it needs no less than 8g. What is the driver doing that it needs so much memory?
2) How do I control the parallelism of reading the CSV files? I noticed that the more files there are, the more tasks are created, but is it possible to control this manually?
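For question 2, a hedged sketch of two settings that influence this (Scala; the values are illustrative, and this does not address the driver-memory issue in question 1): the file streaming source accepts maxFilesPerTrigger to cap how many files each micro-batch picks up, and spark.sql.files.maxPartitionBytes influences how those files are split into read partitions.
// Cap the files per micro-batch and the size of each read partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)   // 32 MB

val data = spark.readStream
  .option("header", true)
  .option("maxFilesPerTrigger", 1)   // process one CSV file per micro-batch
  .schema(schema)
  .csv("/nfs/files_to_collect/jobs/jobs*")

val query = data.writeStream.format("console").start()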
