Streaming from CSV files with Spark - apache-spark

I am trying to use Spark Streaming to collect data from CSV files located on NFS.
The code I have is very simple, and so far I have been running it only in spark-shell, but even there I am running into some issues.
I am running spark-shell with a standalone Spark master with 6 workers, and passing the following arguments to spark-shell:
--master spark://master.host:7077 --num-executors 3 --conf spark.cores.max=10
This is the code:
val schema = spark.read.option("header", true).option("mode", "PERMISSIVE").csv("/nfs/files_to_collect/schema/schema.csv").schema
val data = spark.readStream.option("header", true).schema(schema).csv("/nfs/files_to_collect/jobs/jobs*")
val query = data.writeStream.format("console").start()
There are 2 files in that NFS path, each about 200MB in size.
When I call writeStream, I get the following warning:
"17/11/13 22:56:31 WARN TaskSetManager: Stage 2 contains a task of very large size (106402 KB). The maximum recommended task size is 100 KB."
Looking in the Spark master UI, I see that only one executor was used - four tasks were created, each reading ~50% of each CSV file.
My questions are:
1) The more files there are in the NFS path, the more memory the driver seems to need - with 2 files, it would crash until I increased its memory to 2g. With 4 files it needs no less than 8g. What is the driver doing that it needs so much memory?
2) How do I control the parallelism of reading the CSV files? I noticed that the more files there are, the more tasks are created, but is it possible to control this manually?

Related

How does Spark write a compressed Parquet file?

Using Apache Spark 1.6.4 with the elasticsearch4hadoop plugin, I am exporting an Elasticsearch index (100M documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7.
I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB RAM).
The job runs as 5 tasks (one per ES shard) and takes about 1 hour. It works fine most of the time, but sometimes I see some Spark tasks fail with out-of-memory errors.
During the process, I can see some temporary files in HDFS, but they are always zero-sized.
Q: Is Spark buffering the data in memory before writing the gz.parquet file?

Spark only running one executor for big gz files

I have input source files (compressed .gz) which I need to process using Spark. Each file is about 5 GB compressed, and there are around 11-12 files.
But when I give these as the input source, Spark launches just one executor. I understand that this may be due to the non-splittable nature of the files, but even when I use a high-RAM box, e.g. c3.8xlarge, it still does not use more executors. The executor memory assigned is 45 GB and the executor cores 31.

What is the correct way to query Hive on Spark for maximum performance?

Spark newbie here.
I have a pretty large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to pack it as a parquet file.
I'm using the default EMR cluster configuration, 6 * r3.xlarge instances, for my Spark application written in Python. I run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code to do so:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("s3://path/to/myfile.parquet")
Later, I submit it with something similar to this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files test_pyspark.py test_pyspark.py
However, my job takes forever to complete. Spark shuts down all but one worker very quickly after the job starts, since the others are not being used, and it takes a few hours before it has all the data from Hive. The Hive table itself is not partitioned or clustered yet (I also need some advice on that).
Could you help me understand what I'm doing wrong, where should I go from here and how to get the maximum performance out of resources I have?
Thank you!
I had a similar use case where I used Spark to write to S3 and had performance issues. The primary reason was that Spark was creating a lot of zero-byte part files, and renaming temporary files to their actual file names was slowing down the write process. I tried the approaches below as workarounds:
1) Write the Spark output to HDFS and use Hive to write to S3. Performance was much better, as Hive created fewer part files. The problem I had (I also had the same issue when using Spark) is that the delete action on the policy was not granted in the prod environment for security reasons; the S3 bucket was KMS-encrypted in my case.
2) Write the Spark output to HDFS, copy the HDFS files to local disk, and use aws s3 cp to push the data to S3. I had the second-best results with this approach. I created a ticket with Amazon and they suggested going with this one.
3) Use s3-dist-cp to copy the files from HDFS to S3. This worked with no issues, but was not performant. (A rough sketch of the HDFS-first pattern is shown below.)
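For reference, a minimal sketch of the HDFS-first pattern in PySpark, assuming an existing SparkSession spark and DataFrame df; the paths and the s3-dist-cp arguments are placeholders, not the exact values I used:
# stage the output on HDFS first, so the slow S3 rename/commit of temp files is avoided
df.repartition(20).write.mode('overwrite').parquet("hdfs:///tmp/staging/myfile.parquet")
# then copy the staged files to S3 outside Spark, e.g. with s3-dist-cp:
# s3-dist-cp --src=hdfs:///tmp/staging/myfile.parquet --dest=s3://my-bucket/path/myfile.parquet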

EMR 5.x | Spark on Yarn | Exit code 137 and Java heap space Error

I have been getting the error Container exited with a non-zero exit code 137 while running Spark on YARN. I have tried a couple of techniques after reading around, but they didn't help. The Spark configuration looks like this:
spark.driver.memory 10G
spark.driver.maxResultSize 2G
spark.memory.fraction 0.8
I am using yarn in client mode. spark-submit --packages com.databricks:spark-redshift_2.10:0.5.0 --jars RedshiftJDBC4-1.2.1.1001.jar elevatedailyjob.py > log5.out 2>&1 &
The sample code :
# Load the file (its a single file of 3.2GB)
my_df = spark.read.csv('s3://bucket-name/path/file_additional.txt.gz', schema=MySchema, sep=';', header=True)
# write the de_pulse_ip data into parquet format
my_df = my_df.select("ip_start","ip_end","country_code","region_code","city_code","ip_start_int","ip_end_int","postal_code").repartition(50)
my_df.write.parquet("s3://analyst-adhoc/elevate/tempData/de_pulse_ip1.parquet", mode = "overwrite")
# read my_df data into a dataframe from the parquet files
my_df1 = spark.read.parquet("s3://bucket-name/path/my_df.parquet").repartition("ip_start_int","ip_end_int")
#join with another dataset 200 MB
my_df2 = my_df.join(my_df1, [my_df.ip_int_cast > my_df1.ip_start_int,my_df.ip_int_cast <= my_df1.ip_end_int], how='right')
Note: the input file is a single gzip file. Its unzipped size is 3.2 GB.
Here is the solution to the above issues.
Exit code 137 and the Java heap space error are mainly related to memory on the executors and the driver. Here is what I did:
1) increase the driver memory: spark.driver.memory 16G
2) increase the storage memory fraction: spark.storage.memoryFraction 0.8
3) increase the executor memory: spark.executor.memory 3G
(A sketch of setting these programmatically is shown below.)
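For completeness, a rough sketch of what those settings look like in code, assuming Spark 2.x (spark.storage.memoryFraction only matters with the legacy memory manager, and spark.driver.memory is normally passed on spark-submit since it must be set before the driver JVM starts):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("elevatedailyjob")
         .config("spark.driver.memory", "16g")           # usually set via spark-submit instead
         .config("spark.storage.memoryFraction", "0.8")  # legacy memory manager only
         .config("spark.executor.memory", "3g")
         .getOrCreate())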
One very important thing I would like to share, which actually made a huge impact on performance, is the following:
As I mentioned above, I have a single file (a gzipped .csv of 3.2 GB) which after unzipping becomes 11.6 GB.
To load gzip files, Spark always uses a single task per .gz file, because it cannot parallelize the read (even if you increase the number of partitions), since gzip files are not splittable. This hampers the whole performance, as Spark first reads the entire file using one executor (I am running spark-submit in client mode), then uncompresses it, and only then repartitions (if a repartition was requested).
To address this, I used the s3-dist-cp command to move the file from S3 to HDFS, and also reduced the target file size so as to increase parallelism. Something like below:
/usr/bin/s3-dist-cp --src=s3://bucket-name/path/ --dest=/dest_path/ --groupBy='.*(additional).*' --targetSize=64 --outputCodec=none
Although it takes a little time to move the data from S3 to HDFS, the overall performance of the process increases significantly.
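Once the data sits on HDFS as uncompressed ~64 MB pieces, the read parallelizes on its own; roughly (the path is a placeholder, and MySchema is the schema from the snippet above):
# each ~64 MB uncompressed piece on HDFS becomes at least one input task
my_df = spark.read.csv('hdfs:///dest_path/', schema=MySchema, sep=';', header=True)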

Tuning Spark (YARN) cluster for reading 200GB of CSV files (pyspark) via HDFS

I'm working with an 11-node cluster running on AWS right now (1 master, 10 workers - c3.4xlarge) and I'm attempting to read in ~200GB of .csv files (only about 10 actual .csv files) from HDFS.
This process is going really slowly. I'm watching Spark at the command line and it looks like this:
[Stage 0:> (30 + 2) / 2044]
Progress increases by +2 units (meaning 30+2 goes to 32+2, then 34+2, etc.) maybe every 20 seconds. So this is in serious need of improvement, or else we'll be here for about 11 hours before the files are even done being read.
This is the code to this point..
# AMAZON AWS EMR
from pyspark import SparkConf, SparkContext

def sparkconfig():
    conf = SparkConf()
    conf.setMaster("yarn-client")  # client mode gets output to the terminal
    conf.set("spark.default.parallelism", "340")
    conf.setAppName("my app")
    conf.set("spark.executor.memory", "20g")
    return conf

sc = SparkContext(conf=sparkconfig(),
                  pyFiles=['/home/hadoop/temp_files/redis.zip'])
path = 'hdfs:///tmp/files/'
all_tx = sc.textFile(path).coalesce(1024)
... more code for processing
Now, clearly 1024 partitions may not be correct; that was just from googling and trying different things. I'm really at a loss when it comes to tuning this job.
The worker nodes on AWS are c3.4xlarge instances (I have 10 in the cluster), each with 30 GB of RAM and 16 vCPUs. The HDFS partition is made up of the local storage of each node in the cluster, which is 2x160 GB SSDs, so I believe we're looking at (2 * 160 GB * 10 nodes / 3x replication) = ~1 TB of HDFS.
The .csv files themselves range from 5GB to 90GB in size.
To clarify, in case it is relevant: the Hadoop cluster uses the same nodes as the Spark cluster. I'm allocating 20 GB of the 30 GB per node to the Spark executor, leaving 10 GB per node for the OS + Hadoop/YARN, etc. The name node / Spark master node is an m3.xlarge, which has 4 vCPUs and 16 GB of RAM.
Does anyone have suggestions on tuning options (or anything, really) that I might try to speed up this file-read process?
Shameless plug (I'm the author): try Sparklens https://github.com/qubole/sparklens
Most of the time the real question is not whether the application is slow, but whether it will scale. And for most applications, the answer is: only up to a limit.
The structure of a Spark application puts important constraints on its scalability. The number of tasks in a stage, dependencies between stages, skew, and the amount of work done on the driver side are the main constraints.
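Check the project README for the exact coordinates, but roughly it attaches as an extra listener at submit time, something like spark-submit --packages qubole:sparklens:0.3.2-s_2.11 --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener ... (the package version may differ), and it prints its analysis at the end of the application.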
