Spark-based processing of data stored on SSD - apache-spark

We are currently running a Spark 2.1 based application which analyses and processes a huge number of records to generate stats used for report generation. We are using 150 executors, 2 cores per executor and 10 GB per executor for our Spark jobs, and the data size is ~3 TB, stored in Parquet format. Processing 12 months of data takes ~15 minutes.
Now, to improve performance, we want to try fully SSD-based nodes for storing the data in HDFS. The question is: are there any special configurations/optimisations to be done for SSD? Are there any studies of Spark processing performance on SSD-based HDFS vs HDD-based HDFS?

http://spark.apache.org/docs/latest/hardware-provisioning.html#local-disks
SPARK_LOCAL_DIRS is the config you need to change.
https://www.slideshare.net/databricks/optimizing-apache-spark-throughput-using-intel-optane-and-intel-memory-drive-technology-with-ravikanth-durgavajhala
The use case there is the K-means algorithm, but it should still help.
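In practice that means pointing Spark's scratch space at the SSD mounts. A minimal sketch, assuming hypothetical mount points (note that on YARN the NodeManager's yarn.nodemanager.local-dirs takes precedence, so it should point at the SSDs as well):

from pyspark.sql import SparkSession

# Hypothetical SSD mount points; shuffle spill and temporary files will land here.
spark = (SparkSession.builder
         .appName("ssd-scratch-dirs")
         .config("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")
         .getOrCreate())

Equivalently, you can export SPARK_LOCAL_DIRS=/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp in spark-env.sh on each node.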

Related

pyspark java.lang.OutOfMemoryError: GC overhead limit exceeded

I'm trying to process 10 GB of data using Spark and it is giving me this error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Laptop configuration: 4 CPUs, 8 logical cores, 8 GB RAM
Spark configuration while submitting the Spark job:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[6]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", 1)
spark.conf.set("spark.executor.cores", 5)
After searching the internet about this error, I have a few questions; if answered, that would be a great help.
1) Spark is an in-memory computing engine; to process 10 GB of data, does the system need 10+ GB of RAM? Does Spark load the 10 GB of data into 10+ GB of RAM and then do the job?
2) If point 1 is correct, how are big companies processing 100s of TBs of data? Are they clustering multiple systems to form 100+ TB of RAM and then processing the 100 TB of data?
3) Is there no way to process 50 GB of data with 8 GB RAM and 8 cores by setting the proper Spark configurations? If there is, what is the way and what should the Spark configurations be?
4) What would be the ideal Spark configuration for processing 8 GB of data if the system properties are 8 GB RAM and 8 cores?
Spark configuration to be defined in the Spark config:
spark = SparkSession.builder.master('local[?]').config("spark.ui.port", "4041").appName('test').getOrCreate()
spark.conf.set("spark.executor.instances", ?)
spark.conf.set("spark.executor.cores", ?)
spark.executors.cores = ?
spark.executors.memory = ?
spark.yarn.executor.memoryOverhead =?
spark.driver.memory =?
spark.driver.cores =?
spark.executor.instances =?
No.of core instances =?
spark.default.parallelism =?
I hope the following helps, if not clarifies everything.
1) Spark is an in-memory computing engine; to process 10 GB of data, does the system need 10+ GB of RAM? Does Spark load the 10 GB of data into 10+ GB of RAM and then do the job?
Spark, being an in-memory computation engine, takes its input/source from an underlying data lake or distributed storage system. The 10 GB file will be broken into smaller blocks (128 MB or 256 MB block size for a Hadoop-based data lake) and the Spark driver will use many executors/cores to read them from the cluster's worker nodes. If you try to load 10 GB of data on a laptop or a single node, it will certainly go out of memory. The data has to be loaded either on one machine or across many worker nodes before it is processed.
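As a quick illustration of the block splitting (the path below is hypothetical), you can check how many block-sized partitions Spark creates for a large file; each partition becomes a task read by an executor core:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()

# Hypothetical large input on a distributed store.
df = spark.read.parquet("hdfs:///data/events/")

# Roughly file_size / block_size partitions, read in parallel by the executors.
print(df.rdd.getNumPartitions())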
2) If point 1 is correct, how are big companies processing 100s of TBs of data? Are they clustering multiple systems to form 100+ TB of RAM and then processing the 100 TB of data?
Large data processing projects design the storage and access layer with a lot of design patterns. They don't simply dump GBs or TBs of data into a file system like HDFS. They use partitions (e.g. sales transaction data partitioned by month/week/day), and for structured data there are different file formats available (especially columnar ones) which help to load only those columns that are required for processing. So the right file format, partitioning, and compaction are the key attributes for large files.
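A minimal sketch of that idea (table, path and column names are made up): write the data partitioned, then read back only the partitions and columns you need, so Spark never touches the rest of the files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Tiny stand-in for a large transactions table (hypothetical columns).
transactions = spark.createDataFrame(
    [("2021-01", "c1", 10.0), ("2021-02", "c2", 25.0)],
    ["month", "customer_id", "amount"])

# Write partitioned by month so queries can skip irrelevant partitions.
transactions.write.mode("overwrite").partitionBy("month").parquet("/tmp/sales")

# Read back only one month and two columns; partition pruning plus Parquet's
# columnar layout means most of the data is never read from disk.
subset = (spark.read.parquet("/tmp/sales")
          .where("month = '2021-01'")
          .select("customer_id", "amount"))
subset.show()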
3) Is there no way to process 50 GB of data with 8 GB RAM and 8 cores by setting the proper Spark configurations? If there is, what is the way and what should the Spark configurations be?
Very unlikely if the data is not partitioned, but there are ways. It also depends on what kind of file it is. You can create a custom streaming file reader that reads one logical block at a time and processes it (a sketch of that idea is shown below). In the enterprise, however, nobody reads 50 GB as one single unit; even loading a 10 GB Excel file on your machine via an Office tool will go out of memory.
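A rough sketch of the block-at-a-time reader in plain Python (the path and block size are made up):

# Read one logical block at a time so only that block is in memory.
def process_in_blocks(path, block_size=128 * 1024 * 1024):
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

for block in process_in_blocks("/data/big_file.bin"):
    pass  # replace with the actual per-block processing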
4) What would be the ideal Spark configuration for processing 8 GB of data if the system properties are 8 GB RAM and 8 cores?
Leave 1 core & 1-Gb or 2-GB for OS and use the rest of them for your processing. Now depends on what kind of transformation is being performed, you have to decide the memory for driver and worker processes. Your driver should have 2Gb of RAM. But laptop is primarily for the playground to explore the code syntax and not suitable for large data set. Better to build your logic with dataframe.sample() and then push the code to bigger machine to generate the output.

Apache Spark - how Spark reads large partitions from the source when there is not enough memory

Suppose my data source contains data in 5 partitions, each partition being 10 GB, so the total data size is 50 GB. My doubt is: when my Spark cluster doesn't have 50 GB of main memory, how does Spark handle out-of-memory exceptions, and what is the best practice to avoid these scenarios in Spark?
50 GB is an amount of data that can fit in memory, and you probably don't need Spark for this kind of data - it would run slower than other solutions.
Also, depending on the job and data format, a lot of the time not all the data needs to be read into memory (e.g. reading just the needed columns from a columnar storage format like Parquet).
Generally speaking - when the data can't fit in memory, Spark will write temporary files to disk. You may need to tune the job to use more, smaller partitions so that each individual partition fits in memory; see Spark Memory Tuning.
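A small sketch of both ideas (the path and partition count are illustrative only): read just the needed columns and split the work into more, smaller partitions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smaller-partitions").getOrCreate()

# Column pruning: only the selected columns are read from the Parquet files.
df = spark.read.parquet("hdfs:///data/source").select("id", "value")

# More, smaller partitions so each one fits comfortably in executor memory;
# spark.sql.shuffle.partitions controls the partition count after shuffles.
spark.conf.set("spark.sql.shuffle.partitions", "400")
df = df.repartition(400)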

Spark behavior on native file system

We are experimenting with running Spark in our project without Hadoop and without distributed storage like HDFS. Spark is installed on a single node with 10 cores and 16 GB RAM, and this node is not part of any cluster. Assume the Spark driver takes 2 cores and the rest are consumed by executors (2 each) at execution time.
If we process a big CSV file (of size 1 GB) stored on the local disk in Spark as an RDD and repartition it into 4 different partitions, will the executors process each partition in parallel?
What would the executors do if we don't repartition the RDD into 4 different partitions?
Do we lose the power of distributed computing and parallelism if we don't use HDFS?
Spark caps the maximum size of a partition at 2 GB, so you should be able to process the entire data set with minimal partitioning and a quicker processing time. You can set spark.executor.cores to 8 so as to utilise all your resources.
Ideally, you should set the number of partitions depending on the size of your data, and you are better off setting it to a multiple of your cores/executors.
To answer your question, setting the number of partitions to 4 in your case will probably result in each partition being sent to a different executor, so yes, each partition will be processed in parallel.
If you don't repartition, then Spark will do it for you depending on the data and split the load between the executors.
Spark works perfectly fine without Hadoop. You might see a negligible performance drop since your files are on the local filesystem and not on HDFS, but for a file of size 1GB it really doesn't matter.
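A rough sketch of the scenario being discussed, running in local mode on the single node (the file path and CSV options are made up):

from pyspark.sql import SparkSession

# Single node: keep 2 cores for the driver/OS, use 8 for tasks.
spark = (SparkSession.builder
         .master("local[8]")
         .appName("local-csv")
         .getOrCreate())

df = spark.read.csv("/data/big_file.csv", header=True, inferSchema=True)

# Repartition the ~1 GB file into 4 partitions so up to 4 tasks run in parallel.
df = df.repartition(4)
print(df.rdd.getNumPartitions())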

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset for three months, going up to around 17 GB combined. I have a server with 16 cores and 64 GB RAM with 1 TB of hard disk space. The transaction data is broken into 90 files, each having the same format, and a set of queries is to be run on this entire dataset; the query for each day's data is the same for all 90 files. The result after the query is run is appended, and then we get the resulting summary back. Before I start on my endeavour I was wondering if Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and ultimately I ran into out-of-memory issues.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there any way I can tell Spark to work in parallel on these 90 datasets?
2. Can I expect a significant speed improvement if I am not working with Hadoop?

Reducing the latency between Spark and HBase nodes

I am experiencing a high latency between Spark nodes and HBase nodes.
The current resources I have require me to run HBase and Spark on different servers.
The HFiles are compressed with the Snappy algorithm, which reduces the data size of each region from 50 GB to 10 GB.
Nevertheless, the data transferred on the wire is always decompressed, so reading takes a lot of time - approximately 20 MB per second, which is about 45 minutes for each 50 GB region.
What can I do to make data reading faster? (Or, is the current throughput considered high for HBase?)
I was thinking of cloning the HBase HFiles locally to the Spark machines instead of continuously requesting data from HBase. Is this possible?
What is the best practice for solving such an issue?
You are thinking in the right direction. You can copy the HFiles to the HDFS cluster (or machines) where Spark is running. That would save the decompression and reduce the data transferred over the wire. You would need to read the Snappy-compressed HFiles and write a parser to read them.
Alternatively, you can apply column and column-family filters if you don't need all the data from HBase.
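A rough sketch of the column / column-family filtering idea, using the happybase Thrift client purely for illustration (the host, table and column names are made up; the same restriction can be expressed with Scan.addColumn/addFamily in the Java client):

import happybase

# Hypothetical Thrift server and table.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("events")

# Ask HBase for only the columns (or whole families) you actually need,
# so less data is decompressed and shipped over the wire.
for row_key, row in table.scan(columns=[b"stats:count", b"meta"]):
    print(row_key, row)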
