Spark Cluster configuration - apache-spark

I'm using a Spark cluster of two nodes, each having two executors (each using 2 cores and 6GB memory).
Is this a good cluster configuration for a faster execution of my spark jobs?
I am fairly new to Spark and I am running a job on 80 million rows of data that includes shuffle-heavy tasks such as aggregations (count) and joins (a self-join on a DataFrame).
Bottlenecks:
My executors report insufficient resources while reading the data.
Even on a smaller dataset, the job takes a long time.
What should be my approach and how can I do away with my bottlenecks?
Any suggestion would be highly appreciated.
query = "(SELECT x, y, z FROM table) AS df"
jdbcDF = spark.read.format("jdbc") \
    .option("url", mysqlUrl) \
    .option("dbtable", query) \
    .option("user", mysqldetails[2]) \
    .option("password", mysqldetails[3]) \
    .option("numPartitions", "1000") \
    .load()
This gives me a DataFrame for which jdbcDF.rdd.getNumPartitions() returns 1. Am I missing something here? I think I am not parallelizing my dataset.

There are different ways to improve the performance of your application. Please find below some points that may help.
Try to reduce the number of records and columns you process. As you mentioned, you are new to Spark and you might not need all 80 million rows, so filter the rows down to whatever you require. Also, select only the columns you need rather than all of them.
If you use some data frequently, consider caching it so that subsequent operations read it from memory.
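For example, a minimal sketch of caching a frequently reused DataFrame (using the jdbcDF from the question):
jdbcDF.cache()                        # mark the DataFrame for caching
jdbcDF.count()                        # the first action materializes the cache
jdbcDF.groupBy("x").count().show()    # subsequent operations read from memory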
If you are joining two DataFrames and one of them is small enough to fit in memory, consider a broadcast join.
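A minimal sketch of a broadcast join in PySpark, assuming a hypothetical small lookup DataFrame smallDF that shares the column "x" with jdbcDF:
from pyspark.sql.functions import broadcast

# The broadcast hint ships the small DataFrame to every executor,
# so the large DataFrame does not need to be shuffled for the join.
joined = jdbcDF.join(broadcast(smallDF), on="x", how="inner")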
Increasing the resources might not improve the performance of your application in all cases, but looking at your cluster configuration, it should help. It might be a good idea to add some more resources and check the performance.
You can also use the Spark UI to monitor your application and check whether a few tasks are taking much longer than the others. If so, you probably need to deal with skew in your data.
You can also consider partitioning your data based on the columns you use in your filter criteria.
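As an illustration of the last point, a sketch of writing the data partitioned by a column used in filters (the column "y" and the output path are hypothetical):
# Writing the data partitioned by a filter column lets Spark prune
# whole directories on later reads instead of scanning everything.
jdbcDF.write.partitionBy("y").mode("overwrite").parquet("/data/table_partitioned")

# A later read that filters on "y" only touches the matching sub-directories.
filtered = spark.read.parquet("/data/table_partitioned").where("y = 'some_value'")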

Related

Spark-redis: dataframe writing times too slow

I am an Apache Spark/Redis user and recently tried spark-redis for a project. The program generates PySpark DataFrames with approximately 3 million rows, which I am writing to a Redis database using the command
df.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()
as suggested on the project's GitHub DataFrame page.
However, I am getting inconsistent write times for the same Spark cluster configuration (same number and type of EC2 instances). Sometimes it is very fast, sometimes too slow. Is there any way to speed up this process and get consistent write times? I wonder whether it is slow when there are already a lot of keys inside, but that should not be an issue for a hash table, should it?
This could be a problem with your partitioning strategy.
Check the number of partitions of "df" before writing and see whether there is a relation between the number of partitions and the execution time.
If so, partitioning "df" with a suitable partitioning strategy (re-partitioning to a fixed number of partitions, or re-partitioning based on a column value) should resolve the problem.
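A minimal sketch of that check and re-partitioning step (the target of 100 partitions is only an illustrative value):
# Inspect how the DataFrame is currently partitioned.
print(df.rdd.getNumPartitions())

# Re-partition to a fixed number of partitions (or by a column such as "name")
# before writing, so the Redis writes are spread evenly across tasks.
df.repartition(100) \
    .write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()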
Hope this helps.

What is the difference between num-executors, executor-cores, executor-memory in spark-submit and option(“numPartitions”, x) in spark.read? [duplicate]

This question already has answers here:
What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
(4 answers)
Closed 4 years ago.
I read an RDBMS table from a PostgreSQL DB as:
val dataDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()
The numPartitions option denotes the number of partitions your data is split into, each of which is then processed in parallel; in this case it is 10. I thought this was a cool option in Spark until I came across the awesome spark-submit options --num-executors, --executor-cores and --executor-memory. I read about these three spark-submit parameters from this link: here
What I don't understand is: if both are used for parallel processing, how do they differ from each other?
Could anyone let me know the difference between the above-mentioned options?
In read.jdbc(..numPartitions..), numPartitions is the number of partitions your data (Dataframe/Dataset) has. In other words, all subsequent operations on the read Dataframe will have a degree of parallelism equal to numPartitions. (This option also controls the number of parallel connections made to your JDBC source.)
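Note that for a JDBC read, numPartitions only produces multiple partitions when it is combined with partitionColumn, lowerBound and upperBound (the parameters in the linked duplicate); otherwise the data lands in a single partition. A sketch in PySpark, assuming a numeric column "id" with known bounds and a hypothetical query:
# Spark splits the id range [0, 1000000) into 10 stride-based queries,
# one per partition, and opens up to 10 parallel JDBC connections.
dataDF = spark.read.format("jdbc") \
    .option("url", connectionUrl) \
    .option("dbtable", "(SELECT * FROM some_table) AS year2017") \
    .option("user", devUserName) \
    .option("password", devPassword) \
    .option("partitionColumn", "id") \
    .option("lowerBound", 0) \
    .option("upperBound", 1000000) \
    .option("numPartitions", 10) \
    .load()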
To understand --num-executors, --executor-cores, --executor-memory, you should understand the concept of a Task. Every operation you perform on a Dataframe (or Dataset) converts to a Task on a partition of the Dataframe. Thus, one Task exists for each operation on each partition of the data.
Tasks execute on Executors. --num-executors controls the number of executors that Spark spawns; thus it controls the parallelism of your Tasks. The other two options, --executor-cores and --executor-memory, control the resources you provide to each executor. This depends, among other things, on the number of executors you wish to have on each machine.
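For example, a spark-submit invocation that allocates these resources manually (the numbers and the script name are only illustrative):
spark-submit \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 6g \
    my_job.py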
P.S: This assumes that you are manually allocating the resources. Spark is also capable of dynamic allocation.
For more information on this, you could use these links:
Spark Configuration
Related StackOverflow Post
EDITS:
The following statement has an important caveat:
all subsequent operations on the read Dataframe will have a degree of parallelism equal to numPartitions.
Operations such as joins and aggregations (which involve a shuffle), as well as operations such as union (which does not shuffle data), can change the number of partitions.
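A small PySpark illustration of that caveat (the post-shuffle number comes from spark.sql.shuffle.partitions, which defaults to 200):
df = spark.range(0, 1000000, 1, 10)       # explicitly request 10 partitions
print(df.rdd.getNumPartitions())          # 10
counted = df.groupBy((df.id % 7).alias("k")).count()
print(counted.rdd.getNumPartitions())     # typically 200 after the shuffle, not 10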

How to avoid writing empty json files in Spark [duplicate]

I am reading from a Kafka queue using Spark Structured Streaming. After reading from Kafka I apply a filter on the DataFrame and save the filtered DataFrame to parquet files. This generates many empty parquet files. Is there any way I can stop writing empty files?
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KafkaServer) \
    .option("subscribe", KafkaTopics) \
    .load()
Transaction_DF = df.selectExpr("CAST(value AS STRING)")
decompDF = Transaction_DF.select(zip_extract("value").alias("decompress"))
filterDF = decompDF.filter(.....)
query = filterDF.writeStream \
    .option("path", outputpath) \
    .option("checkpointLocation", RawXMLCheckpoint) \
    .start()
Is there any way I can stop writing an empty file?
Yes, but you would rather not do it.
The reason for many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to load a dataset (with records from Kafka per batch) and does this "poorly", i.e. many partitions have no data.
When you save a partition with no data you will get an empty file.
You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
Why would you not do it? repartition and coalesce may incur a performance penalty due to the extra step of shuffling data between partitions (and possibly between nodes in your Spark cluster). That can be expensive and not worth it (hence I said that you would rather not do it).
You may then be asking yourself how to know the right number of partitions. And that's a very good question in any Spark project. The answer is fairly simple (and obvious if you understand what Spark does and how): "Know your data" so you can calculate exactly how many partitions are right.
I recommend using repartition(partitioningColumns) on the DataFrame (or Dataset) and, after that, partitionBy(partitioningColumns) on the writeStream operation to avoid writing empty files (a sketch follows after the quoted documentation below).
Reason:
If you have a lot of data, the bottleneck is often Spark's read performance when there are many small (or even empty) files and no partitioning. So you should definitely make use of file/directory partitioning (which is not the same as RDD partitioning).
This is especially a problem when using AWS S3.
The partitionColumns should fit the common queries you run when reading the data, e.g. timestamp/day, message type/Kafka topic, ...
See also the partitionBy documentation on http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable to all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
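A minimal sketch of the repartition + partitionBy combination recommended above, assuming the stream carries a hypothetical "day" column derived from the Kafka timestamp:
query = filterDF \
    .repartition("day") \
    .writeStream \
    .format("parquet") \
    .option("path", outputpath) \
    .option("checkpointLocation", RawXMLCheckpoint) \
    .partitionBy("day") \
    .start()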
You can also try repartitionByRange(column). I used this while writing a DataFrame to HDFS, and it solved my empty-file issue.
If you are using YARN client mode, then setting the number of executor cores to 1 will solve the problem. This means that only one task will run at any time per executor.


What is an optimized way of joining large tables in Spark SQL

I need to join tables using Spark SQL or the DataFrame API. I would like to know what the optimized way of achieving this would be.
Scenario is:
All data is present in Hive in ORC format (Base Dataframe and Reference files).
I need to join one base file (DataFrame) read from Hive with 11-13 other reference files to create a big in-memory structure of 400 columns (around 1 TB in size).
What would be the best approach to achieve this? Please share your experience if you have encountered a similar problem.
My default advice on how to optimize joins is:
Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option.
Consider using a very large cluster (it's cheaper than you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6TB of RAM and many SSDs on the EC2 spot instance market. When thinking about the total cost of a big data solution, I find that humans tend to substantially undervalue their time.
Use the same partitioner. See this question for information on co-grouped joins.
If the data is huge and/or your cluster cannot grow enough, so that even (3) above leads to OOM, use a two-pass approach (a sketch follows below). First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent and that's a dangerous thing. I use SaveMode.Overwrite deep into the subtree of a partitioned table tree structure. Prior to 2.0.0 and 1.6.2 you'll have to delete _SUCCESS or metadata files or dynamic partition discovery will choke.
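A rough PySpark sketch of the two-pass approach in (4), where base and ref stand for the base and one reference DataFrame from the question, and the paths, the partitioning column "bucket" and the join key are hypothetical; per the side note, each sub-result overwrites its own partition subtree rather than using SaveMode.Append:
# Pass 1: persist both sides partitioned by the same column.
base.write.partitionBy("bucket").mode("overwrite").parquet("/warehouse/base")
ref.write.partitionBy("bucket").mode("overwrite").parquet("/warehouse/ref")

# Pass 2: join one sub-partition at a time, writing each result
# into its own subtree of the final partitioned table.
buckets = [r.bucket for r in base.select("bucket").distinct().collect()]
for b in buckets:
    left = spark.read.parquet("/warehouse/base").where(f"bucket = '{b}'")
    right = spark.read.parquet("/warehouse/ref").where(f"bucket = '{b}'")
    left.join(right, ["join_key", "bucket"]) \
        .drop("bucket") \
        .write.mode("overwrite") \
        .parquet(f"/warehouse/result/bucket={b}")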
Hope this helps.
Spark uses sort-merge joins to join large tables. A sort-merge join consists of hashing each row on both tables and shuffling the rows with the same hash into the same partition; there, the keys are sorted on both sides and the sort-merge algorithm is applied. That's the best approach as far as I know.
To drastically speed up your sort-merge joins, write your large datasets as Hive tables with the pre-bucketing and pre-sorting options (using the same number of buckets) instead of as flat parquet datasets.
tableA
  .repartition(2200, $"A", $"B")
  .write
  .bucketBy(2200, "A", "B")
  .sortBy("A", "B")
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("my_db.table_a")

tableB
  .repartition(2200, $"A", $"B")
  .write
  .bucketBy(2200, "A", "B")
  .sortBy("A", "B")
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("my_db.table_b")
The overhead cost of writing pre-bucketed/pre-sorted table is modest compared to the benefits.
The underlying dataset will still be parquet by default, but the Hive metastore (which can be the Glue metastore on AWS) will contain precious information about how the table is structured. Because all possible "joinable" rows are colocated, Spark won't shuffle tables that are pre-bucketed (big savings!) and won't sort the rows within the partitions of tables that are pre-sorted.
val joined = tableA.join(tableB, Seq("A", "B"))
Look at the execution plan with and without pre-bucketing.
This will not only save you a lot of time during your joins, it will also make it possible to run very large joins on a relatively small cluster without OOM. At Amazon, we use this in prod most of the time (there are still a few cases where it is not required).
To know more about pre-bucketing/pre-sorting:
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
https://data-flair.training/blogs/bucketing-in-hive/
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
https://databricks.com/session/hive-bucketing-in-apache-spark
Partition the source using hash partitioning or range partitioning, or write a custom partitioner if you know the joining fields well. Partitioning will help to avoid repartitioning during joins, since data from the same partition across tables will exist in the same location.
ORC will definitely help the cause.
If this is still causing spill, try using Tachyon, which will be faster than disk.
