Apache Ignite takes forever to save values on Spark - apache-spark

I am using Apache Ignite with Spark to save results from Spark. However, when I execute saveValues, it takes a very long time and the computer's CPU and fan speed go insane. I have a 3 GHz CPU and 16 GB of memory.
I have an RDD that I build by mapping over the final DataFrame:
val visitsAggregatedRdd :RDD[VisitorsSchema] = aggregatedVenuesDf.rdd.map(....)
println("COUNT: " + visitsAggregatedRdd.count().toString())
visitsCache.saveValues(visitsAggregatedRdd)
The total row count is 71, which means Spark has already finished processing the data and it is quite small: 71 rows, each a small object with a few numbers and very short strings. So why is 'visitsCache.saveValues' taking this seemingly infinite amount of time and processing?

It turned out to be a problem with the Spark DataFrame's partitions. Spark was saving those 71 rows across 6000 partitions! A simple solution is to reduce the number of partitions before saving to Ignite:
df = df.coalesce(1)
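For reference, a minimal sketch of the same fix applied directly to the RDD pipeline from the question; aggregatedVenuesDf, visitsCache and VisitorsSchema come from the snippet above, while toVisitorsSchema is a hypothetical stand-in for the original map logic:
import org.apache.spark.rdd.RDD
// Collapse to a single partition first, so saveValues runs one task instead of ~6000 near-empty ones
val visitsAggregatedRdd: RDD[VisitorsSchema] =
  aggregatedVenuesDf.coalesce(1).rdd.map(row => toVisitorsSchema(row)) // toVisitorsSchema: hypothetical mapping helper
visitsCache.saveValues(visitsAggregatedRdd)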

Related

Spark Broadcast results in increased size of dataframe

I have a dataframe with one integer column and 1B rows. So ideally, the size of the dataframe should be 1B * 4 bytes ~= 4 GB. This checks out when I cache the dataframe and look at its size: it is around 4 GB.
Now, if I try to broadcast the same dataframe to join with another dataframe, I get an error: Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
Why does the size of a broadcast dataframe increase? I have seen this in other cases as well, where a 300 MB dataframe shows up as a 3 GB broadcast dataframe in the Spark UI SQL tab.
Any reasoning or help is appreciated.
The size increases in memory if the dataframe is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark needs to copy the dataframe to every worker to handle your subsequent operations.
Do not broadcast big dataframes; broadcast only small ones for use in join operations.
As per the link:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
It seems to be more of a bug, according to this post; see https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321
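As a hedged sketch of that advice (largeDf, smallDf and the join key "id" are hypothetical, and spark is an existing SparkSession): keep the automatic broadcast threshold modest and only hint-broadcast the small side of a join.
import org.apache.spark.sql.functions.broadcast
// Cap automatic broadcast joins at 10 MB (the default); "-1" disables them entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
// Explicitly broadcast only the small dataframe
val joined = largeDf.join(broadcast(smallDf), Seq("id"))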

Why is spark dataframe repartition faster than coalesce when reducing number of partitions?

I have a df with 100 partitions, and before saving to HDFS as .parquet I want to reduce the number of partitions because the parquet files would be too small (<1MB).
I've added coalesce before writing:
df.coalesce(3).write.mode("append").parquet(OUTPUT_LOC)
It works but slows down the process from 2-3s per file to 10-20s per file.
When I try repartition:
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
The process does not slow down at all, 2-3s per file.
Why? Shouldn't coalesce always be faster when reducing the number of partitions because it avoids a full shuffle?
Background:
I'm importing files from local storage to spark cluster and saving the resulting dataframes as a parquet file. Each file is approx 100-200MB.
Files are located on the "spark-driver" machine, I'm running spark-submit in client deploy mode.
I'm reading files one by one in the driver:
data = read_lines(file_name)
rdd = sc.parallelize(data,100)
rdd2 = rdd.flatMap(lambda j: myfunc(j))
df = rdd2.toDF(mySchema)
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
Spark version is 3.1.1
The Spark/HDFS cluster has 5 workers with 8 CPUs and 32 GB RAM each.
Each executor has 4 cores and 15 GB RAM, which makes 10 executors in total.
EDIT:
When I use coalesce(1) I get a 'spark.rpc.message.maxSize limit breached' error, but not when I use repartition(1). Could that be a clue?
Attaching DAG visualizations... it looks like the WholeStageCodegen part is taking too long in the coalesce DAGs?
This can happen if your data is not evenly distributed. When you call coalesce, Spark tries to reduce the number of partitions by combining small partitions in order to avoid a full shuffle, but there can still be data skew, and a single heavy partition ends up taking most of the time.
When you repartition, the data gets distributed almost evenly across all partitions because it does a full shuffle, so all the tasks finish at roughly the same time.
You could use the Spark UI to see what is happening at the task level during the coalesce run and whether any single task runs for a long time.
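A small Scala sketch (the equivalent PySpark calls exist as well) for checking that skew directly, using the df from the question; it prints the per-partition row counts produced by each approach:
// Rows per partition after coalesce vs. repartition; a lopsided distribution
// after coalesce(3) would explain a single long-running task.
val afterCoalesce = df.coalesce(3).rdd.mapPartitions(it => Iterator(it.size)).collect()
val afterRepartition = df.repartition(3).rdd.mapPartitions(it => Iterator(it.size)).collect()
println(s"coalesce(3):    ${afterCoalesce.mkString(", ")}")
println(s"repartition(3): ${afterRepartition.mkString(", ")}")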

Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It involves joining 2 tables, say x and y (approx. 18 MB and 1.5 GB respectively), and loading the output of the join into a final table.
Following are the facts about the environment:
For table x:
Data size: 18 MB
Number of files in a partition: ~191
file type: parquet
For table y:
Data size: 1.5 GB
Number of files in a partition: ~3200
file type: parquet
Now the problem is:
Hive and Spark are giving the same performance (the time taken is the same).
I tried different combinations of resources for the Spark job.
e.g.:
executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5
All three combinations give the same performance. I am not sure what I am missing here.
I also tried broadcasting the small table 'x' to avoid a shuffle during the join, but there was not much improvement in performance.
One key observation is:
70% of the execution time is consumed reading the big table 'y', and I guess this is due to the large number of files per partition.
I am not sure how Hive is giving the same performance.
Kindly suggest.
I assume you are comparing Hive on MR vs. Spark; please let me know if that is not the case, because Hive (on Tez or Spark) vs. Spark SQL will not differ vastly in terms of performance.
I think the main issue is that there are too many small files.
A lot of CPU and time is consumed by the I/O itself, so you never get to see the processing power of Spark.
My advice is to coalesce the Spark dataframes immediately after reading the parquet files: coalesce the 'x' dataframe into a single partition and the 'y' dataframe into 6-7 partitions.
After doing the above, perform the join (a broadcast hash join).
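A sketch of what that answer describes, with hypothetical paths and join key, assuming spark is an existing SparkSession; x is the 18 MB table and y the 1.5 GB table:
import org.apache.spark.sql.functions.broadcast
val x = spark.read.parquet("/warehouse/x").coalesce(1)   // many tiny files -> one partition
val y = spark.read.parquet("/warehouse/y").coalesce(7)   // ~3200 files -> a handful of partitions
// Broadcast the small side so the join avoids shuffling the 1.5 GB table
val joined = y.join(broadcast(x), Seq("join_key"))
joined.write.mode("overwrite").parquet("/warehouse/final")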

Data Processing in Parallel using Apache Spark with Pyspark

I have a daily-level transaction dataset covering three months, adding up to around 17 GB combined. I have a server with 16 cores and 64 GB RAM, with 1 TB of hard-disk space. The transaction data is broken into 90 files, each with the same format, and there is a set of queries to be run over the entire dataset; the query for each daily-level file is the same for all 90 files. The result of each query run is appended, and then we get the resulting summary back. Before I start on my endeavour, I was wondering whether Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and ultimately I ran into an out-of-memory issue.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way I can tell Spark to work in parallel on these 90 datasets (see the sketch just below)?
2. Can I expect a significant speed improvement even though I am not working with Hadoop?
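A rough sketch in Scala (the PySpark equivalents are line-for-line the same), assuming spark is an existing SparkSession; the path, the header option and the aggregation are assumptions, not part of the question:
// Point Spark at all 90 files at once; it schedules tasks across the files in parallel,
// rather than looping over the files one at a time.
val daily = spark.read.option("header", "true").csv("/data/transactions/*.csv") // hypothetical location of the 90 files
val summary = daily.groupBy("day").count()   // stand-in for the real set of queries
summary.write.mode("append").parquet("/data/summary")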

increasing number of partitions in spark

I was using Hive to execute SQL queries on a project. I used ORC with a 50k stride for my data and created the Hive ORC tables with this configuration, using a certain date column as the partition.
Now I wanted to use Spark SQL to benchmark the same queries operating on the same data.
I executed the following query
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable where date_key=somedatkye group by col1,col2,col3")
In Hive this query takes 90 seconds, but Spark takes 21 minutes for the same query. Looking at the job, I found the issue: Spark creates 2 stages, and the first stage has only 7 tasks, one for each of the 7 blocks of data within the given partition of the ORC file. The blocks are of different sizes, one 5 MB and another 45 MB, so stragglers take longer and the whole job ends up taking too much time.
How do I mitigate this issue in Spark? How do I manually increase the number of partitions, and thereby the number of tasks in stage 1, even though there are only 7 physical blocks for the given range of the query?
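A hedged sketch of one common knob, assuming Spark 2.0+ and a file-based ORC read (whether it applies to a Hive-serde table scan depends on how the table is read):
// Smaller read splits mean more tasks in the scan stage, even with only a handful of physical blocks
sqlContext.setConf("spark.sql.files.maxPartitionBytes", (8 * 1024 * 1024).toString) // 8 MB splits
val q1 = sqlContext.sql("select col1,col2,col3,sum(col4),sum(col5) from mytable " +
  "where date_key=somedatkye group by col1,col2,col3")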
