Spark Window performance issues - apache-spark

I have a parquet dataframe, with the following structure:
ID String
DATE Date
480 other feature columns of type Double
I have to replace each of the 480 feature columns with their corresponding weighted moving averages, with a window of 250.
Initially, I am trying to do this for a single column, with the following simple code:
var data = sparkSession.read.parquet("s3://data-location")
var window = Window.rowsBetween(-250, Window.currentRow - 1).partitionBy("ID").orderBy("DATE")
data.withColumn("Feature_1", col("Feature_1").divide(avg("Feature_1").over(window))).write.parquet("s3://data-out")
The input data contains 20 million rows, and each ID has about 4,000-5,000 dates associated with it.
I have run this on an AWS EMR cluster (m4.xlarge instances), with the following results for one column:
4 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 16 concurrently running tasks), took 14 minutes
8 executors x 4 cores x 10 GB + 1 GB for YARN overhead (so 2.5 GB per task, 32 concurrently running tasks), took 8 minutes
I have tweaked the following settings, with the hope of bringing the total time down:
spark.memory.storageFraction 0.02
spark.sql.windowExec.buffer.in.memory.threshold 100000
spark.sql.constraintPropagation.enabled false
The second one helped prevent some spilling seen in the logs, but none of them helped with the actual performance.
I do not understand why it takes so long for just 20 million records. I know that computing the weighted moving average requires 20M x 250 (the window size) averages and divisions, but with 16 cores (first run) I don't see why it would take so long. I can't imagine how long it would take for the 479 remaining feature columns!
I have also tried increasing the default shuffle partitions, by setting:
spark.sql.shuffle.partitions 1000
but even with 1000 partitions, it didn't bring the time down.
I also tried sorting the data by ID and DATE before calling the window aggregations, without any benefit.
Is there any way to improve this, or do window functions generally run slowly for my use case? This is only 20M rows, nowhere near what Spark can process with other types of workloads.

Your dataset size is approximately 70 GB.
If I understood correctly, for each ID it is sorting all the records by date and then taking the preceding 250 records to compute the average. Since you need to apply this to more than 400 columns, I would recommend bucketing the data when the parquet files are created, to avoid the shuffle. Writing the bucketed parquet file takes a considerable amount of time up front, but deriving all 480 columns should then take far less than 8 minutes x 480 of execution time.
Please try bucketing, or repartition plus sortWithinPartitions, while creating the parquet file, and let me know if it works. A sketch of both options follows.
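As a rough illustration, here is a minimal PySpark sketch of both options (the question's code is Scala, but the same DataFrameWriter APIs exist there). The bucket count, table name, and output paths are placeholders to tune, and the single-pass select over all feature columns is an assumption about how the derivation would be applied:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
data = spark.read.parquet("s3://data-location")

# Option 1: write a bucketed, sorted table once (bucketBy requires saveAsTable).
(data.write.mode("overwrite")
     .bucketBy(200, "ID")                 # bucket count is a guess; tune it to your data
     .sortBy("ID", "DATE")
     .saveAsTable("features_bucketed"))   # hypothetical table name

# Option 2: repartition by ID and sort within partitions, then write plain parquet.
(data.repartition("ID")
     .sortWithinPartitions("ID", "DATE")
     .write.mode("overwrite")
     .parquet("s3://data-location-sorted"))  # hypothetical output path

# Then derive all 480 feature columns in a single pass instead of one job per column
# (here reading the bucketed table from option 1).
bucketed = spark.table("features_bucketed")
w = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-250, -1)
feature_cols = [c for c in bucketed.columns if c not in ("ID", "DATE")]
result = bucketed.select(
    "ID", "DATE",
    *[(F.col(c) / F.avg(c).over(w)).alias(c) for c in feature_cols]
)
result.write.mode("overwrite").parquet("s3://data-out")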

Related

Understanding sort in pyspark

I am reading two datasets of sizes 9.5GB (df1) and 715.1MB(df2) on disk.
I merge them on a key and then run a global aggregation on a resultant column
This triggers a shuffle and then the shuffled results are sort merged together on the reduce side (This happens internally)
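For reference, a minimal PySpark sketch of the pipeline described above; the paths, the join key, and the aggregated column are placeholders, not from the original post:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("/data/df1")   # ~9.5 GB on disk (placeholder path)
df2 = spark.read.parquet("/data/df2")   # ~715 MB on disk (placeholder path)

# Join on a key, then run a global aggregation on a resulting column; with AQE
# disabled and the default 200 shuffle partitions, this produces the shuffle and
# reduce-side sort-merge described above.
joined = df1.join(df2, on="key")                          # "key" is a placeholder
joined.agg(F.sum("some_column").alias("total")).show()    # placeholder column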
I was trying to get the sort stage to spill some data on the disk and for that I progressively reduced the executor sizes.
As I reduced the executor size, the stage started to consume more time and less memory, and the spill was minimized. The numbers to look at in the images below are (a) the "duration" right underneath the WholeStageCodegen and (b) the peak memory total in the "sort" box.
config("spark.executor.instances","6").
config("spark.executor.memory","6G").
config("spark.executormemoryOverhead","2G").
config("spark.memory.offHeap.size","2G").
config("spark.executor.pyspark.memory","2G")
config("spark.executor.instances","6").
config("spark.executor.memory","4G").
config("spark.executormemoryOverhead","2G").
config("spark.memory.offHeap.size","2G").
config("spark.executor.pyspark.memory","2G")
config("spark.executor.instances","6").
config("spark.executor.memory","2G").
config("spark.executormemoryOverhead","2G").
config("spark.memory.offHeap.size","2G").
config("spark.executor.pyspark.memory","2G")
I have spark.sql.adaptive.enabled set to False. I have not touched the shuffle partitions. It remains the default 200 throughout
Questions:
As you can see, decreasing the executor size did two things: it reduced the memory footprint (peak memory total) and increased the duration. What is happening under the hood? How is this brought about?
I see a duration mentioned right underneath the WholeStageCodegen. I used to think it was the runtime statistic of the entire Java whole-stage code generated for the stage. But here (a) there is just one duration given for the second run, and there are no statistics like min, med, max as there are for the first run. Why is that? Also, what is the difference between the sort time total mentioned inside the Sort box and the duration mentioned underneath the WholeStageCodegen?

How can I automate the process of running the same aggregation in 12 parquet files and then join the results in 1 table using PySpark?

I have to make 6 different calculations (sums and averages by day) in a parquet file that contains 1 year of data (day level). The problem is the file is too big and Jupyter crashes in the process. So I divided the file into 12 months (12 parquet files). I tested if the server would be able to make the calculations in 1 month of data in a reasonable time and it did. I want to avoid writing 72 different queries (6 calculations * 12 months). The result of each calculation would have to be saved in a parquet file and then joined in a final table. How would you recommend solving this by automating the process in PySpark? I would appreciate any suggestions. Thanks.
This is an example of the code I have to run in each of the 12 parts of the data:
from pyspark.sql.functions import avg

month1 = spark.read.parquet("s3://af/my_folder/month1.parquet")
month1.createOrReplaceTempView("month1")
month1sum = spark.sql("select id, date, sum(sessions) as sum_num_sessions from month1 group by 1, 2 order by 1 asc")
month1sum.write.mode("overwrite").parquet("s3://af/my_folder/month1sum.parquet")
month1sum.createOrReplaceTempView("month1sum")
month_1_calculation = month1sum.groupBy('date').agg(avg('sum_num_sessions').alias('avg_sessions'))
month_1_calculation.write.mode("overwrite").parquet("s3://af/my_folder/month_1_calculation.parquet")
Quick approach: how about a for loop?
from pyspark.sql.functions import avg

for i in range(1, 13):
    month = spark.read.parquet(f"s3://af/my_folder/month{i}.parquet")
    month.createOrReplaceTempView(f"month{i}")
    monthsum = spark.sql(f"select id, date, sum(sessions) as sum_num_sessions from month{i} group by 1, 2 order by 1 asc")
    monthsum.write.mode("overwrite").parquet(f"s3://af/my_folder/month{i}sum.parquet")
    monthsum.createOrReplaceTempView(f"month{i}sum")
    month_calculation = monthsum.groupBy('date').agg(avg('sum_num_sessions').alias('avg_sessions'))
    month_calculation.write.mode("overwrite").parquet(f"s3://af/my_folder/month_{i}_calculation.parquet")
Long-term approach: Spark is designed to handle big data, so no matter how big your data is, as long as you have sufficient hardware (number of cores and memory), Spark should be able to take care of it with the correct configuration. Adjusting your number of cores, executor memory, and driver memory, and improving parallelism (by changing the number of partitions), should solve your issue.
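For example, with enough resources the twelve monthly files can be read as one DataFrame and both calculations done in a single job. Here is a sketch that follows the question's month{i}.parquet naming; the combined output path is hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read all 12 monthly files into one DataFrame.
paths = [f"s3://af/my_folder/month{i}.parquet" for i in range(1, 13)]
df = spark.read.parquet(*paths)

# Calculation 1: sessions summed per id and day.
per_id_day = df.groupBy("id", "date").agg(F.sum("sessions").alias("sum_num_sessions"))

# Calculation 2: average of those per-id sums for each day.
daily_avg = per_id_day.groupBy("date").agg(F.avg("sum_num_sessions").alias("avg_sessions"))

daily_avg.write.mode("overwrite").parquet("s3://af/my_folder/year_calculation.parquet")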

How big can the Spark Streaming window be?

I have some data flows that need to be calculated. I am thinking about using Spark Streaming for this job, but there is one thing I am not sure about and worried about.
My requirements are as follows:
Data comes in as CSV files every 5 minutes. I need reports on the most recent 5 minutes, 1 hour, and 1 day of data. So if I set up a Spark Streaming job to do this calculation, I need a batch interval of 5 minutes, and I also need to set up two windows, 1 hour and 1 day.
Every 5 minutes, about 1 GB of data comes in, so the one-hour window will cover 12 GB (60/5) of data and the one-day window will cover 288 GB (24*60/5).
I do not have much experience with Spark, so this worries me.
Can Spark handle such a big window?
How much RAM do I need to process those 288 GB of data? More than 288 GB of RAM? (I know this may depend on disk I/O, CPU, and the calculation pattern, but I just want an estimate based on experience.)
If calculating over one day / one hour of data is too expensive in a stream, do you have any better suggestions?
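For reference, the windows described above would look roughly like this in PySpark Structured Streaming. This is only a sketch; the input path, schema, and console sink are placeholders:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

# CSV files arriving every 5 minutes (placeholder directory and schema).
schema = StructType([StructField("event_time", TimestampType()),
                     StructField("value", DoubleType())])
stream = spark.readStream.schema(schema).option("header", "true").csv("s3://incoming/csv/")

# 1-hour and 1-day windows, each sliding every 5 minutes.
hourly = stream.groupBy(F.window("event_time", "1 hour", "5 minutes")).agg(F.sum("value").alias("hourly_total"))
daily = stream.groupBy(F.window("event_time", "1 day", "5 minutes")).agg(F.sum("value").alias("daily_total"))

# Trigger a micro-batch every 5 minutes; the daily query would be started the same way.
query = (hourly.writeStream
               .outputMode("update")
               .format("console")
               .trigger(processingTime="5 minutes")
               .start())
query.awaitTermination()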

How Many Hive Dynamic Partitions are Needed?

I am running a large job that consolidates about 55 streams (tags) of samples (one sample per record) at irregular times over two years into 15-minute averages. There are about 1.1 billion records in 23k streams in the raw dataset, and these 55 streams make up about 33 million of those records.
I calculated a 15-minute index and am grouping by that to get the average value; however, I seem to have exceeded the max dynamic partitions on my Hive job in spite of cranking it way up to 20k. I can increase it further, I suppose, but it already takes a while to fail (about 6 hours, although I reduced that to 2 by reducing the number of streams to consider), and I don't actually know how to calculate how many I really need.
Here is the code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions=50000;
SET hive.exec.max.dynamic.partitions.pernode=20000;
DROP TABLE IF EXISTS sensor_part_qhr;
CREATE TABLE sensor_part_qhr (
tag STRING,
tag0 STRING,
tag1 STRING,
tagn_1 STRING,
tagn STRING,
timestamp STRING,
unixtime INT,
qqFr2013 INT,
quality INT,
count INT,
stdev double,
value double
)
PARTITIONED BY (bld STRING);
INSERT INTO TABLE sensor_part_qhr
PARTITION (bld)
SELECT tag,
min(tag),
min(tag0),
min(tag1),
min(tagn_1),
min(tagn),
min(timestamp),
min(unixtime),
qqFr2013,
min(quality),
count(value),
stddev_samp(value),
avg(value)
FROM sensor_part_subset
WHERE tag1='Energy'
GROUP BY tag,qqFr2013;
And here is the error message:
Error during job, obtaining debugging information...
Examining task ID: task_1442824943639_0044_m_000008 (and more) from job job_1442824943639_0044
Examining task ID: task_1442824943639_0044_r_000000 (and more) from job job_1442824943639_0044
Task with the most failures(4):
-----
Task ID:
task_1442824943639_0044_r_000000
URL:
http://headnodehost:9014/taskdetails.jsp?jobid=job_1442824943639_0044&tipid=task_1442824943639_0044_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException:
[Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions.
The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode.
Maximum was set to: 20000
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:747)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.startGroup(FileSinkOperator.java:829)
at org.apache.hadoop.hive.ql.exec.Operator.defaultStartGroup(Operator.java:498)
at org.apache.hadoop.hive.ql.exec.Operator.startGroup(Operator.java:521)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:232)
... 7 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 520 Reduce: 140 Cumulative CPU: 7409.394 sec HDFS Read: 0 HDFS Write: 393345977 SUCCESS
Job 1: Map: 9 Reduce: 1 Cumulative CPU: 87.201 sec HDFS Read: 393359417 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 days 2 hours 4 minutes 56 seconds 595 msec
Can anyone give some ideas as to how to calculate how many dynamic partitions I might need for a job like this?
Or maybe I should be doing this differently? I am running Hive 0.13, by the way, on Azure HDInsight.
Update:
Corrected some of the numbers above.
Reduced it to 3 streams operating on 211k records and it finally succeeded.
Started experimenting: reduced the partitions per node to 5k, and then 1k, and it still succeeded.
So I am not blocked anymore, but I am thinking I would have needed millions of partitions to do the whole dataset in one go (which is what I really wanted to do).
Dynamic partition columns must be specified last among the columns in the SELECT statement when inserting into sensor_part_qhr. In your INSERT the last expression is avg(value), so Hive uses that value as the dynamic partition key instead of bld, and it tries to create one partition per distinct average, which is why it blows past the 20k limit (note that the whole SELECT list is also shifted by one: tag appears twice at the start while bld is missing at the end). Put bld as the final expression in the SELECT, and the number of dynamic partitions needed drops to the number of distinct bld values, which should be far below the limits you have set.

Cassandra Stress Test results evaluation

I have been using the cassandra-stress tool to evaluate my cassandra cluster for quite some time now.
My problem is that I am not able to comprehend the results generated for my specific use case.
My schema looks something like this:
CREATE TABLE Table_test(
ID uuid,
Time timestamp,
Value double,
Date timestamp,
PRIMARY KEY ((ID,Date), Time)
) WITH COMPACT STORAGE;
I have parsed this information in a custom yaml file and used parameters n=10000, threads=100 and the rest are default options (cl=one, mode=native cql3, etc). The Cassandra cluster is a 3 node CentOS VM setup.
A few specifics of the custom yaml file are as follows:
insert:
  partitions: fixed(100)
  select: fixed(1)/2
  batchtype: UNLOGGED
columnspecs:
  - name: Time
    size: fixed(1000)
  - name: ID
    size: uniform(1..100)
  - name: Date
    size: uniform(1..10)
  - name: Value
    size: uniform(-100..100)
My observations so far are as follows:
With n=10000 and time: fixed(1000), the number of rows getting inserted is 10 million. (10000*1000=10000000)
The number of row-keys/partitions is 10000 (i.e. n), within which 100 partitions are taken at a time (which means 100 * 1000 = 100000 key-value pairs), out of which 50000 key-value pairs are processed at a time (because of select: fixed(1)/2, i.e. ~50%).
The output message also confirms the same:
Generating batches with [100..100] partitions and [50000..50000] rows (of[100000..100000] total rows in the partitions)
The results that I get are the following for consecutive runs with the same configuration as above:
Run Total_ops Op_rate Partition_rate Row_Rate Time
1 56 19 1885 943246 3.0
2 46 46 4648 2325498 1.0
3 27 30 2982 1489870 0.9
4 59 19 1932 966034 3.1
5 100 17 1730 865182 5.8
Now what I need to understand are as follows:
Which of these metrics is the throughput, i.e., the number of records inserted per second? Is it the Row_rate, Op_rate, or Partition_rate? If it's the Row_rate, can I safely conclude that I am able to insert close to 1 million records per second? Any thoughts on what the Op_rate and Partition_rate mean in this case?
Why does the Total_ops vary so drastically from run to run? Does the number of threads have anything to do with this variation? What can I conclude here about the stability of my Cassandra setup?
How do I determine the batch size per thread here? In my example, is the batch size 50000?
Thanks in advance.
Row Rate is the number of CQL rows you inserted into your database per second. For your table, a CQL row is a tuple like (ID uuid, Time timestamp, Value double, Date timestamp).
The Partition Rate is the number of partitions C* had to construct per second. A partition is the data structure that holds and orders data in Cassandra; data with the same partition key ends up on the same node. The partition rate is equal to the number of unique partition-key values inserted in the time window, which for your table means unique values of (ID, Date).
Op Rate is the number of actual CQL operations performed per second. With your settings it is running unlogged batches to insert the data, and each batch contains approximately 100 partitions (unique combinations of ID and Date), which is why Op Rate * 100 ~= Partition Rate.
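As a quick sanity check of these relationships, using run 1 from the results table above (op rate 19, 100 partitions per batch, and 50000 rows per batch from select: fixed(1)/2):

op_rate = 19                 # ops/s from run 1
print(op_rate * 100)         # ~1900, close to the reported partition rate of 1885
print(op_rate * 50000)       # ~950000, close to the reported row rate of 943246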
Total OP should include all operations, read and write. So if you have any read operations those would also be included.
I would suggest changing your batch size to match your workload, or keeping it at 1, depending on your actual database usage. This should provide a more realistic scenario. It is also important to run much longer than just 100 total operations to really get a sense of your system's capabilities; some of the biggest difficulties appear when the size of the dataset grows beyond the amount of RAM in the machine.
