I am writing an algorithm for which I want to add new nodes in the topology after every 1 minute for 5 minutes. Initially the topology contains 5 nodes, so after 5 minutes it should have total 10 nodes. How can I implement this in the simulation script. Which behaviour will be best suited to do this ?
I have the next SLA for my big data project: regardless amount of concurrent spark tasks, the max execution time shouldn't exceed 5 minutes. For example: there are 10 spark concurrent tasks,
the slowest task shall take < 5 mins, as the number of tasks increases I have to be sure that this time wouldn't exceed 5 mins. Using the usual autoscaling is not appropriate here because adding new nodes takes a couple of minutes and it doesn't solve a problem with the exponential growth of tasks (e.g. skew from 10 concurrent tasks to 30 concurrent tasks).
I came across the idea to spin up a new cluster on demand so meet the SLA requirements. Let's say, I found the max number of the concurrent tasks (all of them are almost equal and take the same resources) which can be executed simultaneously on my cluster within 5 mins e.g. - 30 tasks. when the number of tasks approaches the threshold, the new cluster will be spinned up. The idea of this pattern is to struggle slowness during autoscaling and meet the SLA.
My question is next: are there any alternative options to my pattern except autoscaling on a single cluster (it's not suitable for my usecase because of slowness of my spark provider).
I have a parquet dataframe, with the following structure:
ID String
DATE Date
480 other feature columns of type Double
I have to replace each of the 480 feature columns with their corresponding weighted moving averages, with a window of 250.
Initially, I am trying to do this for a single column, with the following simple code:
var data = sparkSession.read.parquet("s3://data-location")
var window = Window.rowsBetween(-250, Window.currentRow - 1).partitionBy("ID").orderBy("DATE")
data.withColumn("Feature_1", col("Feature_1").divide(avg("Feature_1").over(window))).write.parquet("s3://data-out")
The input data contains 20 Million rows, and each ID has about 4-5000 dates associated.
I have run this on an AWS EMR cluster(m4.xlarge instances), with the following results for one column:
4 executors X 4 cores X 10 GB + 1 GB for yarn overhead (so 2.5GB per task, 16 concurrent running tasks) , took 14 min
8 executors X 4 cores X 10GB + 1 GB for yarn overhead (so 2.5GB per task, 32 concurrent running tasks), took 8 minutes
I have tweaked the following settings, with the hope of bringing the total time down:
spark.memory.storageFraction 0.02
spark.sql.windowExec.buffer.in.memory.threshold 100000
spark.sql.constraintPropagation.enabled false
The second one helped prevent some spilling seen in the logs, but none helped with the actual performance.
I do not understand why it takes so long for just 20 Million records. I know that for computing weighted moving average, it needs to do 20 M X 250 (the window size) averages and divisions, but with 16 cores (first run) I don't see why it would take so long. I can't imagine how long it would take for the rest of the 479 remaining feature columns!
I have also tried increasing the default shuffle paritions, by setting:
spark.sql.shuffle.partitions 1000
but even with 1000 partitions, it didn't bring the time down.
Also tried sorting the data by ID and DATE before calling the window aggregations, without any benefit.
Is there any way to improve this, or window functions generally run slow with my usecase? This is 20M rows only, nowhere near what spark can process with other types of workload..
Your dataset size is approximately 70 GB.
if i understood it correctly for each id it is sorting on date for all the records and then taking the preceding 250 records to do average. As you need to apply this on more than 400 columns, i would recommend trying bucketing while parquet creation to avoid shuffling. it takes considerable amount of time for writing the bucketted parquet file but for all 480 columns derivation that may not take 8 minutes *480 executing time.
please try bucketing or repartition and sortwithin while creating parquet file and let me know if it works.
I have some data flows need to be calculated. I am thinking about use spark stream to do this job. But there is one thing I am not sure and feel worry about.
My requirements is like :
Data comes in as CSV files every 5 minutes. I need report on data of recent 5 minutes, 1 hour and 1 day. So If I setup a spark stream to do this calculation. I need a interval as 5 minutes. Also I need to setup two window 1 hour and 1 day.
Every 5 minutes there will be 1GB data comes in. So the one hour window will calculate 12GB (60/5) data and the one day window will calculate 288GB(24*60/5) data.
I do not have much experience on spark. So this worries me.
Can spark handle such big window ?
How much RAM do I need to calculation those 288 GB data? More than 288 GB RAM? (I know this may depend on my disk I/O, CPU and the calculation pattern. But I just want some estimated answer based on experience)
If calculation on one day / one hour data is too expensive in stream. Do you have any better suggestion?
Is it anyhow possible to get the count, how often DynamoDB throughput (write units/read units) was downscaled within the last 24 hours?
My idea is to downscale as soon as an hugo drop e.g. 50% in the needed provisioned write units occur. I have really peaky traffic. Thus it is interessting to me to downscale after every peak. However I have a analytics jobs running at night which is provisioning a huge amount of read units making it necessary to be able to downscale after it. Thus I need to limit downscales to 3 times within 24 hours.
The number of decreases is returned in a DescribeTable result as part of the ProvisionedThroughputDescription.