Spark Cassandra connector control number of reads per sec - apache-spark

I am running a Spark application which performs a direct join on a Cassandra table.
I am trying to control the number of reads per second so that the long-running job doesn't impact the overall database.
Here are my configuration parameters:
--conf spark.cassandra.concurrent.reads=2
--conf spark.cassandra.input.readsPerSec=2
--conf spark.executor.cores=1
--conf spark.executor.instances=1
--conf spark.cassandra.input.fetch.sizeInRows=1500
I know I won't read more than 1500 rows from each partition.
However, in spite of all these thresholds, the reads per second reach 200-300.
Is there any other flag or configuration that needs to be turned on?

It seems that CassandraJoinRDD has a bug in throttling with spark.cassandra.input.readsPerSec; see https://datastax-oss.atlassian.net/browse/SPARKC-627 for details.
In the meantime, use spark.cassandra.input.throughputMBPerSec to throttle your join. Note that the throttling is based on the RateLimiter class, so it won't kick in immediately (you need to read at least throughputMBPerSec of data before throttling starts). This is something that may be improved in the SCC.
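For illustration, a minimal sketch (Scala) of setting the throughput-based throttle when building the session; the app name and the 1 MB/s value are placeholders, not recommendations:

import org.apache.spark.sql.SparkSession

// Throttle the Cassandra read side by throughput rather than reads per second.
val spark = SparkSession.builder()
  .appName("throttled-cassandra-join")                      // placeholder name
  .config("spark.cassandra.input.throughputMBPerSec", "1")  // placeholder value
  .getOrCreate()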

Related

Multiple Spark Applications at same time , same Jarfile... Jobs are in waiting status

Spark/Scala noob here.
I am running Spark in a clustered environment. I have two very similar apps (each with a unique Spark config and context). When I try to kick them both off, the first seems to grab all resources and the second waits for resources. I am setting resources on submit, but it doesn't seem to matter. Each node has 24 cores and 45 GB of memory available for use. Here are the two commands I am using to submit the jobs that I want to run in parallel.
./bin/spark-submit --master spark://MASTER:6066 --class MainAggregator --conf spark.driver.memory=10g --conf spark.executor.memory=10g --executor-cores 3 --num-executors 5 sparkapp_2.11-0.1.jar -new
./bin/spark-submit --master spark://MASTER:6066 --class BackAggregator --conf spark.driver.memory=5g --conf spark.executor.memory=5g --executor-cores 3 --num-executors 5 sparkapp_2.11-0.1.jar 01/22/2020 01/23/2020
I should also note that the second app does kick off, but in the master monitoring web page I see it as "Waiting" and it has 0 cores until the first is done. The apps pull from the same tables, but the data chunks they pull are very different, so the RDDs/DataFrames are unique, if that makes a difference.
What am I missing in order to run these at the same time?
second App does kick off but in the master monitoring webpage I see it
as "Waiting" and it will have 0 cores until the first is done.
I encountered the same thing some time back. There are two likely causes:
1) You don't have enough infrastructure.
2) You might be using the capacity scheduler, which doesn't have a preemption mechanism to accommodate new jobs until the running one finishes.
If it is #1, then you have to add more nodes and allocate more resources in your spark-submit.
If it is #2, then you can adopt the Hadoop fair scheduler, where you can maintain two pools (see the Spark documentation on this). The advantage is that you can run parallel jobs; the fair scheduler takes care of pre-empting some resources and allocating them to the other job running in parallel:
mainpool for the first Spark job.
backlogpool to run your second Spark job.
To achieve this you need an XML file with a pool configuration like the following sample:
<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="mainpool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
  <pool name="backlogpool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
    <minShare>3</minShare>
  </pool>
</allocations>
Along with that, you need to make some minor changes in the driver code, such as specifying which pool the first job should go to and which pool the second job should go to, as sketched below.
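As a hedged sketch of those driver-side changes (Scala), assuming the pool XML above is saved as fairscheduler.xml; the file path and app name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MainAggregator")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // the pool XML above
  .getOrCreate()

// Jobs submitted from this thread run in "mainpool"; the second app would set "backlogpool" instead.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "mainpool")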
For more details on how it works, see my articles:
hadoop-yarn-fair-schedular-advantages-explained-part1
hadoop-yarn-fair-schedular-advantages-explained-part2
Try these ideas to overcome the waiting. Hope this helps.

Spark : Understanding Dynamic Allocation

I have launched a Spark job with the following configuration:
--master yarn --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=19 --conf spark.dynamicAllocation.minExecutors=0
It works well and finishes successfully, but after checking the Spark history UI, this is what I saw:
My questions are (I'm more interested in understanding than in solutions):
Why does Spark request the last executor if it has no tasks to do?
How can I optimise the cluster resources requested by my job in dynamic allocation mode?
I'm using Spark 2.3.0 on YARN.
You need to respect the two requirements for using Spark dynamic allocation:
spark.dynamicAllocation.enabled
spark.shuffle.service.enabled => The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files.
The resources are adjusted dynamically based on the workload; the application gives resources back when it no longer needs them.
I am not sure there is a fixed order; executors are simply requested in rounds, exponentially, i.e. an application will add 1 executor in the first round, and then 2, 4, 8 and so on...
Configuring external shuffle service
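A minimal sketch (Scala) with both requirements set together; the executor bounds are just the values from the question and purely illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")            // placeholder name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")  // the external shuffle service must also be running on each node
  .config("spark.dynamicAllocation.minExecutors", "0")
  .config("spark.dynamicAllocation.maxExecutors", "19")
  .getOrCreate()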
It's difficult to know what Spark did there without knowing the content of the job you submitted. Unfortunately the configuration string you provided does not say much about what Spark will actually perform upon job submission.
You will likely get a better understanding of what happened during a task by looking at the 'SQL' part of the history UI (right side of the top bar) as well as at the stdout logs.
Generally one of the better places to read about how Spark works is the official page: https://spark.apache.org/docs/latest/cluster-overview.html
Happy sparking ;)
It's because of the allocation policy:
Additionally, the number of executors requested in each round
increases exponentially from the previous round.
reference

Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the spark shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops down to ~0 (I have enough resources) and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and looking for possible ways to avoid it. The drop and spike happen only in the read stage; all the intermediate stages run in parallel without such huge spikes.
I'll be happy to provide any missing information.

Spark tasks one more than number of partitions

I am trying to do a simple count and group by on a Spark dataset.
However, each time one of the stages gets stuck like (200/201, 1 running).
I have retried with several partition counts ranging from 1000 to 6000. Each time I am stuck in a stage showing (1000/1001, 1 running) or (6000/6001, 1 running) in the status bar.
Kindly help me understand where this extra 1 task is getting spawned from.
The spark-submit options are as below :
--conf spark.dynamicAllocation.enabled=false --conf spark.kryoserializer.buffer.max=2000m --conf spark.shuffle.service.enabled=true --conf spark.yarn.executor.memoryOverhead=4000 --conf spark.default.parallelism=3000 --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=6000 --conf spark.driver.memory=30g --conf spark.yarn.maxAppAttempts=1 --conf spark.driver.cores=6 --num-executors 80 --executor-cores 5 --executor-memory 40g
The number of Spark shuffle partitions is huge. Spark writes a file to disk for each shuffle partition, which can take a lot of time with such a large number of partitions and shuffle partitions. You could try reducing both the default parallelism and the number of shuffle partitions.
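As an illustration of that last suggestion, spark.sql.shuffle.partitions can be lowered at runtime (the value below is a placeholder), while spark.default.parallelism has to be set at submit time via --conf:

// Reduce the number of partitions used for subsequent shuffles (illustrative value).
spark.conf.set("spark.sql.shuffle.partitions", "2000")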
It's hard to know without seeing your specific spark code and the input format, but the first thing I would look into is data skew in your input data.
If one task is consistently taking longer to complete it's probably because it is significantly larger than the others. This will happen during a shuffle if one key in your data that you are grouping by shows up way more frequently than others since they will all end up in the same shuffled partition.
That being said, if you are literally just doing df.groupBy("key").count then Spark won't need to shuffle the values, just the intermediate sums for each key. That's why it would be helpful to see your specific code.
Another consideration is that your input format and data will define the number of initial partitions, not your spark parallelism settings. For example if you have 10 gzip'ed text files, you will only ever be able to have 10 input partitions. It sounds like the stage you are seeing get stuck is changing task counts with your setting changes though, so I'm assuming it's not the first stage.
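To check for the data skew described above, a quick hedged sketch (df stands for the input DataFrame and "key" for the actual grouping column, both placeholders):

import org.apache.spark.sql.functions.desc

// Count rows per grouping key and show the heaviest keys; a few dominant keys indicate skew.
df.groupBy("key").count().orderBy(desc("count")).show(20)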

Cassandra inconsistencies with batch updates

I'm using Apache Spark with the Spark Cassandra Connector to write millions of rows to a Cassandra cluster. The replication factor is set to 3, and I set the write consistency to ALL in spark-submit (YARN client mode) using the following options:
spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
I then wrote another Spark job to count the data I had written. I set up the consistency of the new job as follows:
spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
From the documentation, if the write consistency plus the read consistency is greater than the replication factor, I should have consistent reads (here ALL = 3 plus ONE = 1 gives 4, which is greater than the replication factor of 3).
But I'm getting the following results:
The read job gives me a different result (count) every time I run it.
If I increase the consistency level of the read job, I get the expected results.
What am I missing? Is there any secret configuration that is set up by default (e.g. in case of issues during the write, decrease the consistency level, or something like that...), or am I using a buggy version of Cassandra (it's 2.1.2), or are there issues with the batch updates that spark-cassandra-connector uses for saving data to Cassandra (I'm simply using the "saveToCassandra" method)?
What's going wrong?
I confirm that this is a bug in the connector. The consistency level is being set on individual prepared statements but is simply ignored when batch statements are used. Follow the updates on the connector - the fix is going to be included in the next bug-fix release.
