Cassandra inconsistencies with batch updates

I'm using Apache Spark with Spark Cassandra Connector to write millions of rows to a Cassandra cluster. Replication factor is set to 3 and I set the write consistency to ALL in spark-submit (YARN client mode) using the following options:
spark-submit ...
--conf spark.cassandra.output.consistency.level=ALL \
--conf spark.cassandra.output.concurrent.writes=1 \
--conf spark.cassandra.output.batch.size.bytes=20000 \
...
I then wrote another Spark job to count the rows I had written. I set the consistency of the new job as follows:
spark-submit ...
--conf spark.cassandra.input.consistency.level=ONE \
--conf spark.cassandra.input.split.size=50000 \
...
From the documentation, if the write consistency level plus the read consistency level is greater than the replication factor, reads should be consistent. In this case, writes at ALL (3 replicas) plus reads at ONE (1 replica) give 3 + 1 = 4 > RF = 3.
But I'm getting the following results:
The read job gives me a different result (count) every time I run it.
If I increase the consistency level of the read job, I get the expected result.
What am I missing? Is there some configuration that is applied by default (e.g. lowering the consistency level if something goes wrong during the write, or something like that)? Am I using a buggy version of Cassandra (it's 2.1.2)? Or is there an issue with the batch updates that spark-cassandra-connector uses for saving data to Cassandra (I'm simply using the "saveToCassandra" method)?
What's going wrong?

I confirm that this is a bug in the connector. The consistency level is set on the individual prepared statements but is simply ignored when batch statements are used. Follow the updates on the connector; the fix is going to be included in the next bug-fix release.
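For reference, a minimal sketch of setting the consistency levels programmatically through the connector's WriteConf / ReadConf instead of spark-submit flags (keyspace, table, column names and the rdd value are placeholders, not the original code; because of the bug above, batch writes ignored this setting at the time):

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.driver.core.ConsistencyLevel

// rdd is an RDD of tuples or case classes matching the table schema (placeholder)
// Write with consistency ALL
rdd.saveToCassandra("my_keyspace", "my_table",
  SomeColumns("id", "value"),
  writeConf = WriteConf(consistencyLevel = ConsistencyLevel.ALL))

// Read back with consistency ONE
val count = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(consistencyLevel = ConsistencyLevel.ONE))
  .count()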

Related

Spark Cassandra connector control number of reads per sec

I am running a Spark application which performs a direct join on a Cassandra table.
I am trying to control the number of reads per second, so that the long-running job doesn't impact the overall database.
Here are my configuration parameters:
--conf spark.cassandra.concurrent.reads=2
--conf spark.cassandra.input.readsPerSec=2
--conf spark.executor.cores=1
--conf spark.executor.instances=1
--conf spark.cassandra.input.fetch.sizeInRows=1500
I know I won't read more than 1500 rows from each partition.
However, in spite of all these thresholds, reads per second are reaching 200-300.
Is there any other flag or configuration that needs to be turned on?
It seems that CassandraJoinRDD has a bug in throttling with spark.cassandra.input.readsPerSec; see https://datastax-oss.atlassian.net/browse/SPARKC-627 for details.
In the meantime, use spark.cassandra.input.throughputMBPerSec to throttle your join. Note that the throttling is based on the RateLimiter class, so it won't kick in immediately (you need to read at least throughputMBPerSec of data before throttling starts). This is something that may be improved in the SCC.
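A minimal sketch of applying that workaround when building the Spark session (the 1 MB/s value is a placeholder to tune; the property can equally be passed as a --conf flag to spark-submit):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder value; tune to what your Cassandra cluster can tolerate
val conf = new SparkConf()
  .set("spark.cassandra.input.throughputMBPerSec", "1")

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()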

Spark : Understanding Dynamic Allocation

I have launched a Spark job with the following configuration:
--master yarn --deploy-mode cluster --conf spark.scheduler.mode=FAIR --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=19 --conf spark.dynamicAllocation.minExecutors=0
It works well and finishes successfully, but after checking the Spark History UI, this is what I saw:
My questions are (I'm interested in understanding more than in solutions):
Why does Spark request the last executor if it has no task to do?
How can I optimise the cluster resources requested by my job in dynamic allocation mode?
I'm using Spark 2.3.0 on YARN.
You need to meet the two requirements for using Spark dynamic allocation:
spark.dynamicAllocation.enabled
spark.shuffle.service.enabled => the purpose of the external shuffle service is to allow executors to be removed without deleting the shuffle files they wrote.
Resources are adjusted dynamically based on the workload; the application gives resources back when it no longer uses them.
I am not sure that there is an order; executors are simply requested in rounds, growing exponentially: the application will add 1 executor in the first round, then 2, 4, 8 and so on...
Configuring external shuffle service
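A minimal sketch of the relevant settings, shown here on a SparkConf for illustration (the timeout values are placeholders and can equally be passed as --conf flags to spark-submit; the external shuffle service itself must also be running on the cluster nodes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "19")
  // how long an executor may sit idle before it is released (placeholder value)
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // how long tasks may be backlogged before extra executors are requested (placeholder value)
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")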
It's difficult to know what Spark did there without knowing the content of the job you submitted. Unfortunately, the configuration string you provided does not say much about what Spark will actually do upon job submission.
You will likely get a better understanding of what happened during a task by looking at the 'SQL' part of the history UI (right side of the top bar) as well as at the stdout logs.
Generally one of the better places to read about how Spark works is the official page: https://spark.apache.org/docs/latest/cluster-overview.html
Happy sparking ;)
It's because of the allocation policy:
Additionally, the number of executors requested in each round increases exponentially from the previous round.
reference

How to stop Spark Structured Streaming from filling HDFS

I have a Spark Structured Streaming task running on AWS EMR that is essentially a join of two input streams over a one-minute time window. The input streams have a 1-minute watermark. I don't do any aggregation. I write results to S3 "by hand" with a foreachBatch and a foreachPartition per batch that converts the data to strings and writes them to S3.
I would like to run this for a long time, i.e. "forever", but unfortunately Spark slowly fills up HDFS storage on my cluster and eventually dies because of this.
There seem to be two types of data that accumulate: logs in /var, and .delta and .snapshot files in /mnt/tmp/.../. They don't get deleted when I kill the task with CTRL+C (or, when using YARN, with yarn application -kill) either; I have to delete them manually.
I run my task with spark-submit. I tried adding
--conf spark.streaming.ui.retainedBatches=100 \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
--conf spark.cleaner.periodicGC.interval=15min \
--conf spark.rdd.compress=true
without effect. When I add --master yarn the paths where the temporary files are stored change a bit, but the problem of them accumulating over time persists. Adding a --deploy-mode cluster seems to make the problem worse as more data seems to be written.
I used to have a Trigger.ProcessingTime("15 seconds") in my code, but removed it as I read that Spark might fail to clean up after itself if the trigger interval is too short compared to the compute time. This seems to have helped a bit; HDFS fills up more slowly, but temporary files are still piling up.
If I don't join the two streams, but just select on both and union the results to write them to S3, the accumulation of cruft in /mnt/tmp doesn't happen. Could it be that my cluster is too small for the input data?
I would like to understand why Spark is writing these temp files, and how to limit the space they consume. I would also like to know how to limit the amount of space consumed by logs.
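For context, a minimal sketch of the kind of pipeline described above (the Kafka sources, topic names, schema, watermark columns, and S3 paths are placeholders, not the original code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("stream-join-sketch").getOrCreate()

// Placeholder sources; the original job reads two input streams with 1-minute watermarks
val left = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092").option("subscribe", "left_topic").load()
  .selectExpr("CAST(key AS STRING) AS id_left", "timestamp AS ts_left")
  .withWatermark("ts_left", "1 minute")

val right = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092").option("subscribe", "right_topic").load()
  .selectExpr("CAST(key AS STRING) AS id_right", "timestamp AS ts_right")
  .withWatermark("ts_right", "1 minute")

// Stream-stream inner join constrained to a one-minute window
val joined = left.join(right,
  expr("id_left = id_right AND ts_right BETWEEN ts_left AND ts_left + interval 1 minute"))

joined.writeStream
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // convert each row to a string and write it to S3, roughly as described above
    batch.selectExpr("to_json(struct(*)) AS value")
      .write.mode("append").text(s"s3://my-bucket/output/batch=$batchId")
  }
  .option("checkpointLocation", "s3://my-bucket/checkpoints/stream-join-sketch")
  .start()
  .awaitTermination()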
Spark fills HDFS with logs because of https://issues.apache.org/jira/browse/SPARK-22783
One needs to set spark.eventLog.enabled=false so that no logs are created.
In addition to adrianN's answer: on the EMR side, application logs are retained on HDFS; see https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
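A minimal sketch of applying that setting when the session is created (it can equally be passed as --conf spark.eventLog.enabled=false to spark-submit):

import org.apache.spark.sql.SparkSession

// Disable the Spark event log so it no longer accumulates on HDFS
val spark = SparkSession.builder()
  .config("spark.eventLog.enabled", "false")
  .getOrCreate()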

Spark dropping executors while reading HDFS file

I'm observing a behavior where a Spark job drops executors while reading data from HDFS. Below is the configuration for the Spark shell.
spark-shell \
--executor-cores 5 \
--conf spark.shuffle.compress=true \
--executor-memory=4g \
--driver-memory=4g \
--num-executors 100
query: spark.sql("select * from db.table_name").count
This particular query spins up ~40,000 tasks. During execution, the number of running tasks starts at 500, then slowly drops to ~0 (I have enough resources), and then suddenly spikes back to 500 (dynamic allocation is turned off). I'm trying to understand the reason for this behavior and to find possible ways to avoid it. This drop and spike happens only in the read stage; all the intermediate stages run in parallel without such huge spikes.
I'll be happy to provide any missing information.

How does COPY work in Cassandra when a table is replicated across multiple nodes in a cluster?

Suppose I want to copy a table from a cluster of 7 nodes with RF = 3 to another cluster of 6 nodes with RF = 3. How can I do that? Can I copy the data from any one of the nodes to a CSV file and then import that data from the CSV file into any node of the new cluster, or should I copy the data from each and every node of the cluster to the new cluster?
Should I decrease the replication factor to 1, copy the data, and then change the replication factor back to 3? I don't think that will work in production. How can I tackle this?
It's not something you have to run on each node. You can use cqlsh's COPY command from a system outside the cluster. Restoring a cluster from sstables/commitlogs is where you need to worry about per-node data (which sstableloader solves as well).
COPY TO will read all the data, and COPY FROM will send each row through the write path, which distributes it according to your RF. It's done far more efficiently than a basic read/write script, but that's ultimately still what it's doing.
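For illustration, a sketch of what that round trip can look like with cqlsh (host, keyspace, table, and file names are placeholders):
cqlsh <source host> -e "COPY my_keyspace.my_table TO 'my_table.csv' WITH HEADER = TRUE"
cqlsh <target host> -e "COPY my_keyspace.my_table FROM 'my_table.csv' WITH HEADER = TRUE"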
Check out my post on this if you have access to Spark (this is the best way to do a migration if you have a LOT of data). The COPY command will work if you don't have a lot of data.
www.sestevez.com/cluster-migration-keeping-simple-things-simple/
wget https://github.com/phact/dse-cluster-migration/releases/download/v0.01/dse-cluster-migration_2.10-0.1.jar
dse spark-submit --class phact.MigrateTable --conf spark.dse.cluster.migration.fromClusterHost='<from host>' --conf spark.dse.cluster.migration.toClusterHost='<to host>' --conf spark.dse.cluster.migration.keyspace='<keyspace>' --conf spark.dse.cluster.migration.table='<table>' ./dse-cluster-migration_2.10-0.1.jar
