Test Cassandra's R + W > N rule

I want to explicitly show the R + W > N rule in a presentation.
My initial idea of how to do this is as follows:
// setup
1: Create a Docker network with 3 Cassandra nodes.
2: Create a simple keyspace with a replication factor of 3.
3: Explicitly shut down two of the Docker nodes.
4: Run a write query inserting some data with W=1.
As two of the nodes are down but one is still online, this should succeed.
5: Bring both nodes back online.
6: Read the data I just pushed to one node with R=1.
If I understand R + W > N correctly, I should now have a 2/3 chance of getting inconsistent data, which is exactly what I want to show.
My question is:
Is there an option I need to disable in order to stop the nodes from syncing when they come back online? Which options would I need to disable?
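For reference, steps 2, 4 and 6 might look roughly like this in cqlsh (keyspace and table names are made up for illustration):
-- step 2: keyspace with replication factor 3
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE demo.kv (id int PRIMARY KEY, value text);
-- step 4: write with W = 1 while two nodes are down
CONSISTENCY ONE;
INSERT INTO demo.kv (id, value) VALUES (1, 'only one replica saw this');
-- step 6: after bringing the nodes back, read with R = 1;
-- the coordinator may answer from a replica that missed the write
CONSISTENCY ONE;
SELECT * FROM demo.kv WHERE id = 1;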

You need to disable hints on all nodes (set hinted_handoff_enabled to false in cassandra.yaml); in that case replication will happen only when you explicitly run nodetool repair.
You also need to make sure that read_repair_chance and dclocal_read_repair_chance are set to 0 for the table where you'll run the test. The easiest way is to specify these options when creating the table:
CREATE TABLE ... (
    ...
) WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;
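If the test table already exists, the same options can be applied after the fact with ALTER TABLE, and hints can also be switched off at runtime with nodetool disablehandoff on each node. A sketch, reusing the placeholder table name from above:
ALTER TABLE demo.kv WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;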

Related

Nodetool load and owns stats

We are running 2 nodes in a cluster with a replication factor of 1.
After writing a burst of data, we see the following via nodetool status:
Node 1 - load 22 GB (owns 48.2%)
Node 2 - load 17 GB (owns 51.8%)
As the payload size per record is exactly equal, what could lead to a node showing a higher load despite lower ownership?
Nodetool status uses the Owns column to indicate the effective percentage of the token range owned by each node, while Load is the size on disk of the data that node holds.
I don't see anything wrong here: your data is almost evenly distributed across your two nodes, which is exactly what you want for good performance.
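For reference, the corresponding part of the nodetool status output looks roughly like this (abridged, with the figures from the question; exact columns vary by version, UN = up/normal):
--  Address    Load   Owns    Rack
UN  10.0.0.1   22 GB  48.2%   rack1
UN  10.0.0.2   17 GB  51.8%   rack1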

How long will it take to delete a row?

I have a 2 node Cassandra cluster with RF=2.
When a delete from x where y CQL statement is issued, is it known how long it will take for all nodes to delete the row?
What I see in one of the integration tests:
A row is deleted, and the result of the deletion is tested with a select * from y where id = xxx statement. What I see is that sometimes the result is not null as expected and the deleted row is still found.
Is the correct approach to read with CL=2 to get the result I am expecting?
Make sure that the servers' clocks are in sync if you are using server-side timestamps. Better to use client-side timestamps.
"Is the correct approach to read with CL=2 to get the result I am expecting?"
I assume you are using the default consistency level of 1 for the delete; since 1 + 2 > 2 (i.e. W + R > N) in your case, that is OK.
Local to the replica, the delete itself is sub-millisecond. The time is dominated by the app -> coordinator -> replica -> coordinator -> app network hops. Use QUORUM or LOCAL_QUORUM on both the write and the read for sequential requests like that.
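A minimal cqlsh sketch of that advice, reusing the table name y from the question (the keyspace, id value and timestamp are made up):
CONSISTENCY QUORUM;
-- explicit, client-chosen timestamp, so ordering does not depend on server clocks
DELETE FROM test.y USING TIMESTAMP 1514764800000000 WHERE id = 123;
SELECT * FROM test.y WHERE id = 123;  -- should return no rows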

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and Date.
I am trying to achieve this with the following Spark 2.2.1 code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}

val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
// look back over the previous rows of the same ID, ordered by date
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) // first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2 hours without completing.
Am I doing something wrong at the code level? Given the size of the data and the instances, I would assume it should finish in a reasonable amount of time. And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many entries like these:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently withColumn can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of it.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes in query plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: a Spark JIRA issue and this StackOverflow question.
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
// build a single select with one coalesce(...) expression per column,
// instead of chaining hundreds of withColumn calls
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)

Cassandra reads failing intermittently

I am facing a puzzling issue where my Cassandra reads don't always return a value for a key 'K'. The key was inserted into a column family almost 2 weeks ago, and when I queried for it with cqlsh 8 times in a row, 2 times it didn't return a value and 6 times it showed the result correctly. The replication factor for my keyspace is 3.
The key was written with QUORUM consistency (Java Hector client). I am reading it using cqlsh, which has a default consistency of ONE. I have not been able to come up with any explanation for this so far. Any thoughts?
The reason is that you are not respecting the consistency level inequality:
(WRITE CL + READ CL) > REPLICATION FACTOR
With RF = 3, QUORUM = 2, so in your case the inequality reads:
(WRITE CL) 2 + (READ CL) 1 > 3, i.e. 2 + 1 > 3, which is FALSE.
In this post you can find how to achieve consistency. However, you might also be interested in tuning your read repair chance.
Try setting CL to QUORUM in cqlsh using the CONSISTENCY command:
cqlsh> CONSISTENCY QUORUM;
and perform the read again; with QUORUM on both sides, 2 + 2 > 3 holds, so the read is guaranteed to see the QUORUM write.
HTH,
Carlo

Partitioning for cassandra nodes, using Murmur3Partitioner

In my project I use Cassandra 2.0 and have 3 database servers.
2 of the 3 servers have 2 TB of hard drive; the last has just 200 GB. So I want the 2 big servers to take a higher share of the load than the last one.
I use the Murmur3Partitioner to partition the data.
My question is: how can I calculate the initial_token for each Cassandra instance?
Thanks for your help :)
If you are using a somewhat recent version of Cassandra (2.x), you can configure the number of tokens a node should hold relative to the other nodes in the cluster. There is no need to specify token range boundaries via initial_token any more. Instead you give a node a "weight" through the num_tokens parameter. As the capacity of your smaller node is roughly 1/10th of the big ones, adjust the weight of that node accordingly. The default weight is 256, so you could start with a weight of 25 for the smaller node and see whether it works OK that way.
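A sketch of the corresponding cassandra.yaml lines (the value 25 is just the rough 1/10th weighting suggested above, not a tested figure):
# on the two 2 TB nodes (the default)
num_tokens: 256
# on the 200 GB node
num_tokens: 25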
Murmur3Partitioner: uniformly distributes the data across the cluster based on the MurmurHash hash value.
Murmur3Partitioner uses a maximum possible range of hash values from -2^63 to +2^63 - 1. Here is the formula to calculate tokens:
python -c 'print [str(((2**64 / number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]'
For example, to generate tokens for 10 nodes:
python -c 'print [str(((2**64 / 10) * i) - 2**63) for i in range(10)]'

Resources