MemSQL Auto Increment - SingleStore

I'm new to MemSQL and I'm facing a problem with auto increment.
I created a new table with an id column defined as auto_increment (starting at 1). When I insert rows manually with an INSERT statement the generated id starts at 1, but when I insert from Spark the ids start at 48,413,695,994,232,833.
In Spark I'm creating dummy data with val test = Seq(("Test1", "600482", "46987"), ("Test2", "600204", "4870A"), ("Test3", "600204", "469870A")).toDF("confid", "confidprefix", "salesid").
I have tried memsql-connector_2.11-2.0.4.jar and memsql-connector_2.11-2.0.2.jar.
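For context, a minimal sketch of how such a DataFrame might be written from Spark. This uses the generic JDBC writer (MemSQL speaks the MySQL protocol) rather than the memsql-connector mentioned above, and the host, database, table, and credentials are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("memsql-autoinc-test").getOrCreate()
import spark.implicits._

// Dummy rows without an id column; MemSQL fills in the auto_increment value.
val test = Seq(
  ("Test1", "600482", "46987"),
  ("Test2", "600204", "4870A"),
  ("Test3", "600204", "469870A")
).toDF("confid", "confidprefix", "salesid")

// Placeholder connection details; adjust host, database, table, and credentials.
test.write
  .format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")   // MySQL JDBC driver on the classpath
  .option("url", "jdbc:mysql://memsql-host:3306/mydb")
  .option("dbtable", "mytable")
  .option("user", "root")
  .option("password", "")
  .mode("append")
  .save()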

In MemSQL, auto_increment only guarantees that automatically generated values are unique - not that they are sequential like 1, 2, 3... See https://docs.memsql.com/sql-reference/v6.0/create-table/#auto-increment-behavior

In my 5-node MemSQL cluster I had 2 aggregator nodes: a Master Aggregator and a Child Aggregator.
I removed the Child Aggregator from the cluster and it is now working fine; it generates the auto increment ids as I expect.
Note that the automatically generated values can differ depending on which aggregator you run the inserts on. Of course, if you ran some inserts on one aggregator and some inserts on another aggregator, you would get different automatically generated values. Also note that automatically generated values and explicitly set values can collide.

Related

Partition DELETE/INSERT concurrency issue in Cassandra

I have a table in Cassandra which stores versions of csv-files. It uses a primary key with a unique id for the version (the partition key) and a row number (the clustering key). When I insert a new version I first execute a delete statement on the partition key I am about to insert, to clean up any incomplete data. Then the data is inserted.
Now here is the issue. Even though the delete and subsequent insert are executed synchronously, one after the other in the application, it seems that some level of concurrency still exists in Cassandra, because when I read afterwards, rows from my insert are occasionally missing - something like 1 in 3 times. Here are some facts:
Cassandra 3.0
Consistency ALL (R+W)
Delete using the Java Driver
Insert using the Spark-Cassandra connector
Number of nodes: 2
Replication factor: 2
The delete statement I execute looks like this:
"DELETE FROM myTable WHERE version = 'id'"
If I omit it, the problem goes away. If I insert a delay between the delete and the insert the problem is reduced (fewer rows missing). Initially I used a less restrictive consistency level and was sure this was the issue, but it didn't affect the problem. My hypothesis is that for some reason the delete statement is being sent to the replicas asynchronously despite the consistency level of ALL, but I can't see why this would be the case or how to avoid it.
By default, all mutations get a write timestamp from the coordinator that handles that write. From the docs:
TIMESTAMP: sets the timestamp for the operation. If not specified,
the coordinator will use the current time (in microseconds) at the
start of statement execution as the timestamp. This is usually a
suitable default.
http://cassandra.apache.org/doc/cql3/CQL.html
Since the coordinator for different mutations can be different, clock skew between coordinators can cause mutations sent to one machine to be timestamped behind mutations sent to another.
Since write time controls C* history, this means you can have a driver which synchronously inserts and deletes, but depending on the coordinator the delete can happen "before" the insert.
Example
Imagine two nodes A and B, B is operating with a 5 second clock skew behind A.
At time 0: You insert data to the cluster and A is chosen as the coordinator. The mutation arrives at A and A assigns a timestamp (0)
There is now a record in the cluster
INSERT VALUE AT TIME 0
Both nodes contain this message and the request returns confirming the write was successful.
At time 2: You issue a delete for the data previously inserted and B is chosen as the coordinator. B assigns a timestamp of (-3) because its clock is skewed 5 seconds behind A's. This means that we end up with a statement like
DELETE VALUE AT TIME -3
We acknowledge that all nodes have received this record.
Now the global consistent timeline is
DELETE VALUE AT TIME -3
INSERT VALUE AT TIME 0
Since the insert sorts after the delete in this timeline, the value still exists.
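One way to sidestep coordinator clock skew (a sketch, not part of the original answer) is to assign the write timestamps yourself from a single client clock, so the delete is guaranteed to sort before the subsequent insert. Table and column names below are placeholders, and both statements go through the Java driver for brevity:

import com.datastax.driver.core.Cluster

// Placeholder contact point and keyspace.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("mykeyspace")

def nowMicros(): Long = System.currentTimeMillis() * 1000L

// Both mutations carry timestamps from the same client clock, so their relative
// order no longer depends on which coordinator handles each of them.
val deleteTs = nowMicros()
session.execute(s"DELETE FROM mytable USING TIMESTAMP $deleteTs WHERE version = 'id'")

val insertTs = deleteTs + 1  // strictly greater, so the insert wins over the tombstone
session.execute(s"INSERT INTO mytable (version, rownum, data) VALUES ('id', 1, 'row1') USING TIMESTAMP $insertTs")

cluster.close()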
I had a similar problem, and I fixed it by using lightweight transactions for both INSERT and DELETE requests (for all queries actually, including UPDATE). This makes sure all queries to the partition are serialized through one "thread", so the DELETE won't overwrite the INSERT. For example (assuming instance_id is a primary key):
INSERT INTO myTable (instance_id, instance_version, data) VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS;
UPDATE myTable SET instance_version=1, data='some-updated-data' WHERE instance_id='myinstance' IF instance_version=0;
UPDATE myTable SET instance_version=2, data='again-some-updated-data' WHERE instance_id='myinstance' IF instance_version=1;
DELETE FROM myTable WHERE instance_id='myinstance' IF instance_version=2
//or:
DELETE FROM myTable WHERE instance_id='myinstance' IF EXISTS
IF clauses enable lightweight transactions for each row, so all of them are serialized. Warning: LWTs are more expensive than normal calls, but sometimes they are needed, as in the case of this concurrency problem.
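As a small companion sketch (not from the original answer; contact point and keyspace are placeholders): a conditional statement reports whether its IF condition actually applied, and that flag should be checked before assuming the write succeeded:

import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("mykeyspace")

// LWT statements return an [applied] column; wasApplied() exposes it.
val rs = session.execute(
  "INSERT INTO myTable (instance_id, instance_version, data) " +
  "VALUES ('myinstance', 0, 'some-data') IF NOT EXISTS")
if (!rs.wasApplied())
  println("row already existed; the conditional insert was rejected")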

Cassandra UPDATE not working after deletion

I'm using a wide row schema in Cassandra. My table definition is as follows:
CREATE TABLE usertopics (
key text,
topic text,
score counter,
PRIMARY KEY (key, topic)
)
I'm inserting entries using:
UPDATE usertopics SET score = score + ? WHERE key=? AND topic=?
such that if key does not exist it will insert and if it exists it will update.
I'm deleting entries using:
DELETE FROM usertopics WHERE key IN ?
But after the deletion, when I try to update again it's not updating. It gives no error, but the change is not reflected in the database either.
After truncating the table, inserts work perfectly again. I'm using the DataStax Java driver to access Cassandra. Any suggestions?
From the Cassandra documentation:
Counter removal is intrinsically limited. For instance, if you issue
very quickly the sequence "increment, remove, increment" it is
possible for the removal to be lost (if for some reason the remove
happens to be the last received messages). Hence, removal of counters
is provided for definitive removal only, that is when the deleted
counter is not incremented afterwards. This holds for row deletion too:
if you delete a row of counters, incrementing any counter in that row
(that existed before the deletion) will result in an undetermined
behavior. Note that if you need to reset a counter, one option (that
is unfortunately not concurrent safe) could be to read its value and
add -value.
Once deleted, a counter with the same key cannot/should not be used again. Please see the links below for further info:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
https://wiki.apache.org/cassandra/Counters
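For completeness, a small sketch of the read-then-subtract reset mentioned in the quote (not from the original answer; the contact point, key, and topic values are placeholders, and this is not safe under concurrent increments):

import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("mykeyspace")

// Read the current counter value, then subtract it to bring the counter back to zero.
// As the documentation notes, this reset is not concurrency safe.
val row = session.execute("SELECT score FROM usertopics WHERE key = 'k1' AND topic = 't1'").one()
if (row != null) {
  val current = row.getLong("score")
  session.execute(s"UPDATE usertopics SET score = score - $current WHERE key = 'k1' AND topic = 't1'")
}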

Change order of composed primary key

I have a Cassandra cluster and I want to use the CQL "IN" query. Therefore I have to change the order of the elements in my composite primary key (only the last component is available for "IN" queries). The table is quite big but does not span multiple nodes yet.
So what I have tried now (which is not working) is the following:
create a new column family with identical columns but different order of primary key elements
stop write processes and nodetool flush
copy all /data/keyspace/columnfamily/ files
rename the files to match the new column family name
use the sstable loader to load the files into the new column family
But afterwards the primary key is just messed up:
Failed to decode value '53ccb45d4ab0d3560e8c36fd' (for column 'cent') as int: unpack requires a string argument of length 4
I also cannot use COPY ... TO ... because it just times out.
Any ideas?
There are a couple of good bulk loaders available on GitHub that work better and won't time out like the cqlsh COPY TO/FROM tool.
You can find one here or here.
Otherwise I'd recommend using something like Spark to move the data for you.
You could also use Scala with the Spark Cassandra connector once you have your second table created:
import com.datastax.spark.connector._

// Read from the old table and write into the table with the reordered key.
val mydata = sc.cassandraTable("mykeyspace", "mytable")
  .select("key", "column1", "column2", "column3")
mydata.saveToCassandra("whateverkeyspace", "whatevertable", SomeColumns("key", "column1", "column2", "column3"))

batch update cassandra with lightweight transaction

I am using Cassandra 2.2.3 and want to make a batch update with two statements, both using a lightweight transaction.
BEGIN BATCH
UPDATE account SET values['balance'] = 11 WHERE id = 1 IF values['balance'] = 10;
UPDATE account SET values['balance'] = 11 WHERE id = 2 IF values['balance'] = 10;
APPLY BATCH;
The batch returns following error:
InvalidRequest: code=2200 [Invalid query] message="Batch with conditions cannot span multiple partitions".
I understand that it is not possible to batch across different PKs in the WHERE clause because of the partitions, but why is it not possible to do a batch on the same PK? The problem is the IF conditions: if I remove them, the batch works.
So is there a solution to successfully execute such a batch update? Or any workaround?
EDIT:
This is my schema:
CREATE TABLE booking.account (
id int PRIMARY KEY,
values map<varchar, decimal>,
timestampCreate timestamp,
timestampUpdate timestamp
);
I understand that it is not possible to make a batch on various PKs in
the where clause because of the partitions, but why it is not possible
to do a batch on the same PK?
You could make a batch over various PKs in the WHERE clause; however, this is not recommended (please refer to Cassandra: Batch loading without the Batch keyword).
The problem here is the conditional update (the IF clause). Quoting from the DataStax CQL reference:
In Cassandra 2.0.6 and later, you can batch conditional updates
introduced as lightweight transactions in Cassandra 2.0. Only updates
made to the same partition can be included in the batch because the
underlying Paxos implementation works at the granularity of the
partition. You can group updates that have conditions with those that
do not, but when a single statement in a batch uses a condition, the
entire batch is committed using a single Paxos proposal, as if all of
the conditions contained in the batch apply.
So do you really need a batch statement? Read this: Using and misusing batches
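Given that id = 1 and id = 2 live in different partitions, one workaround (a sketch under that assumption, using the Java driver from Scala; not part of the quoted reference) is to run the two conditional updates as independent lightweight transactions and check each outcome:

import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("booking")

// Two separate LWTs instead of one conditional batch spanning two partitions.
val first = session.execute(
  "UPDATE account SET values['balance'] = 11 WHERE id = 1 IF values['balance'] = 10")
val second = session.execute(
  "UPDATE account SET values['balance'] = 11 WHERE id = 2 IF values['balance'] = 10")

// Each result says whether its IF condition applied. Atomicity across the two
// partitions is lost, so a failure must be handled (retried or compensated) by the application.
if (!first.wasApplied() || !second.wasApplied())
  println("at least one conditional update did not apply")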

Cassandra insert fails

We're experiencing problems writing data to a Cassandra table.
The flow is as follows: we delete all records from XXX with some primary key,
then insert new ones in a loop.
execute("DELETE FROM XXX WHERE key = {SOME_UUID}");
for (int i = 0; i < 5; ++i) {
    execute("INSERT INTO XXX (key, field1, field2) VALUES ({SOME_UUID}, 'field1', 'field2')");
}
The result: sometimes not all rows make it into the table; when we query it afterwards, some of the inserted rows are missing.
The environment we have:
We use DataStax Enterprise Edition (4.5.2). Cassandra 2.0.10.
The datacenter has 4 nodes and the keyspace we work on has replication_factor set to 3.
The queries' consistency level is set to LOCAL_QUORUM.
The Java driver is DataStax Enterprise 2.1.1.
Thanks in advance.
Any help would be appreciated.
I assume in your example that SOME_UUID is the same for the delete and the insert.
It's probably a race condition between the delete (tombstone) and the new inserts being propagated to all the nodes (per your replication factor). If the delete and insert are marked with the same timestamp, the delete will win. You may have a case where on some nodes the delete wins, and on others the insert wins.
You could try lowering the RF to 1, as @BryceAtNetwork23 suggested.
Another test would be to insert a delay (like 500ms) in your sample code between the delete and the insert for loop. That would give time for the delete to propagate before the inserts come through.
Depending on your data model, the best solution here might be to avoid the need for the deletes.
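As a rough illustration of the delay test suggested above (the contact point, keyspace, UUID, and table are placeholders; the 500 ms pause is arbitrary and this is a diagnostic, not a fix):

import java.util.UUID
import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("mykeyspace")
val someUuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000") // placeholder key

// Delete, then give the tombstone time to propagate before re-inserting.
session.execute(s"DELETE FROM XXX WHERE key = $someUuid")
Thread.sleep(500)

for (i <- 0 until 5)
  session.execute(s"INSERT INTO XXX (key, field1, field2) VALUES ($someUuid, 'field1', 'field2')")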
