Is it possible to specify the WRITETIME in a Cassandra INSERT command? - cassandra

I am having a problem where a few INSERT commands are treated as having been sent simultaneously on the Cassandra side, even though my code clearly does not send them simultaneously. (The problem appears when there is a little congestion on the network; otherwise, everything works just fine.)
What I think would solve this problem is a way for me to specify the WRITETIME myself. From what I recall, that was possible in Thrift, but maybe only reading it was (we could certainly read it, at least).
So something like this (mirroring the USING TTL syntax):
INSERT INTO table_name (a, b, c) VALUES (1, 2, 3) USING WRITETIME = 123;
The problem I'm facing is that I overwrite the same data, and once in a while the update is ignored because it ends up with the same or even an older timestamp (probably because it is sent to a different node whose clock is slightly different, and since the C++ process uses threads, a write can be sent earlier or later than intended without my control...)

The magic syntax you're looking for is:
INSERT INTO tbl (col1, col2) VALUES (1,2) USING TIMESTAMP 123456789000
Be very cautious using this approach - make sure you use the right units (microseconds, typically).
You can override the meaning of timestamps in some cases - it's a sneaky trick we've used in the past to do clever things like first-write-wins, and we've even stored leaderboard values in the TIMESTAMP field so the highest score would be the one persisted. But you should REALLY understand the concept before trying these tricks (deletes become nontrivial).
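To sanity-check the units, you can read the value back with the WRITETIME() function; a minimal sketch, reusing the illustrative tbl/col names from above (and assuming col1 is the partition key):
INSERT INTO tbl (col1, col2) VALUES (1, 2) USING TIMESTAMP 123456789000;
SELECT col2, WRITETIME(col2) FROM tbl WHERE col1 = 1;
-- writetime(col2) should come back as 123456789000, i.e. microseconds since the epoch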

Related

PostgreSQL: Is it possible to limit inserts per user based on time difference between timestamp column and current time?

I have an issue where two almost-concurrent requests (+- 10 ms apart) by the same user (unintentionally duplicated by the client side) successfully execute the whole use-case logic twice. I can't really solve this situation in the code of my API, so I've been thinking about how to limit one user_id to inserting a row into the order table at most once every second, for example.
I want to achieve this: if a row with user_id X already exists in the order table and was created (inserted) less than 1 second ago, an insert with user_id X should fail.
This could be an effective way of avoiding requests unintentionally duplicated by the client side, because I can't imagine a situation in which a user would intentionally send two complex requests less than 1 second apart. I'm also interested in any other ideas, for example, what the proper way is to deal with similar situations in APIs.
There is one problem with your idea. If the server becomes really slow for just a second, the orders will arrive more than one second apart in the database and both will be inserted.
I'd recommend generating a unique ID, like a UUID, in the front-end, and sending that with the request. You could, for example, generate a new one every page load. Then, if the server sees that the received UUID already exists in the database, the order is skipped.
This avoids any potential timing issues, but also retains the possibility of someone re-ordering the exact same products.
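For illustration, one way to enforce such an idempotency key in PostgreSQL is a unique column plus ON CONFLICT DO NOTHING; a minimal sketch, assuming a hypothetical orders table and a request_uuid column filled in by the front-end:
CREATE TABLE orders (
  userid       int,
  created_at   timestamptz DEFAULT now(),
  request_uuid uuid UNIQUE   -- generated client-side, e.g. once per page load
);

-- a duplicated request carrying the same UUID silently becomes a no-op
INSERT INTO orders (userid, request_uuid)
VALUES (42, 'b0e0a8f2-3c6f-4c61-9a3e-0d6a4c2e9f10')
ON CONFLICT (request_uuid) DO NOTHING;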
You can do it with an EXCLUDE constraint. You need to create your own immutable helper function, and use an extension.
create extension btree_gist;
create function addsec(timestamptz) returns tstzrange immutable language sql as $$
  select tstzrange($1, $1 + interval '1 second')
$$;
create table orders (
  userid int,
  t timestamptz,
  exclude using gist (userid with =, addsec(t) with &&)
);
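With that schema in place, a quick demonstration (a sketch; the exact error text and the generated constraint name will vary):
insert into orders (userid, t) values (1, now());
-- less than one second later, for the same user:
insert into orders (userid, t) values (1, now());
-- ERROR:  conflicting key value violates exclusion constraint ...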
But you should probably change the front end anyway to include a validation token, as currently it may be subject to CSRF attacks.
Note that EXCLUDE constraints may be much less efficient than UNIQUE constraints. Also, I'm not 100% sure that addsec really is immutable. There might be weird things with leap seconds or something that messes it up.

DSE Cassandra 3.x delete operation

I have a table with a PRIMARY KEY of ( (A,B), C)
Partition key (A,B)
Clustering key C
My question is related to deleting from this table.
Is it more efficient to use the IN clause when deleting, or to issue multiple delete statements using the equality operator?
delete from table where A=xx and B IN ('a','b','c');
-OR-
delete from table where A=xx and B='a';
delete from table where A=xx and B='b';
delete from table where A=xx and B='c';
Is there any harm in using the IN operator, as in the 1st delete statement?
There may be up to around 20 deletes in total (or 20 items in the IN clause).
Thanks in advance for all your help!
With a few small exceptions, it's almost always better to use the 2nd option: multiple deletes issued asynchronously. With the IN clause, the coordinator is put under a lot of load, while the latter evenly distributes the load. Also, with a TokenAware load balancer the requests will go directly to the correct replicas and can complete pretty quickly. If you are doing hundreds or more deletes, though, you might want to use a Semaphore or something similar to limit the number of in-flight deletes, just to prevent overloading the cluster.
It depends on the needs of your application. If the delete operations are expected to be fast, then you'll probably want to run each one explicitly (second option).
On the other hand, if the delete runs as a part of a batch or cleanup job, and nobody really cares how long it takes, then you could probably get away with using IN. The trick there would be keeping it from timing out (and, as Chris indicated, putting undue load on the node). It might make sense to break down your groups of values for column B to keep them small. While 20 list items with IN isn't the most I've heard of someone trying, it's definitely more than I would ever use personally (I'd try to keep it smaller than 10).
Essentially, using the IN operator with a DELETE is going to be susceptible to performance issues just like it would be on a SELECT, as described in this answer (included here for reference):
Is the IN relation in Cassandra bad for queries?

When read-your-own-writes can fail?

I use read and write consistency level QUORUM (RL/WL=QUORUM) and send two updates; is it possible that the next SELECT reads my first update in some circumstances?
CREATE TABLE aggr (
  id int,
  mysum int,
  PRIMARY KEY (id)
);
INSERT INTO aggr (id, mysum) VALUES (1, 2);
INSERT INTO aggr (id, mysum) VALUES (1, 3);
SELECT mysum FROM aggr WHERE id=1;  -- expect mysum=3 here, but is it a must?
As far as I can judge from here, it is even possible to lose part of the second update if the two updates end up with the same timestamp.
If I work around timestamp problem, can I be sure that I always read what I wrote last time?
No, assuming you're using client-side monotonic timestamps (the current default; it wasn't in the past). But it is possible with other settings. I am assuming here that it's a single client issuing those two writes. If the 2 inserts are coming from two different servers, it all depends on their timestamps.
This is the default for the Java driver 3.x, but if you are using a version of Cassandra pre-CQL3 (2.0) you need to provide the timestamps with USING TIMESTAMP in your query, since the protocol didn't support it. Otherwise the two writes can go to different coordinators, and if the coordinators have clock drift between them, the 1st insert may be considered "newer" than the 2nd. With client-side timestamps (which should be the default on your driver if you're using a recent version), that's not the case.
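If you cannot rely on the driver's client-side timestamps, one way to sidestep clock drift at the query level is to set the write timestamp explicitly so the second write is guaranteed to be newer; a sketch, with illustrative microsecond values:
INSERT INTO aggr (id, mysum) VALUES (1, 2) USING TIMESTAMP 1450000000000000;
INSERT INTO aggr (id, mysum) VALUES (1, 3) USING TIMESTAMP 1450000000000001;
SELECT mysum FROM aggr WHERE id=1;  -- with QUORUM reads and writes, this sees mysum=3 once the second insert is acknowledged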
If you do your updates synchronously with CL=QUORUM the second update will always overwrite the first one. A lower consistency level on any of the requests would not guarantee this.

DateTieredCompaction without Timestamp Col

I think this question: Does DateTieredCompactionStrategy work with composite keys? is essentially the same question, but I would like to confirm.
My table (simplified) looks like:
CREATE TABLE foo (
  name text,
  textId text,
  message text,
  PRIMARY KEY ((name), textId)
) WITH CLUSTERING ORDER BY (textId ASC);
We see here that we have no timestamp column, which, to confirm, is in fact not needed for DateTieredCompaction (since this strategy leverages the actual write time, correct?).
The textId does in fact have a 'TS' encoded in it, which is close to the actual write timestamp but will vary a little bit.
Our use case is: all data is inserted (never updated), with a TTL.
In most cases we expect the records to be deleted prior to the TTL. Does this sound like an appropriate use for DateTieredCompaction? If not, why not?
I've started running some tests, and it does appear to be working so far, but it will require longer runs. In particular, I'm trying to avoid an issue we saw before, where performance plummeted after a certain amount of time, which I believe was due to heavy compaction and/or repair processes (we were running on a 3-node cluster with LeveledCompaction). We have since moved up to a 5-node cluster, in the hope that repair will only slow down 2 of the nodes at a time, leaving a full quorum operational at ~100% performance. (In fact, for our use case, we may run a dedicated cluster with repair turned off and replication factor 1, since we self-replicate to other DCs.)
Thoughts?
ps. We are running cassandra 2.1
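For reference, switching such a table to DTCS together with a table-level TTL would look roughly like this (a sketch; the option values are purely illustrative, not tuning advice):
ALTER TABLE foo
  WITH compaction = {
    'class': 'DateTieredCompactionStrategy',
    'base_time_seconds': 3600,
    'max_sstable_age_days': 10 }
  AND default_time_to_live = 86400;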

MemSQL - why can't I do a cross-database insert into .. select

I'm trying to do a simple insert with a field list from a table in one database to a table in another.
insert into db_a.target_table (field1,field2,field3) select field1,field2,field3 from db_b.source_table;
The error message seems straightforward:
MemSQL does not support this type of query: Cross-database INSERT ... SELECT
Oddly enough, this example does work:
insert into db_a.target_table select * from db_b.source_table;
But this seems like such a common scenario. Has anyone run into a similar issue, and were you able to work around it?
Unfortunately, this isn't allowed because it is difficult to keep such queries transactional; multi-statement transactions are used internally to guarantee transactionality of the single insert-select (if one partition fails (dup key or something), we want to roll back everything!). Since we don't have cross-db multi-statement transactions (yet!), we don't have cross-db insert-select (yet!).
Stay tuned for nicer solutions.
That said, if you REALLY want to do this, here is what you do. But be warned:
PROCEED AT YOUR OWN RISK. THIS IS NOT A SUPPORTED PROCEDURE.
It should work, though.
1) On db_b, create a table with the same columns as source_table, but make the shard key SHARD().
2) On db_a, run SHOW PARTITIONS.
3) For each of those partitions, create a connection to db_a_<ordinal> on the host and port listed in SHOW PARTITIONS. Run SHOW DATABASES on that connection and you'll see some databases called db_b_<another>. Pick one; it doesn't matter which. Run INSERT INTO db_b_<another>.source_table SELECT * from db_a_<ordinal>.source_table.
3.5) At this point, you haven't yet written to a table you care about, but now we will. Look at db_b.source_table. Is everything correct? Is all the data there? Run SHOW CREATE TABLE and double check the shard key is SHARD KEY () (it should be in comments). Everything look good? Ok, we can proceed.
4) After you're done doing this for EVERY partition, you can do INSERT INTO db_b.target_table (cols) SELECT cols from db_b.source_table, or whatever you want.
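For illustration, steps 1 and 2 might look roughly like this (a sketch; the staging table and column names are placeholders, not part of the original instructions):
-- step 1, on db_b: same columns as source_table, but keyless sharding
CREATE TABLE staging_table (
  field1 INT,
  field2 INT,
  field3 INT,
  SHARD KEY ()
);

-- step 2: list db_a's partitions with their host and port
SHOW PARTITIONS ON db_a;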
Good luck!
