Cassandra's lightweight transactions & Paxos consensus algorithm

I have a very particular question regarding Paxos algorithm, which is implemented in Cassandra's lightweight transactions:
What happens if two nodes issue the same proposal at the same time? Do they both get '[applied]: true'?
For example, consider this table:
ids:
+-------------------+---------------+
| id_name (varchar) | next_id (int) |
+-------------------+---------------+
| person_id         | 1             |
+-------------------+---------------+
And this query:
UPDATE ids
SET next_id = 2
WHERE id_name = 'person_id'
IF next_id = 1
If I execute this query, I get as a response:
[{[applied]: True}]
If I execute it again, it won't be applied since next_id != 1, and I get:
[{[applied]: False, next_id: 2}]
My question is: what happens if I execute this query from two nodes in parallel? Is there a chance that they both get applied?
(My use case is described in this stackoverflow question)

The effect of Paxos is that queries get "linearized": two queries executed at the same time on the same row from two different nodes will result in one of them being executed after the other, and the second will not be applied. Obviously, both queries must use CAS for this to work.

It's not possible for both queries to be applied. For each query a proposal is created that is used to reach consensus via Paxos. Consensus is driven by the timestamp associated with the proposal, and even an identical timestamp will still make one of the two proposals fail.
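To make this concrete, here is a minimal sketch, assuming the DataStax Java driver 3.x and a hypothetical keyspace and contact point, of how a client checks whether its CAS update won the race. If two clients run this concurrently, only one of them will see wasApplied() == true; the other gets back the current value of the row.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CasExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            // Conditional (lightweight transaction) update: only applied if next_id is still 1.
            ResultSet rs = session.execute(
                "UPDATE ids SET next_id = 2 WHERE id_name = 'person_id' IF next_id = 1");

            if (rs.wasApplied()) {
                // This client won the race and may safely hand out id 1.
                System.out.println("Reserved id 1");
            } else {
                // Another client applied its update first; the current value is returned.
                Row row = rs.one();
                System.out.println("Lost the race, next_id is now " + row.getInt("next_id"));
            }
        }
    }
}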

Related

Counting operations in Azure Monitor Log Analytics query

I want to query operations like Add-MailboxPermission with FullAccess and deleted emails/calendar events to find compromised accounts (in 30m intervals).
1. How should I modify my code to show only operations which fulfil both conditions at the same time (if I change "or" to "and", it will then check both conditions within a single log entry)?
2. How can I modify the "count" to reduce the results to only those which show at least 10 in the result? Maybe there should be another function?
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" or Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
Welcome to Stack Overflow!
Yes, the logical and operator returns true only if both conditions are true. Check this doc for the query language reference.
Yes again, there is the top operator that's used to return the first N records sorted by the specified columns, used as follows:
OfficeActivity
| where parse_json(Parameters)[2].Value like "FullAccess" and Operation contains "Delete"
| summarize Events=count() by bin(TimeGenerated, 30m), Operation, UserId
| top 10 by Events asc
Additional tip:
There are limit and take operators as well that return a result set of up to the specified number of rows, with the caveat that there is no guarantee as to which records are returned unless the source data is sorted.
Hope this helps!

Sum aggregation for each columns in cassandra

I have a Data model like below,
CREATE TABLE appstat.nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
where nodeip is the partition key and timestamp is the clustering key (sorted in descending order to get the latest).
Sample data in this table,
SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;
nodeip | timestamp | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
172.30.56.60 | 2017-12-08 06:13:07.161000+0000 | yes | 34 | 57 | 19 | 27 | 8 | 89 | 57
172.30.56.60 | 2017-12-08 06:12:07.161000+0000 | yes | 70 | 6 | 43 | 88 | 79 | 83 | 89
This is available as expected, and whenever I need the statistics I am able to get the data using the partition key, like below.
(The above logic seems similar to my previous question: Aggregation in Cassandra across partitions, but the expectation is different.)
I have a value for each column (like readbw, latency, etc.) populated every minute on all 4 nodes.
Now, if I need to get the max value for a column (example: readbw), it is possible using the following query:
SELECT max(readbw) FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;
1) First question: Is there a way to perform a max aggregation on a column (readbw) across all nodes without using an IN query?
2) Second question: Is there a way in Cassandra so that whenever I insert data on Node 1, Node 2, Node 3 and Node 4, it gets aggregated and stored in another table, so that I can collect the aggregated value of each column from that aggregated table?
If any of my point is not clear, please let me know.
Thanks,
Harry
If you are on DSE Cassandra you can enable Spark and write the aggregation queries there.
Disclaimer: in your question you should state your requirements on query speed. Readers do not know whether you're trying to show this in real time or whether it is more for analytical purposes. It's also not clear how much data you're operating on, and the answers might depend on that.
First, decide whether you want to do aggregation on read or on write. This largely depends on your read/write patterns.
1) First question: (aggregation on read)
The short answer is no - it's not possible. If you want to use Cassandra for this, the best approach would be doing the aggregation in your application by reading each nodeip with a timestamp restriction. That would be slow. But Cassandra aggregations are also potentially slow. This warning exists for a reason:
Warnings :
Aggregation query used without partition key
I found the C++ Cassandra driver to be the fastest option, if you're into that.
If your data size allows, I'd look into using other databases. Regular old MySQL or Postgres will do the job just fine, unless you have terabytes of data. There's also InfluxDB if you want a more exotic one. But I'm getting off-topic here.
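To illustrate the application-side aggregation mentioned above, here is a minimal sketch, assuming the DataStax Java driver 3.x and a hypothetical contact point and node list; it queries each nodeip partition separately (and asynchronously) and computes the overall max on the client.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

public class ClientSideMax {
    public static void main(String[] args) {
        List<String> nodeIps = Arrays.asList(
            "172.30.56.60", "172.30.56.61", "172.30.56.62", "172.30.56.63");

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("appstat")) {

            // One query per partition, run asynchronously so they execute in parallel.
            List<ResultSetFuture> futures = new ArrayList<>();
            for (String ip : nodeIps) {
                futures.add(session.executeAsync(
                    "SELECT max(readbw) AS maxbw FROM nodedata " +
                    "WHERE nodeip = ? AND timestamp > ? AND timestamp < ?",
                    ip, new Date(1512537899000L), new Date(1512652272989L)));
            }

            // Combine the per-partition maxima in the application.
            int globalMax = Integer.MIN_VALUE;
            for (ResultSetFuture f : futures) {
                Row row = f.getUninterruptibly().one();
                if (row != null && !row.isNull("maxbw")) {
                    globalMax = Math.max(globalMax, row.getInt("maxbw"));
                }
            }
            System.out.println("max(readbw) across all nodes: " + globalMax);
        }
    }
}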
2) Second question: (aggregation on write)
That's the approach I've been using for a while. Whenever I need some aggregations, I do them in memory (Redis) and then flush to Cassandra. Remember, Cassandra is super efficient at writing data, so don't be afraid to create some extra tables for your aggregations. I can't say exactly how to do this for your data, as it all depends on your requirements. It doesn't seem feasible to provide results for arbitrary timestamp intervals when aggregating on write.
Just don't try to put large sets of data into a single partition. You're better off with traditional SQL databases in that case.
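For the second question, here is a rough sketch of the aggregation-on-write idea, assuming the Jedis Redis client and the DataStax Java driver 3.x; the nodedata_hourly rollup table (hour_bucket bigint, readbw_sum bigint), the Redis key names and the flush schedule are hypothetical and only show the shape of the approach.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import java.util.Date;
import redis.clients.jedis.Jedis;

public class WriteTimeAggregation {

    // Called on every incoming sample: write the raw row to Cassandra
    // and accumulate a running sum for the current hour in Redis.
    static void recordSample(Session session, Jedis redis,
                             String nodeIp, long tsMillis, int readbw) {
        session.execute(
            "INSERT INTO appstat.nodedata (nodeip, timestamp, readbw) VALUES (?, ?, ?)",
            nodeIp, new Date(tsMillis), readbw);

        long hourBucket = tsMillis / 3_600_000L;          // hour the sample falls into
        redis.incrBy("sum:readbw:" + hourBucket, readbw); // atomic in-memory accumulation
    }

    // Called periodically (e.g. once per hour) to flush the in-memory
    // aggregate into a pre-aggregated Cassandra table.
    static void flushHour(Session session, Jedis redis, long hourBucket) {
        String key = "sum:readbw:" + hourBucket;
        String value = redis.get(key);
        if (value == null) return;

        session.execute(
            "INSERT INTO appstat.nodedata_hourly (hour_bucket, readbw_sum) VALUES (?, ?)",
            hourBucket, Long.parseLong(value));
        redis.del(key);
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect();
             Jedis redis = new Jedis("localhost")) {
            recordSample(session, redis, "172.30.56.60", System.currentTimeMillis(), 57);
            flushHour(session, redis, System.currentTimeMillis() / 3_600_000L);
        }
    }
}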

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you added the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have a few choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for the hour 0, process the data and, when finished, run the query for the hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish processing data for hour 0 you can prefetch data for hour 2 and start processing data for hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need data for both hour 3 and hour 6, then you could try to group the data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN for multiple partitions). Remember that your data is already ordered, so your application-level effort would be very small.
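A minimal sketch of this fan-out, assuming the DataStax Java driver 3.x and a hypothetical keyspace and contact point; the table and column names follow the question. It fires the 24 per-hour queries asynchronously and consumes them in order, so no application-side sorting is needed.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class FullPartitionRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            PreparedStatement ps = session.prepare(
                "SELECT * FROM mytable WHERE id = ? AND date = ? AND hour_of_timestamp = ?");

            // Fire one query per hour; they run in parallel, each hitting a single partition.
            List<ResultSetFuture> futures = new ArrayList<>();
            for (int hour = 0; hour < 24; hour++) {
                futures.add(session.executeAsync(ps.bind("x", "10-10-2016", hour)));
            }

            // Consume the results hour by hour; rows within each hour are already
            // ordered by the clustering key, so no extra sorting is needed.
            for (ResultSetFuture future : futures) {
                for (Row row : future.getUninterruptibly()) {
                    // Process the row ...
                }
            }
        }
    }
}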

Traversing the optimum path between nodes

In a graph where there are multiple paths from node (:A) to (:B) through a node (:C), I'd like to extract the paths from (:A) to (:B) going through nodes of type (c:C) where c.Value is maximum. For instance: connect all movies with only their oldest common actors.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, max(a.Age)
The above query returns the proper age for the oldest actor, but not always the correct name.
Conversely, I noticed that the following query returns both correct age and name.
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
with m1, m2, a order by a.age desc
return m1.name, m2.name, a.name, max(a.age), head(collect(a.name))
Would this always be true? I guess so.
Is there a better way to do the job without sorting, which may be costly?
You need to use ORDER BY ... LIMIT 1 for this:
match p=(m1:Movie) <-[:ACTED_IN]- (a:Actor) -[:ACTED_IN]-> (m2:Movie)
return m1.Name, m2.Name, a.Name, a.Age order by a.Age desc limit 1
Be aware that you basically want to do a weighted shortest path. Neo4j can do this more efficiently using Java code and the GraphAlgoFactory; see the chapter on this in the reference manual.
For those who want to do similar things, consider reading this post from #_nicolemargaret, which describes how to extract the n oldest actors acting in pairs of movies rather than just the first, as with head(collect()).

Cassandra Primary Key Selection to Bound Partition Growth

We are currently testing Cassandra as a database for a large amount of metadata on communication events. As most queries will be limited to a single customer, it makes sense to shard by customer ID. However, this would mean the partitions would keep growing indefinitely over time. I'm struggling a bit to come up with a solution that seems clean enough.
The first idea is to use a composite key of customer ID and some time interval. Are there other options, that might be better and grow more organically?
As we want to have as few partition reads as possible, I was thinking of simply using the year to put an upper bound on the data per customer per partition. However, this would distribute the data rather unevenly, if I'm not mistaken. Could this be solved by moving to months or even weeks/days?
I'm sure that this is a problem that often comes up and I'm interested in hearing the various solutions that people put into place.
EDIT: To be clearer about the type of query: they will calculate aggregates over big time slices, per customer. Ideally, we would only have this:
PRIMARY KEY ((customer_id), timestamp)
However, as I've mentioned, this would lead to unbounded growth per partition over the years.
Well, a partition can hold a ton of rows, but if your volume over the years will be a concern, you could borrow an idea from hash tables: when more than one key hashes to the same slot, the extra values are stored in an overflow linked list.
We can extend the same idea to a partition. When a partition for a high volume customer "fills up", we add extra partitions to a list.
So you could define your table like this:
CREATE TABLE events (
    cust_id int,
    bucket int,
    ts int,
    overflow list<int> static,
    PRIMARY KEY ((cust_id, bucket), ts));
For most customers, you would just set bucket to zero and use a single partition. But if the zero partition got too big, then add a 1 to the static list to indicate that you are now also storing data in bucket 1. You can then add more partitions to the list as needed.
For example:
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 0, 1);
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 0, 2);
SELECT * from events;
cust_id | bucket | ts | overflow
---------+--------+----+----------
123 | 0 | 1 | null
123 | 0 | 2 | null
Now imagine you want to start using a second partition for this customer, just add it to the static list:
UPDATE events SET overflow = overflow + [1] WHERE cust_id=123 and bucket=0;
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 1, 3);
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 1, 4);
So to check if a customer is using any overflow bucket partitions:
SELECT overflow FROM events WHERE cust_id=123 and bucket=0 limit 1;
overflow
----------
[1]
Now you can do range queries over the partitions:
SELECT * FROM events WHERE cust_id=123 and bucket IN(0,1) AND ts>1 and ts<4;
cust_id | bucket | ts | overflow
---------+--------+----+----------
123 | 0 | 2 | [1]
123 | 1 | 3 | null
You could define "bucket" to have whatever meaning you wanted, like year or something. Note that the overflow list is defined as static, so it is only stored once with each partition and not with each event row.
Probably the more conventional approach would be to partition by cust_id and year, but then you need to know the start and end years somehow in order to do queries. With the overflow approach, the first bucket is the master and has a standard known value like 0 for reads. A drawback is that you need to do a read to know which bucket to write to; but if each customer generates a large group of events during a communication session, then maybe the overhead of that wouldn't be too much.
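As a rough sketch of the read path under this overflow scheme, assuming the DataStax Java driver 3.x and a hypothetical keyspace and contact point: the client first reads the static overflow list from bucket 0, then queries every bucket the customer uses and merges the results.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class OverflowBucketRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            int custId = 123;

            // Step 1: the master bucket (0) tells us which overflow buckets exist.
            Row master = session.execute(
                "SELECT overflow FROM events WHERE cust_id = ? AND bucket = 0 LIMIT 1",
                custId).one();

            List<Integer> buckets = new ArrayList<>();
            buckets.add(0);
            if (master != null && !master.isNull("overflow")) {
                buckets.addAll(master.getList("overflow", Integer.class));
            }

            // Step 2: range-query every bucket in parallel and merge in the application.
            List<ResultSetFuture> futures = new ArrayList<>();
            for (int bucket : buckets) {
                futures.add(session.executeAsync(
                    "SELECT * FROM events WHERE cust_id = ? AND bucket = ? AND ts > ? AND ts < ?",
                    custId, bucket, 1, 4));
            }
            for (ResultSetFuture f : futures) {
                for (Row row : f.getUninterruptibly()) {
                    System.out.println(row.getInt("bucket") + " " + row.getInt("ts"));
                }
            }
        }
    }
}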
