Cassandra update max(a,b) function - cassandra

How can I make a request of this type in Cassandra?
UPDATE my_table SET my_column1 = MAX(my_column1, 100) and my_column2 = my_column2 + 10;
The max() function does not exist. Can this be done using Apache Spark?
Thanks!

MAX is idempotent and seems simple to implement in this case, but Cassandra is a general-purpose database and needs to handle some edge cases. In particular, deletes and TTLs are an issue: as old data goes away, the database still needs to maintain the max.
A couple of ways you can do this: either keep a single value that you update atomically on insert, or keep all the inserted values in order so that as rows are deleted or expire, the older ones are still there to take their place (at an obvious disk cost).
CREATE TABLE my_table_max (
    key text,
    max int static,
    deletableMax int,
    PRIMARY KEY (key, deletableMax)
) WITH CLUSTERING ORDER BY (deletableMax DESC);
Then atomically update your max or, for the deletable implementation, insert the new value:
BEGIN BATCH
INSERT INTO my_table_max (key, max) VALUES ('test', 1) IF NOT EXISTS;
INSERT INTO my_table_max (key, deletableMax) VALUES ('test', 1);
APPLY BATCH;
BEGIN BATCH
UPDATE my_table_max SET max = 5 WHERE key='test' IF max = 1;
INSERT INTO my_table_max (key, deletableMax) VALUES ('test', 5);
APPLY BATCH;
Then querying the first row for the key gives you the max:
select * from my_table_max where key = 'test' limit 1;
 key  | deletablemax | max
------+--------------+-----
 test |            5 |   5
The difference between the two approaches shows up after a delete:
delete from my_table_max WHERE key = 'test' and deletablemax = 5;
cqlsh:test_ks> select * from my_table_max where key = 'test' limit 1;
 key  | deletablemax | max
------+--------------+-----
 test |            1 |   5
Since the deletableMax approach keeps all the values in order, the older value is still there to take its place, while the static max column still reports the now-deleted 5.
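As for the Apache Spark part of the question: since CQL has no max(a, b) in an UPDATE, another option is a read-modify-write job. Below is a minimal sketch, assuming the DataStax spark-cassandra-connector and a made-up schema (keyspace my_ks and primary key column id are assumptions, not from the question). Note that this is not atomic, so concurrent writers can still race.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("max-update")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
val sc = new SparkContext(conf)

sc.cassandraTable("my_ks", "my_table")                 // assumed keyspace name
  .map { row =>
    val id = row.getInt("id")                          // assumed primary key column
    val c1 = math.max(row.getInt("my_column1"), 100)   // emulates MAX(my_column1, 100)
    val c2 = row.getInt("my_column2") + 10
    (id, c1, c2)
  }
  .saveToCassandra("my_ks", "my_table", SomeColumns("id", "my_column1", "my_column2"))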

Related

Searching in collection of tuple - Cassandra

Here is my data :-
CREATE TABLE collect_things (k int PRIMARY KEY, n set<frozen<tuple<text, text>>>);
INSERT INTO collect_things (k, n) VALUES(1, {('hello', 'cassandra')});
CREATE INDEX n_index ON collect_things (n);
Now I have to query like this :-
SELECT * FROM collect_things WHERE n contains ('cassandra') ALLOW FILTERING ;
Output :-
k | n
---+---------
Expected output :-
k | n
---+---------
1 | {('hello', 'cassandra')}
I want to fetch the rows that contain the value 'cassandra'. Is this possible?
A tuple inside a collection must be defined as frozen.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields, but Cassandra treats the value of a frozen type as a blob: the entire value must be overwritten.
You must treat a frozen tuple as a single value; you can't match on its individual components. So when querying, provide the complete tuple ('hello', 'cassandra'):
SELECT * FROM collect_things WHERE n CONTAINS ('hello', 'cassandra');
If you have the data :
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'seach')}
2 | {('test', 'seach')}
Output :
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'seach')}
Source : https://docs.datastax.com/en/cql/3.1/cql/cql_reference/collection_type_r.html

set minimum counter value in cassandra counter table

In Cassandra, is there any way to define a minimum value for a counter column? Say, when the counter reaches 0 it should not go below that, even if I run a decrement operation.
There isn't. Your counter is initialized with a value of 0; you can increment it, decrement it, and query its value. It is a 64-bit integer (between -2^63 and 2^63 - 1). Addition / subtraction will overflow when you hit the min / max values.
If you try to handle the logic in your application, it is easy only if a single application instance writes, but you probably have more than one. It would be doable if your applications run on the same system and can share a lock; I am guessing that's not the case, and the performance would drop anyway. In a distributed environment you would need a distributed lock, and performance would suffer.
If you really want to achieve this functionality with Cassandra, you can emulate it with the following strategy:
1. Table definition
CREATE TABLE test.counter (
    my_key tinyint,
    my_random uuid,
    my_operation int,
    my_non_key tinyint,
    PRIMARY KEY ((my_key), my_operation, my_random)
);
This table will be used to keep track of the increment / decrement operation you are running. A few notes:
The partition key my_key will always be used with the same value: 0. It is used to colocate all the operations (increment / decrement) in the same partition.
The my_random value must be a random value generated without any chance of collision; a uuid can be used for that. Without this column, executing the same operation twice (such as increment by 10) would only be stored once. Each operation gets its own uuid.
my_operation keeps track of the increment / decrement value you execute.
my_non_key is a dummy column that we are going to use to query the write timestamp, as we cannot query it on primary key columns. We will always set my_non_key to 0.
2. Counter initialization
You can initialize your counter, say to zero, with:
INSERT INTO test.counter (my_key, my_random, my_operation, my_non_key) VALUES (0, 419ec9cc-ef53-4767-942e-7f0bf9c63a9d, 0, 0);
3. Counter increment
Let's say you add some number, such as 10. You would do so by inserting with the same partition key 0, a new random uuid, and a value of 10:
INSERT INTO test.counter (my_key, my_random, my_operation, my_non_key) VALUES (0, d2c68d2a-9e40-486b-bb69-42a0c1d0c506, 10, 0);
4. Counter decrement
Let's say you subtract 15 now:
INSERT INTO test.counter (my_key, my_random, my_operation, my_non_key) VALUES (0, e7a5c52c-e1af-408f-960e-e98c48504dac, -15, 0);
5. Counter increment
Let's say you add 1 now:
INSERT INTO test.counter (my_key, my_random, my_operation, my_non_key) VALUES (0, 980554e6-5918-4c8d-b935-dde74e02109b, 1, 0);
6. Counter query
Now, let's say you want to query your counter, you would need to run:
SELECT my_operation, writetime(my_non_key), my_random FROM test.counter WHERE my_key = 0;
which will return 0; 10; -15; 1, each with the timestamp at which it was written. Your application now has all the information to calculate the correct value, since it knows in which order the increment / decrement operations occurred. This is of course necessary when the counter approaches zero and would otherwise go negative. In this case, your application should be able to calculate that the right value is 1.
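For illustration, here is a minimal sketch of that client-side calculation in plain Scala (the floor of 0 and the sample writetime values are assumptions; the rows stand in for the (writetime(my_non_key), my_operation) pairs returned by the SELECT above):
val rows: Seq[(Long, Long)] = Seq(       // (writetime(my_non_key), my_operation)
  (1000L, 0L), (2000L, 10L), (3000L, -15L), (4000L, 1L)
)

val counterValue = rows
  .sortBy(_._1)                          // apply the operations in write order
  .foldLeft(0L) { case (acc, (_, op)) => math.max(0L, acc + op) }

println(counterValue)                    // prints 1, matching the example above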
7. Cleaning up
At regular intervals, or when you query the counter, you can combine values together and delete the old ones in a batch statement to ensure atomicity, for example:
BEGIN BATCH
DELETE FROM test.counter WHERE my_key = 0 AND my_operation = 0 AND my_random = 419ec9cc-ef53-4767-942e-7f0bf9c63a9d;
DELETE FROM test.counter WHERE my_key = 0 AND my_operation = 10 AND my_random = d2c68d2a-9e40-486b-bb69-42a0c1d0c506;
DELETE FROM test.counter WHERE my_key = 0 AND my_operation = -15 AND my_random = e7a5c52c-e1af-408f-960e-e98c48504dac;
DELETE FROM test.counter WHERE my_key = 0 AND my_operation = 1 AND my_random = 980554e6-5918-4c8d-b935-dde74e02109b;
INSERT INTO test.counter (my_key, my_random, my_operation, my_non_key) VALUES (0, ca67df54-62c7-4d31-a79c-a0011439b486, 1, 0);
APPLY BATCH;
Final notes
Performance-wise, this should be acceptable, as writes are cheap and reads hit a single partition.
Cassandra is an eventually consistent database, which means this counter is also eventually consistent. If you need strong consistency, you will need to tune your read/write consistency levels accordingly:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html
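For illustration only, a minimal sketch of setting the consistency level per statement with the DataStax Java driver 3.x from Scala (the contact point and the driver version are assumptions, not something the answer prescribes):
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // assumed contact point
val session = cluster.connect()

val read = new SimpleStatement("SELECT my_operation, writetime(my_non_key), my_random FROM test.counter WHERE my_key = 0")
read.setConsistencyLevel(ConsistencyLevel.QUORUM) // pair QUORUM reads with QUORUM writes for strong consistency
val rs = session.execute(read)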

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id   event   timestamp
----------------------
1    A       1
1    B       2
1    C       4
1    D       7
1    E       15
1    F       16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key : value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this was not Spark, I could have simply created an array timeDifferences, store the differences in two adjacent timestamps, and split the array into parts whenever I see a time difference in timeDifferences that is larger than TIMEOUT. (Although, actually there's no need to explicitly create an array)
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1, "A", 1.0), (1, "B", 2.0), (1, "C", 15.0))).zipWithIndex.map(x => (x._2, x._1))
val B = A.map(x => (x._1 - 1, x._2))
// attach to each entry the gap to the *next* entry (0.0 when there is no next entry)
val C = A.leftOuterJoin(B).map(x => (x._2._1, (x._2._2 match {
  case Some(a) => a._3
  case _ => x._2._1._3
}) - x._2._1._3))
val group1 = C.filter(x => x._2 <= 5) // entries followed within the timeout
val group2 = C.filter(x => x._2 > 5)  // entries that are last in their window
So the concept is: you zip with index to create val A (which assigns a sequential Long index to each entry of your RDD), duplicate the RDD with the index shifted by one to create val B (by subtracting 1 from the index), then use a join to work out the gap between consecutive entries and filter on it against the TIMEOUT. This method stays in RDDs. An easier way is to collect the entries to the driver and use a plain map or zipped mapping, but that would be Scala rather than Spark, I guess.
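For completeness, here is a small sketch of that collect-to-the-driver alternative in plain Scala (the sample data and TIMEOUT = 5 are taken from the question; this only works when the collected data fits in driver memory):
val events = Seq((1, "A", 1L), (1, "B", 2L), (1, "C", 4L), (1, "D", 7L), (1, "E", 15L), (1, "F", 16L))
val TIMEOUT = 5L

// start a new window whenever the gap to the previous timestamp exceeds TIMEOUT
val windows = events.tail.foldLeft(List(List(events.head))) { case (acc, e) =>
  if (e._3 - acc.head.head._3 > TIMEOUT) List(e) :: acc
  else (e :: acc.head) :: acc.tail
}.map(_.reverse).reverse

windows.foreach(println)
// List((1,A,1), (1,B,2), (1,C,4), (1,D,7))
// List((1,E,15), (1,F,16))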
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i - 1, e) })
  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })
  // collecting (to driver memory!) cutoff points - timestamps of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
  // going back to the original input, grouping by each event's nearest cutoff point (i.e. the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
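A quick usage sketch against the question's sample data (assuming an existing SparkContext named sc):
val sample = sc.parallelize(Seq(
  Event(1, "A"), Event(2, "B"), Event(4, "C"),
  Event(7, "D"), Event(15, "E"), Event(16, "F")
))
val windows = splitToTimeWindows(sample, timeoutBetweenWindows = 5)
windows.collect().foreach { w =>
  println(w.toSeq.sortBy(_.timestamp).map(_.data).mkString(", "))
}
// prints the two windows: "A, B, C, D" and "E, F"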
The first part builds on GameOfThrows's answer - joining the input with itself with 1's offset to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.

How do I select a range of elements in Spark RDD?

I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(i: int) method, which returns the first i elements. But there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.
I don't think there is an efficient method to do this yet. But the easy way is to use filter(). Let's say you have an RDD named pairs with key-value pairs and you only want elements with keys from 60 to 80 inclusive; just do:
val range60to80 = pairs.filter {
  case (k, v) => k >= 60 && k <= 80
  case _ => false // in case of invalid input
}
I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.
From looking at the spark source it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner that knows all the upper bounds of the partitions, so it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911
UPDATE: Here is a much better answer, based on a pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)
val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}
for ((k, v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect()) println(s"$k, $v")
If having the whole partition in memory is acceptable, you could even do something like this:
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower), a.search(upper) + 1)).collect()
search is not a member of Array, by the way; I just made an implicit class that has a binary search function, not shown here.
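For reference, one possible shape for such an implicit class (an assumption on my part, not the author's code): a binary search over a sorted Array[(Int, Int)] by key, returning the index of the first element whose key is at least the argument, which is what the slice call above needs.
implicit class SearchableArray(arr: Array[(Int, Int)]) {
  // assumes arr is sorted by key; returns the index of the first element with key >= target
  def search(target: Int): Int = {
    var lo = 0
    var hi = arr.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (arr(mid)._1 < target) lo = mid + 1 else hi = mid
    }
    lo
  }
}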
How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data, should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?
The following should be able to get the range. Note that the cache will save you some overhead, because internally zipWithIndex needs to scan the RDD partitions to get the number of elements in each partition.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala> val r2 = r1.zipWithIndex
scala> val r3 = r2.filter(x => x._2 > 2 && x._2 < 4).map(x => x._1)
scala> r3.foreach(println)
d
For those who stumble on this question looking for a Spark 2.x-compatible answer, you can use filterByRange:
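A minimal sketch, assuming a pair RDD that has been sorted by key (sorting gives it a RangePartitioner, which lets filterByRange skip irrelevant partitions):
val pairs = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey()
val sixtyToEighty = pairs.filterByRange(60, 80) // keeps pairs whose key is in [60, 80]
sixtyToEighty.values.collect()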

How to create a lookup table in Groovy?

I want to create a lookup table in Groovy, given a size (in this case the size is 4):
RGGG
RRGG
RRRG
RRRR
That is, in the first iteration there should be one R and size - 1 Gs. As the iteration value increases, the number of Rs should grow and the number of Gs should shrink. So for size 4 I will have 4 lookup values.
How could one do this in Groovy?
You mean like this:
def lut( width, a='R', b='G' ) {
    (1..width).collect { n ->
        ( a * n ) + ( b * ( width - n ) )
    }
}
def table = lut( 4 )
table.each { println it }
prints:
RGGG
RRGG
RRRG
RRRR
Your question doesn't really say what sort of data you are expecting out; this code gives a List of Strings.

Resources