Select from multiple partitions in Cassandra, with a composite partition key?

In Cassandra (CQL), it's possible to query multiple partitions at once, for example:
create table foo(i int, j int, primary key (i));
insert into foo (i, j) values (1, 1);
insert into foo (i, j) values (2, 2);
select * from foo where i in (1, 2);
i | j
---+---
1 | 1
2 | 2
However, if foo has a composite partition key, I'm not sure if it's possible:
create table foo(i int, j int, k int, primary key ((i, j), k));
Some queries I've tried, which CQL rejected, are:
select * from foo where (i = 1 and j = 1) or (i = 2 and j = 2);
select * from foo where (i, j) in ((1, 1), (2, 2));
I've also tried:
select * from foo where i in (1, 2) and j in (1, 2);
but this query is too wide, since it will also return rows where (i = 1, j = 2) or (i = 2, j = 1).

It is possible to emulate the IN clause from the client side using the DSE Java Driver: issue one query per partition and run them asynchronously. See:
https://docs.datastax.com/en/developer/java-driver/4.10/manual/core/async/
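A minimal sketch of that approach, assuming Scala 2.13 with the DataStax Java driver 4.x on the classpath and the foo table above (the session setup and the blocking join here are illustrative, not from the original answer):

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

// One async query per composite partition key, combined on the client.
val session = CqlSession.builder().build()
val stmt = session.prepare("SELECT * FROM foo WHERE i = ? AND j = ?")

val keys = Seq((1, 1), (2, 2))
val futures = keys.map { case (i, j) =>
  // executeAsync returns a CompletionStage[AsyncResultSet]
  session.executeAsync(stmt.bind(Int.box(i), Int.box(j))).toCompletableFuture
}

// Wait for each query and print the first page of rows per partition.
futures.foreach { f =>
  f.join().currentPage().asScala.foreach(row => println(row.getFormattedContents))
}

For large result sets you would page through each AsyncResultSet rather than read only the first page, but the point stands: each partition is fetched with its own single-partition query, which the coordinator can route efficiently.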

Related

Retrieving messages from redis stream

I have a NodeJS application that is using a Redis stream (library 'ioredis') to pass information around. The problem is that when I add a message to a stream and try to retrieve it, I have to descend through several levels of nested arrays:
const message = await redis.xreadgroup('GROUP', orderGroup, orderConsumer, 'COUNT', 1, 'STREAMS', orderStream, '>');
const messageId: string = message[0][1][0][0];
const pMsg: Obj = JSON.parse(JSON.parse(message[0][1][0][1][1]));
This is how I create the stream:
await redis.xgroup('CREATE', orderStream, orderGroup, '0', 'MKSTREAM')
  .catch((err) => {
    console.error(`Group already exists error: ${err}`);
  })
Is this normal? In the Redis doc (https://redis.io/commands/xreadgroup) it shows that the return value is an array with the id of the message at position 0 and the fields at position 1. I feel like I'm missing something...
Here is an example output of XREADGROUP; as you can see, the values are nested five levels deep.
127.0.0.1:6379> XREADGROUP GROUP g1 c1 COUNT 100 STREAMS s1 >
1) 1) "s1"
2) 1) 1) "1608445334963-0"
2) 1) "f1"
2) "v1"
3) "f2"
4) "v2"
2) 1) "1608445335464-0"
2) 1) "f1"
2) "v1"
3) "f2"
4) "v2"
3) 1) "1608445335856-0"
2) 1) "f1"
2) "v1"
3) "f2"
4) "v2"
For more details see https://redis.io/commands/xread
It is normal and expected. XREADGROUP supports reading from multiple stream keys, multiple messages per stream, and multiple field-value pairs per message.
Consider the following example:
> XGROUP CREATE mystream1 mygroup 0 MKSTREAM
OK
> XGROUP CREATE mystream2 mygroup 0 MKSTREAM
OK
> XADD mystream1 * field1 value1 field2 value2
"1608444656005-0"
> XADD mystream1 * field1 value3 field2 value4
"1608444660566-0"
> XADD mystream2 * field3 value5 field4 value6
"1608444665238-0"
> XADD mystream2 * field3 value7 field4 value8
"1608444670070-0"
> XREADGROUP GROUP mygroup yo COUNT 2 STREAMS mystream1 mystream2 > >
1) 1) "mystream1"
2) 1) 1) "1608444656005-0"
2) 1) "field1"
2) "value1"
3) "field2"
4) "value2"
2) 1) "1608444660566-0"
2) 1) "field1"
2) "value3"
3) "field2"
4) "value4"
2) 1) "mystream2"
2) 1) 1) "1608444665238-0"
2) 1) "field3"
2) "value5"
3) "field4"
4) "value6"
2) 1) "1608444670070-0"
2) 1) "field3"
2) "value7"
3) "field4"
4) "value8"
The structure you get back has multiple nested arrays. Using 0-based indexing, as in Node:
[index of the stream key]
[0: the key name or 1: an array for messages]
[index of the message]
[0: the message ID or 1: an array for fields & values]
[even for field name or odd for value]
Where data[0][1] is the root level array (adjust this entry point for your own use).
Variables:
rd: return data (one object per message)
el: element (one message: [id, fields])
sel: sub-element (the message's flat field/value array)
rel: element of sel (a field name or a value)
p: result object for the current message
c: iteration counter, used to decide whether an entry is a field name or a value
// Wrapped in a function so the final return is valid; data is the raw XREADGROUP reply.
function parseMessages(data) {
  var rd = []
  for (var el of data[0][1]) {   // each el is [messageId, fieldsAndValues]
    var sel = el[1]              // flat array: [field, value, field, value, ...]
    var p = {}
    var c = 0
    for (var rel of sel) {
      if (c % 2 == 0) {
        // Even index: rel is a field name; the next entry is its value.
        p[rel] = sel[c + 1]
      }
      c++
    }
    rd.push(p)
  }
  return rd
}

Multithreading in Scala

I was recently given a challenge in school to create a simple program in Scala that does some calculations on a matrix. The catch is that I have to do these calculations using 5 threads. Since I had no prior knowledge of Scala, I am stuck. I searched online, but I did not find how to create the exact number of threads I want. This is the code:
import scala.math

object Test {
  def main(args: Array[String]): Unit = {
    val M1: Seq[Seq[Int]] = List(
      List(1, 2, 3),
      List(4, 5, 6),
      List(7, 8, 9)
    )
    var tempData: Float = 0
    var count: Int = 1
    var finalData: Int = 0
    for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
      count = 1
      tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
      finalData = math.ceil(tempData / count).toInt
      printf("%d ", finalData)
    }
    def calc(i: Int, j: Int): Int = {
      if ((i < 0) || (j < 0) || (i > M1.length - 1))
        return 0
      else {
        count += 1
        return M1(i)(j)
      }
    }
  }
}
I tried this:
for (a <- 0 until 1) {
  val thread = new Thread {
    override def run(): Unit = {
      for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
        count = 1
        tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
        finalData = math.ceil(tempData / count).toInt
        printf("%d ", finalData)
      }
    }
  }
  thread.start
}
but it only executed the same thing 10 times
Here's the original core of the calculation.
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  finalData = math.ceil(tempData / count).toInt
  printf("%d ", finalData)
}
Let's actually build a result array
val R = Array.ofDim[Int](M1.length, M1(0).length)
var tempData: Float = 0
var count: Int = 1
var finalData: Int = 0
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  count = 1
  tempData = M1(i)(j) + calc(i - 1, j) + calc(i, j - 1) + calc(i + 1, j)
  R(i)(j) = math.ceil(tempData / count).toInt
}
Now, that mutable count, modified in one function and referenced in another, is a bit of a code smell. Let's remove it: change calc to return an Option, assemble a list of the things to average, and flatten to keep only the Somes.
val R = Array.ofDim[Int](M1.length, M1(0).length)
for (i <- 0 to M1.length - 1; j <- 0 to M1(0).length - 1) {
  val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
  R(i)(j) = math.ceil(tempList.sum.toDouble / tempList.length).toInt
}

def calc(i: Int, j: Int): Option[Int] = {
  if ((i < 0) || (j < 0) || (i > M1.length - 1))
    None
  else
    Some(M1(i)(j))
}
Next, a side-effecting for is a bit of a code smell too. So in the inner loop, let's produce each row and in the outer loop a list of the rows...
val R = for (i <- 0 to M1.length - 1) yield {
  for (j <- 0 to M1(0).length - 1) yield {
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt // .toDouble avoids integer division
  }
}
Now, we read the Scala API and notice ParSeq and Seq.par, so we'd like to work with map and friends. Let's de-sugar the for comprehensions:
val R = (0 until M1.length).map { i =>
  (0 until M1(0).length).map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }
}
This is our MotionBlurSingleThread. To make it parallel, we simply do
val R = (0 until M1.length).par.map { i =>
  (0 until M1(0).length).par.map { j =>
    val tempList = List(Some(M1(i)(j)), calc(i - 1, j), calc(i, j - 1), calc(i + 1, j)).flatten
    math.ceil(tempList.sum.toDouble / tempList.length).toInt
  }.seq
}.seq
And this is our MotionBlurMultiThread. And it is nicely functional too (no mutable values).
The limit to 5 or 10 threads isn't in the challenge on GitHub, but if you need it, you can limit the degree of parallelism of Scala parallel collections; see the sketch below and related questions.
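A minimal sketch of capping the thread count, assuming Scala 2.12-style parallel collections (the pool size and the stand-in computation are illustrative):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// Run the parallel map on at most 5 worker threads.
val rows = (0 until 3).par
rows.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5)) // 5 = assumed thread cap
val R = rows.map { i =>
  println(s"row $i on ${Thread.currentThread.getName}")
  i * 2 // stand-in for the per-row computation above
}.seq

Setting tasksupport replaces the default global pool for that one collection only, so other parallel collections are unaffected.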
I am not an expert, neither on Scala nor on concurrency.
One Scala approach to concurrency is through the use of actors and messaging; you can read a little about that in Programming in Scala, chapter 30, Actors and Concurrency (the first edition is free but outdated). As I said, that edition is outdated: in the latest version of Scala (2.12) the actors library is no longer included, and the recommendation is to use Akka; you can read about that here.
So, I would not recommend learning Scala, sbt, and Akka just for a challenge, but you can download an Akka quickstart here and customize the given example to your needs; it is nicely explained in the link. Each actor processes one message at a time, and Akka schedules actors onto a pool of threads rather than giving each actor its own dedicated thread. You can read about actors and threads here, specifically the section about state.
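A minimal sketch of that actor style, assuming Akka classic (the akka-actor artifact) is on the classpath; the worker class, names, and message shape are illustrative only:

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical worker: processes one matrix row per message.
class RowWorker extends Actor {
  def receive = {
    case row: List[Int] @unchecked =>
      println(s"row sum ${row.sum} computed on ${Thread.currentThread.getName}")
  }
}

object ActorDemo extends App {
  val system = ActorSystem("demo")
  // Five workers, echoing the challenge's five-thread requirement.
  val workers = (0 until 5).map(i => system.actorOf(Props[RowWorker](), s"worker-$i"))
  val m1 = List(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
  m1.zipWithIndex.foreach { case (row, i) => workers(i % workers.size) ! row }
}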

Accessing rows outside of window while aggregating in Spark dataframe

In short, in the example below I want to pin 'b to be the value in the row that the result will appear in.
Given:
a,b
1,2
4,6
3,7 ==> 'special would be: (1-7 + 4-7 + 3-7) == -13 in this row
val baseWin = Window.partitionBy("something_I_forgot").orderBy("whatever")
val sumWin = baseWin.rowsBetween(-2, 0)
frame.withColumn("special", sum('a - 'b).over(sumWin))
Or another way to think of it is I want to close over the row when I calculate the sum so that I can pass in the value of 'b (in this case 7)
Update:
Here is what I want to accomplish as a UDF. In short, I used a foldLeft.
def mad(field: Column, numPeriods: Integer): Column = {
  val baseWin = Window.partitionBy("exchange", "symbol").orderBy("datetime")
  val win = baseWin.rowsBetween(numPeriods + 1, 0)
  val subFunc: (Seq[Double], Int) => Double = { (input: Seq[Double], numPeriods: Int) =>
    val agg = grizzled.math.stats.mean(input: _*)
    val fooBar = (1.0 / -numPeriods) * input.foldLeft(0.0)((a, b) => a + Math.abs(b - agg))
    fooBar
  }
  val myUdf = udf(subFunc)
  myUdf(collect_list(field.cast(DoubleType)).over(win), lit(numPeriods))
}
If I understood correctly what you're trying to do, I think you can refactor your logic a bit to achieve it. The way you have it right now, you're probably getting "-7" instead of -13.
For the "special" column, (1-7 + 4-7 + 3-7), you can calculate it like (sum(a) - count(*) * b):
dfA.withColumn("special",sum('a).over(win) - count("*").over(win) * 'b)
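A runnable sketch of that refactor, with assumed column names, sample data, and a local SparkSession (none of these are from the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}

val spark = SparkSession.builder().master("local[*]").appName("pin-b").getOrCreate()
import spark.implicits._

// The identity sum(a - b_current) == sum(a) - count(*) * b_current lets the
// window aggregate use the current row's b.
val frame = Seq(("x", 1, 1, 2), ("x", 2, 4, 6), ("x", 3, 3, 7))
  .toDF("part", "ord", "a", "b")
val win = Window.partitionBy("part").orderBy("ord").rowsBetween(-2, 0)

frame.withColumn("special", sum($"a").over(win) - count("*").over(win) * $"b").show()
// Last row: (1 + 4 + 3) - 3 * 7 = -13, matching the expected value above.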

How do I add two column values in a table with CQL?

I need to add two values together to create a third value with CQL. Is there any way to do this? My table has the columns number_of_x and number_of_y, and I am trying to create total. I tried an update on the table with a set command as follows:
UPDATE my_table
SET total = number_of_x + number_of_y ;
When I run that, I get a message back saying:
no viable alternative at input ';'.
Per the docs, assignment is one of:
column_name = value
set_or_list_item = set_or_list_item ( + | - ) ...
map_name = map_name ( + | - ) ...
map_name = map_name ( + | - ) { map_key : map_value, ... }
column_name [ term ] = value
counter_column_name = counter_column_name ( + | - ) integer
And you cannot mix counter and non-counter columns in the same table, so what you are describing is impossible in a single statement. But you can do a read before write:
CREATE TABLE my_table ( total int, x int, y int, key text PRIMARY KEY )
INSERT INTO my_table (key, x, y) VALUES ('CUST_1', 1, 1);
SELECT * FROM my_table WHERE key = 'CUST_1';
key | total | x | y
--------+-------+---+---
CUST_1 | null | 1 | 1
UPDATE my_table SET total = 2 WHERE key = 'CUST_1' IF x = 1 AND y = 1;
[applied]
-----------
True
SELECT * FROM my_table WHERE key = 'CUST_1';
key | total | x | y
--------+-------+---+---
CUST_1 | 2 | 1 | 1
The IF clause will handle concurrency issues if x or y was updated since the SELECT. You can then retry if [applied] is False.
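A hedged sketch of that read-then-conditionally-write loop in Scala, assuming the DataStax Java driver 4.x and the my_table schema above (in real code you would use prepared statements rather than interpolated CQL):

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

@annotation.tailrec
def updateTotal(key: String): Unit = {
  // Read the current x and y.
  val row = session.execute(s"SELECT x, y FROM my_table WHERE key = '$key'").one()
  val (x, y) = (row.getInt("x"), row.getInt("y"))
  // Conditionally write the sum; wasApplied() reflects the [applied] column.
  val applied = session
    .execute(s"UPDATE my_table SET total = ${x + y} WHERE key = '$key' IF x = $x AND y = $y")
    .wasApplied()
  if (!applied) updateTotal(key) // x or y changed since the read; retry
}

updateTotal("CUST_1")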
My recommendation in this scenario, however, is for your application to just read both x and y and do the addition locally, as that will perform MUCH better.
If you really want C* to do the addition for you, there is a sum aggregate function in 2.2+ but it will require updating your schema a little:
CREATE TABLE table_for_aggregate (key text, type text, value int, PRIMARY KEY (key, type));
INSERT INTO table_for_aggregate (key, type, value) VALUES ('CUST_1', 'X', 1);
INSERT INTO table_for_aggregate (key, type, value) VALUES ('CUST_1', 'Y', 1);
SELECT sum(value) from table_for_aggregate WHERE key = 'CUST_1';
system.sum(value)
-------------------
2

Using PartitionBy to split and efficiently compute RDD groups by Key

I've implemented a solution to group an RDD[K, V] by key and to compute data for each group (K, RDD[V]), using partitionBy and Partitioner. Nevertheless, I'm not sure whether it is really efficient and I'd like your point of view.
Here is a sample case: given a list of [K: Int, V: Int], compute the mean of the Vs for each group of K, knowing that it should be distributed and that there may be very many V values. That should give:
List[K, V] => (K, mean(V))
The simple Partitioner class:
class MyPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions = maxKey
  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
The partition code:
val l = List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7))
val rdd = sc.parallelize(l)
val p = rdd.partitionBy(new MyPartitioner(4)).cache()

p.foreachPartition(x => {
  try {
    val r = sc.parallelize(x.toList)
    val id = r.first() // get the K partition id
    val v = r.map(x => x._2)
    println(id._1 + "->" + mean(v))
  } catch {
    case e: UnsupportedOperationException => 0
  }
})
The output is:
1->13, 2->4, 3->7
My questions are:
What really happens when partitionBy is called? (Sorry, I didn't find enough specs on it.)
Is it really efficient to map by partition, knowing that in my production case there would not be many keys (around 50) but very many values per key (around 1 million)?
What is the cost of parallelize(x.toList)? Is it a sensible thing to do? (I need an RDD as input to mean().)
How would you do it yourself?
Regards
Your code should not work. You cannot pass the SparkContext object to the executors. (It's not Serializable.) Also I don't see why you would need to.
To calculate the mean, you need to calculate the sum and the count and take their ratio. The default partitioner will do fine.
import org.apache.spark.rdd.RDD

def meanByKey(rdd: RDD[(Int, Int)]): RDD[(Int, Double)] = {
  case class SumCount(sum: Double, count: Double)
  val sumCounts = rdd.aggregateByKey(SumCount(0.0, 0.0))(
    (sc, v) => SumCount(sc.sum + v, sc.count + 1.0),
    (sc1, sc2) => SumCount(sc1.sum + sc2.sum, sc1.count + sc2.count))
  // mapValues keeps the key, matching the declared RDD[(Int, Double)] return type.
  sumCounts.mapValues(sc => sc.sum / sc.count)
}
This is an efficient single-pass calculation that generalizes well.
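A quick usage sketch with the sample data from the question (note 4.5 is the exact mean for key 2; the question's output showed 4, presumably from integer math):

val rdd = sc.parallelize(List((1, 1), (1, 8), (1, 30), (2, 4), (2, 5), (3, 7)))
meanByKey(rdd).collect().foreach { case (k, m) => println(s"$k -> $m") }
// 1 -> 13.0, 2 -> 4.5, 3 -> 7.0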
