Searching in a collection of tuples - Cassandra

Here is my data:
CREATE TABLE collect_things(k int PRIMARY KEY,n set<frozen<tuple<text, text>>>);
INSERT INTO collect_things (k, n) VALUES(1, {('hello', 'cassandra')});
CREATE INDEX n_index ON collect_things (n);
Now I have to query like this:
SELECT * FROM collect_things WHERE n CONTAINS ('cassandra') ALLOW FILTERING;
Output:
k | n
---+---------
Expected output:
k | n
---+---------
1 | {('hello', 'cassandra')}
I want to fetch my data by the 'cassandra' value. Is this possible?

A tuple inside a collection must be defined as frozen.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields, but Cassandra treats the value of a frozen type as a blob: the entire value must be overwritten.
You must treat a frozen tuple as a single value and you can't match on its components separately. So when querying, provide the complete frozen tuple ('hello', 'cassandra'):
SELECT * FROM collect_things WHERE n CONTAINS ('hello', 'cassandra');
If you have the data:
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'search')}
2 | {('test', 'search')}
Output:
k | n
---+---------------------------------------------
1 | {('hello', 'cassandra'), ('test', 'search')}
Source: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/collection_type_r.html

Related

Spark group-by aggregation in a dataset containing a map

I have a Java POJO:
class MyObj {
    String id;
    Map<KeyObj, ValueObj> mapValues;
    // getters and setters omitted
}
I have a Spark dataset:
Dataset<MyObj> myDs = .....
My dataset has a list of values, but there are duplicate ids. How do I combine the duplicate ids and aggregate the key-value pairs into one Map per id, using Spark groupBy?
Thanks for your help.
So I have:
ID. Map
----------------------------------
1000 [(w -> wer), (D -> dfr)]
1000 [(g -> gde)]
1001 [(k -> khg), (v -> vsa)]
And I need this:
ID. Map
----------------------------------
1000 [(w -> wer), (D -> dfr), (g -> gde)]
1001 [(k -> khg), (v -> vsa)]
You can explode the original maps so that each entry of each map is a row of its own. Then you can group over the id column and restore the maps with map_from_arrays:
myDs.select(col("id"),explode(col("mapValues"))) //1
.groupBy("id")
.agg(collect_list("key").as("keys"), collect_list("value").as("values")) //2
.withColumn("map", map_from_arrays(col("keys"), col("values"))) //3
.drop("keys", "values") //4
.show(false);
1. Explode the maps into single rows. The new columns will be named key and value.
2. When grouping by id, collect all keys and values into arrays, resulting in one array of keys and one array of values per id.
3. Use map_from_arrays to transform the keys and values arrays back into a single map.
4. Drop the intermediate columns.
The result is
+----+------------------------------+
|id |map |
+----+------------------------------+
|1000|[D -> dfr, w -> wer, g -> gde]|
|1001|[v -> vsa, k -> khg] |
+----+------------------------------+
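For comparison, here is a minimal sketch of the same approach in Scala (assuming Spark 2.4+ for map_from_arrays; the dataset and column names simply mirror the Java snippet above):
import org.apache.spark.sql.functions._

// myDs is the Dataset[MyObj] from the question; "mapValues" is its map-typed column
val combined = myDs
  .select(col("id"), explode(col("mapValues")))  // one row per (key, value) map entry
  .groupBy("id")
  .agg(collect_list("key").as("keys"), collect_list("value").as("values"))
  .withColumn("map", map_from_arrays(col("keys"), col("values")))
  .drop("keys", "values")

combined.show(false)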

Spark join dataframes & datasets

I have a DataFrame called Link with a dynamic number of fields/columns in a Row.
Some fields, however, have the structure [ClassName]Id and contain an id.
[ClassName]Id's are always of type String.
I have a couple of Datasets, each of a different type [ClassName].
Each Dataset has at least the fields id (String) and typeName (String), which is always filled with the String name of the [ClassName].
e.g. if I have 3 Datasets of type A, B and C:
Link:
+----+-----+-----+-----+
| id | AId | BId | CId |
+----+-----+-----+-----+
| XX | A01 | B02 | C04 |
| XY | null| B05 | C07 |
A:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| A01 | A | ... | ... |
B:
+-----+----------+-----+-----+
| id | typeName | ... | ... |
+-----+----------+-----+-----+
| B02 | B | ... | ... |
The preferred end result would be the Link DataFrame where each Id is either replaced by, or appended with, a field called [ClassName] containing the original object.
Result:
+----+----------------+----------------+----------------+
| id | A | B | C |
+----+----------------+----------------+----------------+
| XX | A(A01, A, ...) | B(B02, B, ...) | C(C04, C, ...) |
| XY | null | B(B05, B, ...) | C(C07, C, ...) |
Things I've tried
Recursive call on joinWith.
The first call succeeds, returning a tuple/Row where the first element is the original Row and the second the matched [ClassName].
However, the second iteration starts nesting these results.
Trying to 'unnest' these results using map either results in Encoder hell (since the resulting Row is not a fixed type) or the encoding becomes so complex that it results in a Catalyst error.
Join as RDD: can't work this one out yet.
Any ideas are welcome.
So I figured out how I could do what I want. I made some changes for it to work for me.
For reference purposes I will show my steps below; maybe it can be useful for someone in the future.
First I declare a datatype that shares all the properties of A, B, C, etc. that I'm interested in, and make the classes extend from this supertype:
// Note: Scala forbids case-to-case inheritance, so Base cannot itself be a case class if A, B, ...
// are case classes. Declaring it as an abstract class (extending Product, so that the
// ScalaReflection.schemaFor[Base] call below can still inspect its constructor) is one way to keep this compilable.
abstract class Base(val id: String, val typeName: String) extends Product with Serializable
case class A(override val id: String, override val typeName: String) extends Base(id, typeName)
Next I load the Link DataFrame:
val linkDataFrame = spark.read.parquet("[path]")
I want to convert this DataFrame into something joinable. This means creating a placeholder for the joined sources and a way to convert all the single Id fields (AId, BId, etc.) into a Map of source -> id. Spark has a SQL map function that is useful here. We also need to convert the Base class to a StructType for use in the encoder. I tried multiple ways, but couldn't avoid the specific declaration (otherwise I got casting errors):
case class LinkReformatted(ids: Map[String, Long], sources: Map[String, Base])
// Map each column ending in "Id" into pairs of (column name without the "Id" suffix, column value)
val mapper = linkDataFrame.columns.toList
  .filter(_.matches("(?i).*Id$"))
  .flatMap(c => List(lit(c.replaceAll("(?i)Id$", "")), col(c)))
val baseStructType = ScalaReflection.schemaFor[Base].dataType.asInstanceOf[StructType]
All these parts make it possible to create a new DataFrame with the ids all in one field called ids and a placeholder for the sources as an (initially null) Map[String, Base]:
val linkDatasetReformatted = linkDataFrame
  .select(map(mapper: _*).alias("ids"))
  .withColumn("sources", lit(null).cast(MapType(StringType, baseStructType)))
  .as[LinkReformatted]
The next step was to join all the source Datasets (A, B, etc.) to this reformatted Link dataset. A lot of stuff happens in this tail-recursive method:
@tailrec
def recursiveJoinBases(sourceDataset: Dataset[LinkReformatted], datasets: List[Dataset[Base]]): Dataset[LinkReformatted] = datasets match {
  case Nil => sourceDataset // nothing left to join, return it
  case baseDataset :: remainingDatasets =>
    val typeName = baseDataset.head.typeName // extract the type name (every row of a base dataset has the same value)
    val masterName = "source" // something to name the source
    val joinedDataset = sourceDataset.as(masterName) // joining source
      .joinWith(
        baseDataset.as(typeName), // with a base A, B, etc.
        col(s"$typeName.id") === col(s"$masterName.ids.$typeName"), // join on source.ids.[typeName]
        "left_outer"
      )
      .map {
        case (source, base) =>
          // append to, or create, the map of sources
          val newSources = if (source.sources == null) Map(typeName -> base) else source.sources + (typeName -> base)
          source.copy(sources = newSources)
      }
      .as[LinkReformatted]
    recursiveJoinBases(joinedDataset, remainingDatasets)
}
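For illustration, the recursion could be kicked off roughly like this (the dataset names are hypothetical, not from the original code):
// hypothetical usage: assumes aDataset, bDataset and cDataset were loaded as Dataset[Base]
val joined: Dataset[LinkReformatted] =
  recursiveJoinBases(linkDatasetReformatted, List(aDataset, bDataset, cDataset))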
You now end up with a Dataset of LinkReformatted records where, for every typeName -> id entry in the ids field, there is a corresponding typeName -> Base entry in the sources field.
For me that was enough. I could extract everything I needed using a map function over this final Dataset.
I hope this somewhat helps. I understand it's not the exact solution I was asking about, nor is it all very straightforward.

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data> records to process.
For example (let's assume there's only one id, for simplicity):
id event timestamp
-------------------------------
1 A 1
1 B 2
1 C 4
1 D 7
1 E 15
1 F 16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key:value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, RequestData> call(String s) {
        // ...
    }
});
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this were not Spark, I could simply have created an array timeDifferences, stored the differences between adjacent timestamps, and split the array into parts whenever I see a time difference in timeDifferences that is larger than TIMEOUT. (Although, actually, there's no need to explicitly create an array.)
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1, "A", 1.0), (1, "B", 2.0), (1, "C", 15.0)))
  .zipWithIndex.map(x => (x._2, x._1)) // (index, record)
val B = A.map(x => (x._1 - 1, x._2))   // index i now holds the record originally at index i + 1
val C = A.leftOuterJoin(B).map { x =>
  val current = x._2._1
  // timestamp of the following record, or this record's own timestamp if it is the last one
  val nextTimestamp = x._2._2 match {
    case Some(a) => a._3
    case _ => current._3
  }
  (current, nextTimestamp - current._3) // gap to the following record (0 for the last)
}
val group1 = C.filter(x => x._2 <= 5)
val group2 = C.filter(x => x._2 > 5)
So the concept is: you zip with index to create val A (which assigns a sequential Long index to each entry of your RDD), then duplicate the RDD but shifted by one position to create val B (by subtracting 1 from the index), and then use a join to work out the time gap between consecutive entries. Then use filter. This method uses RDDs. An easier way is to collect them onto the master and use a map or zipped mapping, but that would be plain Scala rather than Spark, I guess.
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i - 1, e) })

  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })

  // collecting (to driver memory!) cutoff points - timestamps of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({
    case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp
  }).distinct().collect()

  // going back to the original input, grouping by each event's nearest cutoff point
  // (i.e. the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}

case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer: joining the input with itself with an offset of 1 to calculate the followingGap for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.
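For illustration, this is roughly how it could be exercised with the sample data from the question (assumes a SparkContext named sc):
// hypothetical usage with the events from the question
val events = sc.parallelize(Seq(
  Event(1, "A"), Event(2, "B"), Event(4, "C"), Event(7, "D"), Event(15, "E"), Event(16, "F")))
val windows = splitToTimeWindows(events, timeoutBetweenWindows = 5)
windows.collect().foreach(w => println(w.map(_.data).mkString(", ")))
// expected: one window containing A, B, C, D and another containing E, F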

Cassandra update max(a,b) function

How can I make a request of this type in Cassandra?
UPDATE my_table SET my_column1 = MAX(my_column1, 100) and my_column2 = my_column2 + 10;
A max() function does not exist. Could I do this using Apache Spark?
Thanks!
MAX is idempotent and seems simple to do in this case; the problem is that C* is a general-purpose database and needs to handle some edge cases. A particular issue is deletes and TTLs, since as old data goes away the database still needs to maintain the max.
A couple of ways you can do this: either maintain a value that you update atomically on inserts, or keep all the inserted values around in order, so that as things are deleted or TTL out the older ones are still there to take their place (at the obvious disk cost).
CREATE TABLE my_table_max (
key text,
max int static,
deletableMax int,
PRIMARY KEY (key, deletableMax)
) WITH CLUSTERING ORDER BY (deletableMax DESC);
Then atomically update your max, or for the deletable implementation insert the new value:
BEGIN BATCH
INSERT INTO my_table_max (key, max) VALUES ('test', 1) IF NOT EXISTS;
INSERT INTO my_table_max (key, deletableMax) VALUES ('test', 1);
APPLY BATCH;
BEGIN BATCH
UPDATE my_table_max SET max = 5 WHERE key='test' IF max = 1;
INSERT INTO my_table_max (key, deletableMax) VALUES ('test', 5);
APPLY BATCH;
Then just querying the top 1 row gives you the max:
select * from my_table_max limit 1;
key | deletableMax | max
------+--------------+-----
test | 5 | 5
The difference between the two would be seen after a delete:
delete from my_table_max WHERE key = 'test' and deletablemax = 5;
cqlsh:test_ks> select * from my_table_max limit 1;
key | deletablemax | max
------+--------------+-----
test | 1 | 5
Since it keeps track of all the values in order, the older value is kept.
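As for the Apache Spark part of the question: you can read the table with Spark, compute the new values, and write them back. A rough sketch in Scala, assuming a SparkSession named spark and the DataStax spark-cassandra-connector on the classpath (the keyspace name my_ks is illustrative); note that this read-modify-write is not atomic, unlike the LWT approach above:
import org.apache.spark.sql.functions._

// read my_table, apply max(my_column1, 100) and my_column2 + 10, then write the rows back
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

val updated = df
  .withColumn("my_column1", greatest(col("my_column1"), lit(100)))
  .withColumn("my_column2", col("my_column2") + 10)

updated.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .mode("append")
  .save()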

How to create a lookup table in Groovy?

I want to create a lookup table in Groovy, given a size (in this case the size is 4):
RGGG
RRGG
RRRG
RRRR
That is, in the first iteration there should be only one R and size-1 G's. As the iteration number increases, the number of R's should grow and the number of G's should shrink accordingly. So for size 4 I will have 4 lookup values.
How could one do this in Groovy?
You mean like this:
def lut( width, a='R', b='G' ) {
    (1..width).collect { n ->
        ( a * n ) + ( b * ( width - n ) )
    }
}
def table = lut( 4 )
table.each { println it }
prints:
RGGG
RRGG
RRRG
RRRR
Your question doesn't really say what sort of data you are expecting as output; this code gives a List of Strings.
