How to keep track of updated fields of an RDD in Spark? - apache-spark

There is an issue I am dealing with in keeping track of updated fields in a Spark RDD.
Assume that we have an RDD like this:
(1,2)
(2,10)
(5,9)
(3,8)
(8,15)
Based on some conditions, the value of some keys may change. For example, the value of key=2 changes from 10 to 11. Then any row whose value equals the key of an updated row should have its value changed as well. For example, key=1 has value 2, and 2 is a key in another row; because the value of key=2 changes to 11, the value of key=1 should also change to 11. After a few such updates (in this example, key=2 changed to 11 and key=8 changed to 7), the RDD looks like this:
(1,11)
(2,11)
(5,9)
(3,7)
(8,7)
Is there any efficient way to implement this?
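For illustration only, a single pass of the propagation rule described above could be sketched as a join of the RDD with itself (the names are hypothetical, `sc` is an existing SparkContext, and one pass only follows one level of indirection):
// The RDD after the direct updates (key=2 -> 11, key=8 -> 7) have been applied:
val rdd = sc.parallelize(Seq((1, 2), (2, 11), (5, 9), (3, 8), (8, 7)))
// Re-key each pair by its value and look that value up as a key.
val byValue = rdd.map { case (k, v) => (v, k) }
val resolved = byValue.leftOuterJoin(rdd).map {
  case (_, (k, Some(target))) => (k, target) // the value was itself a key: follow it
  case (v, (k, None))         => (k, v)      // the value is not a key: keep it
}
resolved.collect().foreach(println) // contains (1,11), (2,11), (5,9), (3,7), (8,7)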

Assuming you are talking about a DStream (of RDDs), you can use the updateStateByKey method.
To use updateStateByKey, you need to provide a function update(events, oldState) that takes in the events that arrived for a key and its previous state, and returns a newState to store for it.
events: a list of events that arrived in the current batch (may be empty).
oldState: an optional state object, stored within an Option; it might be missing if there was no previous state for the key.
newState: returned by the function, is also an Option.
The result of updateStateByKey() will be a new DStream that contains an RDD of (key, state) pairs.
Basic Example:
def myUpdate(values: Seq[Long], state: Option[Long]): Option[Long] = {
  // select the new value, e.g. keep a running sum of all values seen for this key
  Some(state.getOrElse(0L) + values.sum)
}
myDStream.updateStateByKey(myUpdate _)
Background taken from the book "Learning Spark" (O'Reilly).

Related

How to add column to a DataFrame where value is fetched from a map with other column from row as key

I'm new to Spark, and trying to figure out how I can add a column to a DataFrame whose value is fetched from a HashMap, using another value on the same row as the key.
For example, I have a map defined as follows:
var myMap: Map<Integer,Integer> = generateMap();
I want to add a new column to my DataFrame whose value is fetched from this map, using a current column value as the key. A solution might look like this:
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", lit(myMap.get(col("EXISTING_COLUMN"))))
My issue with this code is that the col function doesn't return an Int, which is the type of the keys in my HashMap.
Any suggestions?
I would create a DataFrame from the map and then do a join operation. It should be faster, and the result can be reused.
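A rough sketch of that approach (assuming myMap is a Scala Map[Int, Int], dataFrame is the input DataFrame, and an active SparkSession named spark is in scope, so this is illustrative rather than a drop-in answer):
import org.apache.spark.sql.functions.broadcast
import spark.implicits._
// Turn the lookup map into a two-column DataFrame.
val mapDF = myMap.toSeq.toDF("EXISTING_COLUMN", "NEW_COLUMN")
// Broadcasting keeps the small lookup table on every executor; rows whose
// key is not in the map end up with a null NEW_COLUMN.
val newDataFrame = dataFrame.join(broadcast(mapDF), Seq("EXISTING_COLUMN"), "left")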
A UDF (user-defined function) can also be used, but UDFs are black boxes to Catalyst, so I would be prudent about using them. Depending on where the content of the map lives, it may also be complicated to pass it to a UDF.
As of the next version of the Kotlin API for Apache Spark, you will be able to simply create a UDF which will be usable in almost this way:
val mapUDF by udf { input: Int -> myMap[input] }
dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))
You need to use a UDF:
import org.apache.spark.sql.functions.{col, udf}
val mapUDF = udf((i: Int) => myMap.getOrElse(i, 0))
val newDataFrame = dataFrame.withColumn("NEW_COLUMN", mapUDF(col("EXISTING_COLUMN")))

SPARK Combining Neighbouring Records in a text file

I'm very new to SPARK.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to being read with SPARK. The format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a SPARK RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1,record2,record3]. Each object would contain the data from their RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to SPARK.
Is there a sensible way to do this using SPARK (and if so, what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to solve outside of SPARK?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, sum}
import org.apache.spark.sql.types.LongType
import spark.implicits._ // assumes a SparkSession named `spark` is in scope
val df = Seq(
  ("RECORD", "record1identifier"),
  ("SUBRECORD", "value1"),
  ("SUBRECORD2", "value2"),
  ("RECORD", "record2identifier"),
  ("RECORD", "record3identifier"),
  ("SUBRECORD", "value3"),
  ("SUBRECORD", "value4"),
  ("SUBRECORD", "value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
// Flag the rows that start a new record, then take a running sum of that flag
// so every row carries the id of the record it belongs to.
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
  .withColumn("recid", sum($"newrec").over(win))
  .select($"recid", $"record", $"value")
val recs = recids.where($"record" === "RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet first uses zipWithIndex to identify each row in order, then adds a boolean column that is true every time a "RECORD" row is encountered and false otherwise. We then cast that boolean to a long and compute a running sum over it, which has the neat side effect of labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+

Inserting a value on a frozen set in cassandra 3

I am currently working on a Cassandra 3 database in which one of the tables has a column defined like this:
column_name map<int, frozen <set<int>>>
When I have to change the value of a complete set given a map key x, I just do this:
UPDATE keyspace.table SET column_name[x] = {1,2,3,4,5} WHERE ...
The thing is that I need to insert a value into a set given a key. I tried this:
UPDATE keyspace.table SET column_name[x] = column_name[x] + {1} WHERE ...
But it returns:
SyntaxException: line 1:41 no viable alternative at input '[' (... SET column_name[x] = [column_name][...)
What am I doing wrong? Does anyone know how to insert data the way I need?
Since the values of the map are frozen sets, you can't use an update like this.
A frozen value serializes multiple components into a single value. Non-frozen types allow updates to individual fields. Cassandra treats the value of a frozen type as a blob. The entire value must be overwritten.
You have to read the full map, get the set for the key, append the new item, and then write the whole set back.
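For illustration (the key x, the WHERE clause, and the set contents are placeholders from the question), the read-modify-write amounts to:
-- 1. Read the map and pick out the set for key x client-side:
SELECT column_name FROM keyspace.table WHERE ...;
-- 2. Append the new element to that set in your application, then overwrite
--    the whole frozen set for key x in one statement:
UPDATE keyspace.table SET column_name[x] = {1, 2, 3, 4, 5, 6} WHERE ...;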

Ordered union on spark RDDs

I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union on the result RDD and keyValueRDD_2 RDD and print the output of the unionResultRDD, the result and keyValueRDD_2 are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. However, when I do a foldByKey operation which combines the values of same key into a single key-value pair, the sorted order is destroyed. I need to do a fold by key operation in order to save the result as the original key-record pair. Is there an alternate rdd function that could be used to achieve this?
Any tips or suggestions would be quite useful.
Thanks
The union method just puts two RDDs one after the other, unless they have the same partitioner, in which case it merges their partitions.
What you want to do is impossible.
When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted. Each partition now covers a range of keys, but the partitions are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
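If the end goal is simply a fully sorted key-record output, a minimal sketch of what that implies (assuming the same `input` RDD of fixed-width text lines as in the question) is to sort the key-record pairs directly:
// Keep the whole line as the value and sort by the 10-byte key.
val keyRecord = input.map(x => (x.substring(0, 10), x))
val sortedRecords = keyRecord.sortByKey(ascending = true, numPartitions = 1)
The prefix trick from the Databricks write-up is an internal, per-partition optimization; it does not change what has to happen at the level of the RDD API.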

Storing time ranges in cassandra

I'm looking for a good way to store data associated with a time range, in order to be able to efficiently retrieve it later.
Each entry of data can be simplified as (start time, end time, value). I will need to later retrieve all the entries which fall inside a (x, y) range. In SQL, the query would be something like
SELECT value FROM data WHERE starttime <= x AND endtime >= y
Can you suggest a structure for the data in Cassandra which would allow me to perform such queries efficiently?
This is an oddly difficult thing to model efficiently.
I think using Cassandra's secondary indexes (along with a dummy indexed value, which is unfortunately still needed at the moment) is your best option. You'll need to use one row per event with at least three columns: 'start', 'end', and 'dummy'. Create a secondary index on each of these. The first two can be LongType and the last can be BytesType. See this post on using secondary indexes for more details. Since you have to use an EQ expression on at least one column for a secondary index query (the unfortunate requirement I mentioned), the EQ will be on 'dummy', which can always be set to 0. (This means that the EQ index expression will match every row and essentially be a no-op.) You can store the rest of the event data in the row alongside start, end, and dummy.
In pycassa, a Python Cassandra client, your query would look like this:
from pycassa.index import *
start_time = 12312312000
end_time = 12312312300
start_exp = create_index_expression('start', start_time, GT)
end_exp = create_index_expression('end', end_time, LT)
dummy_exp = create_index_expression('dummy', 0, EQ)
clause = create_index_clause([start_exp, end_exp, dummy_exp], count=1000)
for result in entries.get_indexed_slices(clause):  # entries is the ColumnFamily
    pass  # do stuff with each matching row here
There should be something similar in other clients.
The alternative that I considered first involved OrderPreservingPartitioner, which is almost always a Bad Thing. For the index, you would use the start time as the row key and the finish time as the column name. You could then perform a range slice with start_key=start_time and column_finish=finish_time. This would scan every row after the start time and only return those with columns before the finish_time. Not very efficient, and you have to do a big multiget, etc. The built-in secondary index approach is better because nodes will only index local data and most of the boilerplate indexing code is handled for you.
