merge spark dStream with variable to saveToCassandra() - apache-spark

I have a DStream[(String, Int)] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.
The Cassandra table is created as:
CREATE TABLE wordcounts (
  step int,
  word text,
  count int,
  primary key (step, word)
);
When trying to write the stream to the table...
stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))
... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.
How can I prepend the step index to the stream in order to write the three columns together?
I'm using Spark 2.0.0, Scala 2.11.8, Cassandra 3.4.0 and spark-cassandra-connector 2.0.0-M3.

As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCount DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].
The tricky part in this question is how to bring a local counter, that is by definition only known in the driver, up to the level of the DStream.
To do that, we need to do two things: "lift" the counter to a distributed level (in Spark, that means an RDD or DataFrame) and join that value with the existing DStream data.
Departing from the classic Streaming word count example:
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
We add a local var to hold the count of the microbatches:
@transient var batchCount = 0
It's declared transient, so that Spark doesn't try to close over its value when we declare transformations that use it.
Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:
val batchWordCounts = wordCounts.transform { rdd =>
  batchCount = batchCount + 1
  val localCount = sparkContext.parallelize(Seq(batchCount))
  rdd.cartesian(localCount).map { case ((word, count), batch) => (batch, word, count) }
}
(Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized; it would therefore look like the counter never increased when looking at the DStream data.)
Finally, now that the data is in the right shape, save it to Cassandra:
batchWordCounts.saveToCassandra("keyspace", "wordcounts")
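Since the save maps tuple positions to columns, it can also help to name the columns explicitly. A small sketch, assuming the same keyspace/table as above and the usual connector imports:
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
batchWordCounts.saveToCassandra("keyspace", "wordcounts", SomeColumns("step", "word", "count"))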

The updateStateByKey function is provided by Spark for global state handling.
For this case it could look something like the following:
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount: Int = runningCount.getOrElse(0) + 1
  Some(newCount)
}
val step = stream.updateStateByKey(updateFunction _)
stream.join(step).map { case (key, (count, step)) => (step, key, count) }
  .saveToCassandra("keyspace", "wordcounts")
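Note that updateStateByKey requires a checkpoint directory to be set on the StreamingContext. A minimal sketch (ssc and the path are just placeholders):
// updateStateByKey keeps state across batches, which requires checkpointing to be enabled
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // placeholder path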

Since you are trying to save the RDD to an existing Cassandra table, you need to include all the primary key column values in the RDD.
Alternatively, you can use the methods below to save the RDD to a new table:
saveAsCassandraTable or saveAsCassandraTableEx
For more info, look into this.
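As a rough sketch of the first of those (the keyspace and table names are placeholders; the connector infers the new table's schema from the RDD element type):
import com.datastax.spark.connector._
rdd.saveAsCassandraTable("keyspace", "wordcounts_new") // creates the table, then writes the rows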

Related

Finding multiple values using map-reduce

Let's say I have an RDD with the following schema:
(ID, VALUE_1, VALUE_2)
What I would like to do is, somehow using map/reduce, end up with something like:
(ID, SUM(VALUE_1), SUM(VALUE_2), rdd_size), where SUM(VALUE_1) and SUM(VALUE_2) are the sums of VALUE_1 and VALUE_2 over the whole RDD and rdd_size is the number of rows in my RDD.
So far, using reduce, I can easily find one of those three, but I can't seem to end up with the desired output schema. Any ideas?
Please note this is in Scala, but you could do something similar in PySpark as well.
The following code creates the RDD the way you have shown:
scala> val list = List((1,2,3),(1,3,4),(1,10,23),(2,3,5),(2,55,6))
list: List[(Int, Int, Int)] = List((1,2,3), (1,3,4), (1,10,23), (2,3,5), (2,55,6))
scala> val rdd = sc.parallelize(list)
rdd: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:26
Map this RDD to (key, value) pairs where the key is the first element of the tuple (the ID in your case) and the value is a Tuple3 whose first element is hardcoded to 1 and whose remaining two elements are copied from the original RDD (VALUE_1 and VALUE_2 in your example). The collect and println calls are included below only for understanding; they are not advisable when you run this with real data.
scala> val rdd1 = rdd.map(x => (x._1,(1,x._2,x._3)))
rdd1: org.apache.spark.rdd.RDD[(Int, (Int, Int, Int))] = MapPartitionsRDD[8] at map at <console>:25
scala> rdd1.collect.foreach(println)
(1,(1,2,3))
(1,(1,3,4))
(1,(1,10,23))
(2,(1,3,5))
(2,(1,55,6))
groupByKey is not required for any of this; it is only shown to display what the grouped RDD looks like.
scala> rdd1.groupByKey().collect.foreach(println)
(1,CompactBuffer((1,2,3), (1,3,4), (1,10,23)))
(2,CompactBuffer((1,3,5), (1,55,6)))
Run reduceByKey to arrive at the output you are expecting.
You can use the groupByKey output above to sum VALUE_1 and VALUE_2 by hand and confirm that the results of reduceByKey are correct.
scala> rdd1.reduceByKey((a,b) => (a._1+b._1,a._2+b._2,a._3+b._3)).collect.foreach(println)
(1,(3,15,30))
(2,(2,58,11))
In the above output:
The key is the ID from your example.
The value is a Tuple3 whose first element is the number of records in that group, whose second element is SUM(VALUE_1), and whose third element is SUM(VALUE_2).
You can rearrange the fields if you want the number of records (rdd_size in your example) to be the last element of the tuple.
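For instance, to end up with exactly the (ID, SUM(VALUE_1), SUM(VALUE_2), rdd_size) shape from the question, a final map can reorder the fields (the output follows from the sample data above):
scala> rdd1.reduceByKey((a,b) => (a._1+b._1,a._2+b._2,a._3+b._3)).map{ case (id, (n, s1, s2)) => (id, s1, s2, n) }.collect.foreach(println)
(1,15,30,3)
(2,58,11,2)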

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the rest of the elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (stores one element per key; for multiple elements with the same key, any one of them is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (stores the remaining elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together, so that for each key there is a sequence of elements. However, the performance of groupByKey() is not good, because the element values are very large, which leads to a very large shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique one and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key
val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique RDD
There is also some room for optimization.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we repartition the inputRdd by key from the start, we won't need any additional shuffles; using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 200))
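A sketch of how inputRdd2 could then be reused (adding a cache so both the reduceByKey and the join read the already-partitioned data):
inputRdd2.cache()
val uniqueRdd2 = inputRdd2.reduceByKey((x, y) => x)   // reuses the existing hash partitioner, no extra shuffle
val duplicateRdd2 = inputRdd2
  .join(uniqueRdd2)                                   // co-partitioned, so the join does not reshuffle
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) }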
Original Solution:
You can try the following approach:
First count the number of occurrences of each pair, and then split it into the two RDDs.
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache
val uniqueRdd = countRdd.map(_._1)
val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Please use combineByKey, which results in the combiner being used on the map task and hence reduces the shuffled data (see the sketch after the list below).
The combiner logic depends on your business logic.
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less data from the map task by using a combiner.
2. Send aggregated, serialized objects from the map side to the reduce side.
3. Use combine input formats to enhance the efficiency of combiners.
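As a rough illustration of point 1, here is a combineByKey sketch that builds a Set of distinct values per key, so duplicates are dropped by the map-side combiner before the shuffle (the sample RDD and its types are just for illustration, and sc is assumed to be the SparkContext):
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("k1", 1), ("k1", 1), ("k2", 2)))

// Distinct values per key; duplicates are removed on the map side,
// so less data crosses the shuffle than with groupByKey.
val distinctPerKey: RDD[(String, Set[Int])] = pairs.combineByKey(
  (v: Int) => Set(v),                    // createCombiner: first value seen in a partition
  (acc: Set[Int], v: Int) => acc + v,    // mergeValue: within a partition
  (a: Set[Int], b: Set[Int]) => a ++ b   // mergeCombiners: across partitions
)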

Spark: Broadcasting a multimap

I have a fairly small lookup file that I need to broadcast for efficiency.
If the key value pairs are unique, then you can use the following code to distribute the file as a hashmap across worker nodes.
val index_file = sc.textFile("reference.txt").map { line => ( (line.split("\t"))(1), (line.split("\t"))(0)) }
val index_map = index_file.collectAsMap()
sc.broadcast(index_map)
Unfortunately, the file has several entries for a given key. Is there any way to distribute this multimap variable? Reading the documentation, it looks like collectAsMap does not support a multimap.
val mmap = new collection.mutable.HashMap[String, collection.mutable.Set[Int]]() with collection.mutable.MultiMap[String, Int]
val index_map = sc.textFile("reference.txt").map {
  case line =>
    val key = (line.split("\t"))(1)
    val value = (line.split("\t"))(0).toInt
    mmap.addBinding(key, value)
}
Now how do I broadcast index_map?
You can broadcast the map using sc.broadcast(mmap), but that simply distributes a copy of the map to your worker nodes, so that the data is accessible on your worker nodes.
From your code, it looks like what you really want is to update the map from the workers, but you cannot do that. The workers do not have the same instance of the map, so they will each update their own map. What you can do instead is split the text file into key-value pairs (in parallel), then collect them and put them into the map:
val mmap = new collection.mutable.HashMap[String, collection.mutable.Set[Int]]() with collection.mutable.MultiMap[String, Int]
val index_map = sc.textFile("reference.txt")
  .collect
  .map(line => {
    val key = (line.split("\t"))(1)
    val value = (line.split("\t"))(0).toInt
    mmap.addBinding(key, value)
  })
To use Spark for a task where data will fit in a map seems somewhat overkill to me, though ;)
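For completeness, if the goal is simply a read-only multimap on every worker, one possible sketch is to build an immutable Map[String, Seq[Int]] with Spark and broadcast that (the file name and column order follow the question):
val indexMap: Map[String, Seq[Int]] = sc.textFile("reference.txt")
  .map { line =>
    val fields = line.split("\t")
    (fields(1), fields(0).toInt)
  }
  .groupByKey()
  .mapValues(_.toSeq)
  .collectAsMap()
  .toMap

val indexMapBroadcast = sc.broadcast(indexMap)
// On the workers: indexMapBroadcast.value.getOrElse(someKey, Seq.empty)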

Vertically partition an RDD and write to separate locations

In Spark 1.5+, how can I write each column of an "n"-tuple RDD to a different location?
For example, if I had an RDD[(String, String)], I would like the first column to be written to s3://bucket/first-col and the second to s3://bucket/second-col.
I could do the following:
val pairRDD: RDD[(String, String)]
val cachedRDD = pairRDD.cache()
cachedRDD.map(_._1).saveAsTextFile("s3://bucket/first-col")
cachedRDD.map(_._2).saveAsTextFile("s3://bucket/second-col")
But this is far from ideal, since it needs two passes over the RDD.
One way you can go about doing this is by converting the tuples into lists, then using map to create a list of RDDs and performing a save on each, as follows:
val fileNames: List[String]       // one output path per column
val input: RDD[(String, String)]  // could be a tuple of any size
val numCols = 2                   // arity of the tuple
val columnIDs = (0 until numCols) // list indices are 0-based
val unzippedValues = input.map(_.productIterator.toList).persist() // converts each tuple into a list
val columnRDDs = columnIDs.map(i => unzippedValues.map(_(i)))
columnRDDs.zip(fileNames).foreach { case (colRdd, fName) => colRdd.saveAsTextFile(fName) }
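For the concrete RDD[(String, String)] from the question, that boils down to something like this (paths as in the question):
val fileNames = List("s3://bucket/first-col", "s3://bucket/second-col")
val unzipped = pairRDD.map(_.productIterator.toList.map(_.toString)).persist()
val columnRDDs = (0 until 2).map(i => unzipped.map(_(i)))
columnRDDs.zip(fileNames).foreach { case (colRdd, fName) => colRdd.saveAsTextFile(fName) }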

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = rdd.keys.distinct.collect
ids.foreach({ id =>
val dataRows = mapRdd.filter(_._1 == id).values
dataRows.saveAsTextFile(id)
})
I also tried groupByKey or reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of the combined data rows for that ID number, separated by new lines. I want to iterate through the RDD only once using foreach to save the data, but it can't give the values as an RDD:
groupedRdd.foreach({ tup =>
  val data = sc.parallelize(List(tup._2)) // nested RDD does not work
  data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to
Write to multiple outputs by key Spark - one Spark job
Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}
I just saw a similar answer above, but actually we don't need customized partitioners. MultipleTextOutputFormat will create a file for each key; it is OK that multiple records with the same key fall into the same partition.
new HashPartitioner(num), where num is the number of partitions you want. In case you have a large number of distinct keys, you can set this number high; that way each partition will not open too many HDFS file handles.
You can directly call saveAsTextFile on the grouped RDD; it will save the data based on partitions. That is, if you have 4 distinct IDs and you set the grouped RDD's number of partitions to 4, then Spark stores each partition's data in one file (so you can have one file per ID), and you can even see the data as iterables of each ID in the filesystem.
This will save the data per user ID:
val mapRdd = rdd.map { x => (x.split("\\t+")(1), x) }
  .groupByKey(numPartitions)
  .saveAsObjectFile("file")
If you need to retrieve the data again based on user ID, you can do something like:
val userIdLookupTable = sc.objectFile("file").cache() //could use persist() if data is to big for memory
val data = userIdLookupTable.lookup(id) //note this returns a sequence, in this case you can just get the first one
Note that there is no particular reason to save to a file in this case; I just did it since the OP asked for it. That being said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than a filter approach for accessing IDs, but if you're willing to work off a pull request to Spark, you can check out this answer for a faster approach.
