Spark: Broadcasting a multimap - apache-spark

I have a fairly small lookup file that I need to broadcast for efficiency.
If the key-value pairs are unique, you can use the following code to distribute the file as a hash map across the worker nodes.
val index_file = sc.textFile("reference.txt").map { line => ( (line.split("\t"))(1), (line.split("\t"))(0)) }
val index_map = index_file.collectAsMap()
sc.broadcast(index_map)
Unfortunately, the file has several entries for a given key. Is there any way to distribute this multimap variable? Reading the documentation, it looks like collectAsMap does not support a multimap.
val mmap = new collection.mutable.HashMap[String, collection.mutable.Set[Int]]() with collection.mutable.MultiMap[String, Int]
val index_map = sc.textFile("reference.txt").map {
case line =>
val key = (line.split("\t"))(1)
val value = (line.split("\t"))(0).toInt
mmap.addBinding(key, value)
}
Now how do I broadcast index_map?

You can broadcast the map using sc.broadcast(mmap), but that simply distributes a copy of the map to your worker nodes, so that the data is accessible on your worker nodes.
From your code, it looks like what you really want is to update the map from the workers, but you cannot do that. The workers do not have the same instance of the map, so they will each update their own map. What you can do instead is split the text file into key-value pairs (in parallel), then collect them and put them into the map:
val mmap = new collection.mutable.HashMap[String, collection.mutable.Set[Int]]() with collection.mutable.MultiMap[String, Int]
// Bring the lines to the driver, then fill the multimap locally
sc.textFile("reference.txt")
  .collect
  .foreach { line =>
    val fields = line.split("\t") // split once: value in column 0, key in column 1
    mmap.addBinding(fields(1), fields(0).toInt)
  }
To use Spark for a task where data will fit in a map seems somewhat overkill to me, though ;)
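If you would rather not broadcast a mutable structure, here is a minimal sketch of building an immutable multimap (a Map from key to a Set of values) from the collected lines. It is plain Scala and assumes the same column order as the question (value first, key second); the names MultiMapSketch and buildMultiMap are made up for illustration.

```scala
object MultiMapSketch {
  // Parse "value<TAB>key" lines into an immutable multimap: key -> set of values.
  // The column order (value first, key second) follows the question's code.
  def buildMultiMap(lines: Seq[String]): Map[String, Set[Int]] =
    lines
      .map { line =>
        val fields = line.split("\t")
        (fields(1), fields(0).toInt)
      }
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).toSet }
}
```

With Spark, the input would presumably be sc.textFile("reference.txt").collect().toSeq, and the resulting Map is what you would pass to sc.broadcast.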

Related

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
// Do something, some custom logic
...
val y = create new RDD from this RDD element x
...
or something else
// UNION to TempView
...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD to an RDD itself of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set based cases, but here I want to append when I meet a condition immediately for the sake of simplicity. I.e. at the level of the item entry in the RDD.
Now, before getting a -1 as SO 41356419, I am only suggesting this as I have a specific use case and to mutate a TempView in SPARK SQL, I do need such an approach - at least that is my thinking. Not a typical SPARK USE CASE, but that is what we are / I am facing.
Thanks in advance
First of all, you can't create an RDD or DF inside foreach() (or any other function) of another RDD or DF/DS. But you can get the nth element from an RDD and create a new RDD with that single element.
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val n = 534 // This is the input value (index of the element we're interested in)
sc.setLogLevel("ERROR")
// Creating dummy rdd
val rdd = sc.parallelize(0 to 999).cache()
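// zipWithIndex yields (element, index) pairs: filter on the index, keep the element
val singletonRdd = rdd.zipWithIndex().filter(pair => pair._2 == n).map(pair => pair._1)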
}
}
Hope that helps!
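For what it's worth, the same zip-with-index idea can be tried on an ordinary Scala collection without a cluster. This is only a sketch: NthElementSketch and nth are hypothetical names, and on an RDD the index produced by zipWithIndex is a Long rather than an Int.

```scala
object NthElementSketch {
  // Return the element at position n as a one-element collection
  // (empty if n is out of range).
  def nth[A](items: Seq[A], n: Int): Seq[A] =
    items.zipWithIndex.collect { case (item, index) if index == n => item }
}
```

The pair produced by zipWithIndex is always (element, index), so the position is the second component.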

merge spark dStream with variable to saveToCassandra()

I have a DStream[(String, Int)] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.
The Cassandra table was created as:
CREATE TABLE wordcounts (
step int,
word text,
count int,
primary key (step, word)
);
When trying to write the stream to the table...
stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))
... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.
How can I prepend the step index to the stream in order to write the three columns together?
I'm using spark 2.0.0, scala 2.11.8, cassandra 3.4.0 and spark-cassandra-connector 2.0.0-M3.
As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCount DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].
The tricky part in this question is how to bring a local counter, that is by definition only known in the driver, up to the level of the DStream.
To do that, we need to do two things: "lift" the counter to a distributed level (in Spark, we mean "RDD" or "DataFrame") and join that value with the existing DStream data.
Departing from the classic Streaming word count example:
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
We add a local var to hold the count of the microbatches:
@transient var batchCount = 0
It's declared transient, so that Spark doesn't try to close over its value when we declare transformations that use it.
Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:
val batchWordCounts = wordCounts.transform{ rdd =>
batchCount = batchCount + 1
val localCount = sparkContext.parallelize(Seq(batchCount))
rdd.cartesian(localCount).map{case ((word, count), batch) => (batch, word, count)}
}
(Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized. Therefore, it would look like the counter never increased when looking at the DStream data.)
Finally, now that the data is in the right shape, save it to Cassandra:
batchWordCounts.saveToCassandra("keyspace", "wordcounts")
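To see the shape of the result without a streaming context, here is a rough plain-Scala sketch. It models each microbatch as a Seq of (word, count) pairs and tags every pair with the batch number, which is what the cartesian product with a single-element RDD amounts to. BatchStepSketch and tagBatches are hypothetical names, not part of Spark or the connector.

```scala
object BatchStepSketch {
  // Tag every (word, count) pair of each micro-batch with a 1-based batch number.
  def tagBatches(batches: Seq[Seq[(String, Int)]]): Seq[Seq[(Int, String, Int)]] = {
    var batchCount = 0 // plays the role of the driver-side counter
    batches.map { batch =>
      batchCount += 1
      val step = batchCount // value fixed for this batch
      batch.map { case (word, count) => (step, word, count) }
    }
  }
}
```

Each output tuple has the (step, word, count) layout that the wordcounts table expects.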
Spark provides the updateStateByKey function for handling global state.
For this case, it could look something like the following:
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount: Int = runningCount.getOrElse(0) + 1
Some(newCount)
}
val step = stream.updateStateByKey(updateFunction _)
stream.join(step).map{case (key,(count, step)) => (step,key,count)}
.saveToCassandra("keyspace", "wordcounts")
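To get a feel for how the state evolves, the update function above can be replayed outside Spark. The sketch below is plain Scala and only models a single key; StepStateSketch and replay are hypothetical names. Keep in mind that this state is tracked per key, so as far as I can tell a word first seen in a later batch would start counting from 1 at that point rather than sharing a global step.

```scala
object StepStateSketch {
  // Same shape as the update function above: ignore the new values, bump the counter.
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(runningCount.getOrElse(0) + 1)

  // Replay the per-batch values seen for ONE key and return the state after each batch.
  def replay(batches: Seq[Seq[Int]]): Seq[Option[Int]] =
    batches
      .scanLeft(Option.empty[Int])((state, values) => updateFunction(values, state))
      .tail
}
```

An empty batch still advances the counter here, because the function does not look at newValues.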
Since you are trying to save the RDD to an existing Cassandra table, you need to include all the primary key column values in the RDD.
What you can do instead is use one of the methods below to save the RDD to a new table:
saveAsCassandraTable or saveAsCassandraTableEx
For more info look into this.

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume a Kafka DirectStream, process the RDDs for each partition and write the processed values to a DB. When I try to perform reduceByKey (per partition, that is, without the shuffle), I get the following error. Usually on the driver node, we can use sc.parallelize(Iterator) to solve this issue. But I would like to solve it in Spark Streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
val commonIter = rdd.mapPartitionsWithIndex ( (i,iter) => {
val offset = offsetRanges(i)
val records = iter.filter(item => {
(some_filter_condition)
}).map(r1 => {
// Some processing
((field2, field2), (field3, field4))
})
val records.reduceByKey((a,b) => (a._1+b._1, a._2+b._2)) // Getting reduceByKey() is not a member of Iterator
// Code to write to DB
Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
})
}
Is there a more elegant way to do this (process Kafka RDDs for each partition and store them in a DB)?
So... we cannot use Spark transformations within mapPartitionsWithIndex. However, using Scala's own transformation and reduction methods, such as groupBy, helped me solve this issue.
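A minimal sketch of what such a per-partition reduction could look like in plain Scala, assuming the record shape from the error message, ((String, String), (Int, Int)). It uses foldLeft instead of groupBy so that only one running total per key is kept; IteratorReduceSketch and reduceByKeyLocal are made-up names.

```scala
object IteratorReduceSketch {
  type Key = (String, String)
  type Value = (Int, Int)

  // Reduce an iterator of keyed pairs locally (no shuffle),
  // summing both tuple components per key.
  def reduceByKeyLocal(records: Iterator[(Key, Value)]): Map[Key, Value] =
    records.foldLeft(Map.empty[Key, Value]) { case (acc, (key, (a, b))) =>
      val (sumA, sumB) = acc.getOrElse(key, (0, 0))
      acc.updated(key, (sumA + a, sumB + b))
    }
}
```

Inside mapPartitionsWithIndex you would call this on the filtered and mapped iterator, then write the resulting Map to the DB.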
Your records value is an Iterator, not an RDD. Hence you cannot invoke reduceByKey on records.
Syntax issues:
1) The reduceByKey logic looks OK; please remove val before the statement (if not a typo) and attach reduceByKey() after map:
.map(r1 => {
// Some processing
((field2, field2), (field3, field4))
}).reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
2) Add iter.next after the end of each iteration.
3) iter.empty is wrongly placed. Put it after coming out of mapPartitionsWithIndex().
4) Add an iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex ((i,iter) => if (i == 0 && iter.hasNext){
....
} else iter, true)

How to send transformed data from partitions to S3?

I have an RDD which is too big to collect. I have applied a chain of transformations to the RDD and want to send its transformed data directly from its partitions on my slaves to S3. I am currently operating as follows:
val rdd:RDD = initializeRDD
val rdd2 = rdd.transform
rdd2.first // in order to force calculation of RDD
rdd2.foreachPartition sendDataToS3
Unfortunately, the data that gets sent to S3 is untransformed. The RDD looks exactly like it did in stage initializeRDD.
Here is the body of sendDataToS3:
implicit class WriteableRDD[T](rdd:RDD[T]){
def transform:RDD[String] = rdd map {_.toString}
....
def sendPartitionsToS3(prefix:String) = {
rdd.foreachPartition { p =>
val filename = prefix+new scala.util.Random().nextInt(1000000)
val pw = new PrintWriter(new File(filename))
p foreach pw.println
pw.close
s3.putObject(S3_BUCKET, filename, new File(filename))
}
this
}
}
This is called with rdd.transform.sendPartitionsToS3(prefix).
How do I make sure the data that gets sent in sendDataToS3 is the transformed data?
My guess is there is a bug in your code that is not included in the question.
I'm answering anyway just to make sure you are aware of RDD.saveAsTextFile. You can give it a path on S3 (s3n://bucket/directory) and it will write each partition into that path directly from the executors.
I can hardly imagine when you would need to implement your own sendPartitionsToS3 instead of using saveAsTextFile.

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = mapRdd.keys.distinct.collect
ids.foreach({ id =>
val dataRows = mapRdd.filter(_._1 == id).values
dataRows.saveAsTextFile(id)
})
I also tried a groupByKey or reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of combined data rows separated by new lines for that ID number. I want to iterate through the RDD only once using foreach to save the data, but it can't give the values as an RDD:
groupedRdd.foreach({ tup =>
val data = sc.parallelize(List(tup._2)) //nested RDD does not work
data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to "Write to multiple outputs by key Spark - one Spark job". Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any =
NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
key.asInstanceOf[String]
}
object Split {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Split" + args(1))
val sc = new SparkContext(conf)
sc.textFile("input/path")
.map(a => (k, v)) // Your own implementation
.partitionBy(new HashPartitioner(num))
.saveAsHadoopFile("output/path", classOf[String], classOf[String],
classOf[RDDMultipleTextOutputFormat])
sc.stop()
}
}
I just saw a similar answer above, but we actually don't need customized partitions. The MultipleTextOutputFormat will create a file for each key. It is OK for multiple records with the same key to fall into the same partition.
In new HashPartitioner(num), num is the number of partitions you want. If you have a large number of distinct keys, you can set num to a large value, so that each partition will not open too many HDFS file handles.
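To illustrate why several keys can share a partition, here is a small sketch of hash partitioning as a non-negative modulo of the key's hash code. It mirrors the idea behind HashPartitioner but is not Spark's implementation (for one thing, it does not handle null keys); PartitionSketch and partitionFor are hypothetical names.

```scala
object PartitionSketch {
  // Non-negative modulo of the key's hash code: the partition index for `key`.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}
```

All records with the same key get the same index, and with more distinct keys than partitions, some partitions necessarily hold several keys. The output format still writes one file per key within each partition.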
You can directly call saveAsTextFile on the grouped RDD. It saves the data by partition: if you have 4 distinct IDs and you set the grouped RDD's number of partitions to 4, then Spark stores each partition's data in one file (so you can end up with only one file per ID). You can even see the data as iterables for each ID in the filesystem.
This will save the data per user ID:
rdd.map { x => (x.split("\\t+")(1), x) }
  .groupByKey(numPartitions)
  .saveAsObjectFile("file")
If you need to retrieve the data again based on user id you can do something like
val userIdLookupTable = sc.objectFile[(String, Iterable[String])]("file").cache() // could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) // note this returns a sequence; in this case you can just get the first one
Note that there is no particular reason to save to a file in this case; I only did it because the OP asked for it. That said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than the filter approach for accessing IDs, but if you're willing to build off a pull request from Spark, you can check out this answer for a faster approach.
