Spark: Perform SQL function on RDD subsets - apache-spark

I need to perform a SQL function on RDD subsets of increasing size. To do this I have to take subsets from the input RDD with the take function:
def main(args: Array[String]) {
  // set up environment
  val conf = new SparkConf()
    .setMaster("local[5]")
    .setAppName("Test")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)

  val cntPairsRdd = cntsRdd.map(n => {
    val sample = data0.take(n)
    val dataRDD = sc.parallelize(sample)
    val df = dataRDD.toDF()
    val result = df.select( ...)
    val xCnt = result.count
    (n, xCnt)
  })
}
cntsRdd is a set of increasing integers. The take function returns a list, not an RDD, so to make my SQL work I first need to convert the list to an RDD and then to a DataFrame. Unfortunately, Spark does not allow creating another RDD inside a map function. In other words, in Spark you cannot create an RDD inside another RDD. For the same reason Spark does not support SparkContext serialization; I get a serialization exception when trying to call sc.parallelize(sample).
Please advise a workaround to perform a SQL function on RDD subsets, as defined in this scenario.
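One common workaround, sketched here under the assumption that cntsRdd only holds a small number of sizes and that data0's element type supports toDF(): collect the sizes to the driver and loop there, so that sc.parallelize and toDF() are never called inside an RDD transformation. The column "x" is a placeholder for the real select(...).

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Bring the (small) set of subset sizes to the driver.
val cnts: Array[Int] = cntsRdd.collect().sorted

// Build each subset DataFrame on the driver, outside of any RDD transformation.
val cntPairs: Array[(Int, Long)] = cnts.map { n =>
  val sample = data0.take(n)              // local Array with the first n elements
  val df = sc.parallelize(sample).toDF()  // legal here: this runs on the driver
  val xCnt = df.select($"x").count()      // "x" stands in for the real select(...)
  (n, xCnt)
}

// Turn the results back into an RDD if one is needed downstream.
val cntPairsRdd = sc.parallelize(cntPairs)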

Related

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
  // Do something, some custom logic
  ...
  val y = create new RDD from this RDD element x
  ...
  or something else
  // UNION to TempView
  ...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD into an RDD of length 1?
Or, convert the nth entry (x) directly to a DF?
I understand all the set-based cases, but here I want to append immediately when I meet a condition, for the sake of simplicity, i.e. at the level of the individual entry in the RDD.
Now, before this gets a -1 like SO 41356419: I am only suggesting this because I have a specific use case, and to mutate a TempView in Spark SQL I do need such an approach, at least that is my thinking. Not a typical Spark use case, but that is what I am facing.
Thanks in advance
First of all, you can't create an RDD or DF inside a foreach() of another RDD or DF/DS. But you can get the nth element from an RDD on the driver and create a new RDD with that single element.
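A minimal sketch of that first approach, assuming a driver-side SparkContext named sc and an existing RDD named rdd:

// Take the first n + 1 elements on the driver; the last one is entry n (0-based).
val n = 534
val nthElement = rdd.take(n + 1).last
// Parallelize it back into a single-element RDD.
val singleElementRdd = sc.parallelize(Seq(nthElement), 1)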
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val n = 534 // This is the input value (index of the element we're interested in)
    sc.setLogLevel("ERROR")
    // Creating a dummy RDD
    val rdd = sc.parallelize(0 to 999).cache()
    // zipWithIndex pairs each element with its index; filter on the index (._2)
    // and drop it again to get a single-element RDD
    val singletonRdd = rdd.zipWithIndex().filter(_._2 == n).map(_._1)
  }
}
Hope that helps!
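To cover the second part of the question (nth entry directly to a DF), a short sketch, assuming an existing SQLContext named sqlContext and that singletonRdd holds Ints as in the dummy example above:

import sqlContext.implicits._

// An RDD[Int] can be converted directly, naming the single column.
val singletonDf = singletonRdd.toDF("value")
singletonDf.show()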

How to create a DStream from a List of string?

I have a list of strings, but I can't find a way to turn the list into a DStream for Spark Streaming.
I tried this:
val tmpList = List("hi", "hello")
val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList))
val rowRdd = rdd.map(v => Row(v: _*))
But Eclipse says sparkContext is not a member of sqlContext, so how can I do this?
I appreciate your help.
A DStream is a sequence of RDDs, and it is created when you register a receiver against some streaming source like Kafka. For testing, if you want to create a DStream from a list of RDDs, you can do it as follows:
import scala.collection.mutable

// Parallelize the lists directly so each RDD is an RDD[String]
val rdd1 = sqlContext.sparkContext.parallelize(tmpList)
val rdd2 = sqlContext.sparkContext.parallelize(tmpList1)
ssc.queueStream[String](mutable.Queue(rdd1, rdd2))
Hope it answers your question.
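For reference, a minimal self-contained sketch of the same idea; the object name, batch interval, and timeout are illustrative assumptions, not from the question:

import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ListToDStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ListToDStream").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    val tmpList = List("hi", "hello")
    // Each RDD in the queue becomes one micro-batch of the DStream.
    val queue = mutable.Queue(sc.parallelize(tmpList))
    val stream = ssc.queueStream(queue)

    stream.print()
    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}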

Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown

Intent
I'm receiving data from Kafka via a direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size to raw C* data is [several streaming messages to millions of C* rows], BUT the join always yields exactly ONE result [1:1] per message. After the join, the resulting DF is eventually stored to another C* table.
Problem
Even though I'm joining the two DFs on the full Cassandra primary key and pushing the corresponding filter to C*, it seems that Spark is loading the whole C* data-set into memory before actually joining (which I'd like to prevent by using the filter/predicate pushdown). This leads to a lot of shuffling and tasks being spawned, hence the "simple" join takes forever...
def main(args: Array[String]) {
  val conf = new SparkConf()
    .setAppName("test")
    .set("spark.cassandra.connection.host", "xxx")
    .set("spark.cassandra.connection.keep_alive_ms", "30000")
    .setMaster("local[*]")

  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.sparkContext.setLogLevel("INFO")

  // Initialise Kafka
  val kafkaTopics = Set[String]("xxx")
  val kafkaParams = Map[String, String](
    "metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
    "auto.offset.reset" -> "smallest")

  // Kafka stream
  val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)

  // Executed on the driver
  messages.foreachRDD { rdd =>
    // Create an instance of SQLContext
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._

    // Map MyMsg RDD
    val MyMsgRdd = rdd.map{ case (key, MyMsg) => (MyMsg) }

    // Convert RDD[MyMsg] to DataFrame
    val MyMsgDf = MyMsgRdd.toDF()
      .select(
        $"prim1Id" as 'prim1_id,
        $"prim2Id" as 'prim2_id,
        $...
      )

    // Load DataFrame from C* data-source
    val base_data = base_data_df.getInstance(sqlContext)

    // Left join on prim1Id and prim2Id
    val joinedDf = MyMsgDf.join(base_data,
      MyMsgDf("prim1_id") === base_data("prim1_id") &&
      MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
      .filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
        && base_data("prim2_id").isin(MyMsgDf("prim2_id")))

    joinedDf.show()
    joinedDf.printSchema()

    // Select relevant fields

    // Persist
  }

  // Start the computation
  ssc.start()
  ssc.awaitTermination()
}
Environment
Spark 1.6
Cassandra 2.1.12
Cassandra-Spark-Connector 1.5-RC1
Kafka 0.8.2.2
SOLUTION
From discussions on the DataStax Spark Connector for Apache Cassandra mailing list:
Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown
How to create a DF from CassandraJoinRDD
I've learned the following:
Quoting Russell Spitzer
This wouldn't be a case of predicate pushdown. This is a join on a partition key column. Currently only joinWithCassandraTable supports this direct kind of join although we are working on some methods to try to have this automatically done within Spark.
Dataframes can be created from any RDD which can have a schema applied to it. The easiest thing to do is probably to map your joinedRDD[x,y] to Rdd[JoinedCaseClass] and then call toDF (which will require importing your sqlContext implicits.) See the DataFrames documentation here for more info.
So the actual implementation now resembles something like:
// Join myMsg RDD with myCassandraTable
val joinedMsgRdd = myMsgRdd.joinWithCassandraTable(
  "keyspace",
  "myCassandraTable",
  AllColumns,
  SomeColumns(
    "prim1_id",
    "prim2_id"
  )
).map{ case (myMsg, cassandraRow) =>
  JoinedMsg(
    foo = myMsg.foo,
    bar = cassandraRow.bar
  )
}

// Convert RDD[JoinedMsg] to DataFrame
val myJoinedDf = joinedMsgRdd.toDF()
Have you tried joinWithCassandraTable? It should push down to C* all the keys you are looking for...

How to join two RDDs and apply a function to get a result RDD

I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()

  // TODO
  // val result = st.join(ic).mapValues( )

  sc.stop()
}
Here is what I want to do:
List[ST] -> map -> Map(Key, st) -> groupByKey -> Map(Key, List[st])
List[IC] -> map -> Map(Key, ic) -> groupByKey -> Map(Key, List[ic])
ST RDD join IC RDD to get Map(Key, (List[st], List[ic]))
I have a function that compares listST and listIC and returns a List[result], where result contains both SubStTime and IcTemp information:
def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[result]
I don't know how to use mapValues or some other way to get my result.
Thanks
// groupByKey yields Iterables, so convert them to Lists before calling calcIcSt
val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }
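A minimal end-to-end sketch of that answer; the case classes and the body of calcIcSt are placeholders standing in for the question's SubStTime, IcTemp and result types, not the original code:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder types standing in for the question's SubStTime, IcTemp and result.
case class SubStTime(productId: Int, routeNo: Int, time: Long)
case class IcTemp(productId: Int, routeNo: Int, temp: Double)
case class Result(productId: Int, routeNo: Int, time: Long, temp: Double)

object JoinExample {
  // Dummy comparison logic standing in for the real calcIcSt.
  def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[Result] =
    for (s <- st; i <- ic) yield Result(s.productId, s.routeNo, s.time, i.temp)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val stList = List(SubStTime(1, 10, 100L))
    val icList = List(IcTemp(1, 10, 36.5))

    val st = sc.parallelize(stList).map(s => ((s.productId, s.routeNo), s)).groupByKey()
    val ic = sc.parallelize(icList).map(i => ((i.productId, i.routeNo), i)).groupByKey()

    // join gives (Key, (Iterable[SubStTime], Iterable[IcTemp])); convert to Lists.
    val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }
    result.collect().foreach(println)

    sc.stop()
  }
}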

Read from Spark RDD a Kryo File

I'm a Spark & Scala newbie.
I need to read and analyze in Spark a file that I have written from my Scala code with Kryo serialization:
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
val kryo:Kryo = new Kryo()
val output:Output = new Output(new FileOutputStream("filename.ext",true))
//kryo.writeObject(output, feed) (tested both lines)
kryo.writeClassAndObject(output, myScalaObject)
This is pseudo-code for creating a file with my serialized object (myScalaObject), which is a complex object.
The file seems to be written correctly, but I have a problem when I read it into a Spark RDD.
pseudo-code in Spark:
val conf = new SparkConf()
.setMaster("local")
.setAppName("My application")
.set("spark.executor.memory", "1g")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "myScalaObject")
val sc = new SparkContext(conf)
val file=sc.objectFile[myScalaObject]("filename.ext")
val counts = file.count()
When I try to execute it i receive this error:
org.apache.spark.SparkException:
Job aborted: Task 0.0:0 failed 1 times (most recent failure:
Exception failure: java.io.IOException: file: filename.ext not a SequenceFile)
Is it possible to read this type of file in Spark?
If this is not possible, what is a good approach for creating a complex file structure to read in Spark?
Thank you
If you want to read with objectFile, write out the data with saveAsObjectFile.
val myObjects: Seq[MyObject] = ...
val rddToSave = sc.parallelize(myObjects) // Or better yet: construct as RDD from the start.
rddToSave.saveAsObjectFile("/tmp/x")
val rddLoaded = sc.objectFile[MyObject]("/tmp/x")
Alternatively, as zsxwing says, you can create an RDD of the filenames and use map to read the contents of each. If you want each file to be read into a separate partition, parallelize the filenames into separate partitions:
import java.io.FileInputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import org.apache.spark.rdd.RDD

def loadFiles(filenames: Seq[String]): RDD[Object] = {
  def load(filename: String): Object = {
    // Create the Kryo instance inside the task: Kryo itself is not serializable.
    val kryo = new Kryo()
    val input = new Input(new FileInputStream(filename))
    try kryo.readClassAndObject(input) finally input.close()
  }
  val partitions = filenames.length
  sc.parallelize(filenames, partitions).map(load)
}
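A hypothetical usage of that helper; the paths and the MyScalaObject class (standing in for the question's serialized type) are illustrative assumptions:

// Each file becomes one partition; cast the deserialized objects to the real type.
val loaded = loadFiles(Seq("/tmp/part-0.kryo", "/tmp/part-1.kryo"))
val typed = loaded.map(_.asInstanceOf[MyScalaObject])
println(typed.count())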
