Why is the number of partitions different after a join in Spark Streaming? - apache-spark

val sparkConf = new SparkConf()
val streamingContext = new StreamingContext(sparkConf, Minutes(1))
var historyRdd: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
var historyRdd_2: RDD[(String, ArrayList[String])] = streamingContext.sparkContext.emptyRDD
val stream_1 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_1))
val stream_2 = KafkaUtils.createDirectStream[String, GenericData.Record, StringDecoder, GenericDataRecordDecoder](streamingContext, kafkaParams, Set(inputTopic_2))
// dstream_1 is used below but its definition is missing from the post; presumably it is mapped from stream_1 in the same way as dstream_2
val dstream_2 = stream_2.map((r: Tuple2[String, GenericData.Record]) =>
{
//some mapping
})
val historyDStream = dstream_1.transform(rdd => rdd.union(historyRdd))
val historyDStream_2 = dstream_2.transform(rdd => rdd.union(historyRdd_2))
val fullJoinResult = historyDStream.fullOuterJoin(historyDStream_2)
val filtered = fullJoinResult.filter(r => r._2._1.isEmpty)
filtered.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._2.get))
historyRdd_2.unpersist(false) // unpersist the 'old' history RDD
historyRdd_2 = formatted // assign the new history
historyRdd_2.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
val filteredStream = fullJoinResult.filter(r => r._2._2.isEmpty)
filteredStream.foreachRDD{rdd =>
val formatted = rdd.map(r => (r._1 , r._2._1.get))
historyRdd.unpersist(false) // unpersist the 'old' history RDD
historyRdd = formatted // assign the new history
historyRdd.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
}
streamingContext.start()
streamingContext.awaitTermination()
}
}
DStream_1 and DStream_2 have 128 partitions each, but after the join the resulting DStream has 3 partitions, even though I haven't done any repartitioning. I assumed that if both DStreams have the same number of partitions, the joined DStream would have that same number, because the join happens partition to partition. Please correct me if I am wrong.
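A likely explanation, offered as a sketch rather than a confirmed diagnosis of this job: the DStream join overloads that take no partitioner argument build a HashPartitioner sized by the SparkContext's defaultParallelism, not by the partition counts of the two inputs. If that is what happens here, the 3 would be the job's default parallelism. fullOuterJoin also has an overload that takes numPartitions, which lets you choose the count yourself. The program below uses queueStream in local[3] mode as a stand-in for the Kafka streams; the object name, keys, and values are made up for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object JoinPartitionCount {
  def main(args: Array[String]): Unit = {
    // local[3] is an assumption chosen only to make the default parallelism easy to spot
    val conf = new SparkConf().setMaster("local[3]").setAppName("join-partition-count")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Two input streams, each fed by a single 8-partition RDD (stand-ins for the Kafka DStreams)
    val left = ssc.queueStream(
      mutable.Queue[RDD[(String, Int)]](sc.parallelize(Seq("a" -> 1, "b" -> 2), 8)))
    val right = ssc.queueStream(
      mutable.Queue[RDD[(String, Int)]](sc.parallelize(Seq("b" -> 20, "c" -> 30), 8)))

    println(s"defaultParallelism = ${sc.defaultParallelism}")

    // No partitioner given: the partition count is chosen by the default partitioner
    left.fullOuterJoin(right).foreachRDD { rdd =>
      println(s"default join: ${rdd.partitions.length} partitions")
    }

    // Explicit partition count passed to the join
    left.fullOuterJoin(right, 8).foreachRDD { rdd =>
      println(s"explicit join: ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

If the two printed counts differ on your cluster in the same way, passing 128 as the second argument to fullOuterJoin in the original job should keep the joined DStream at 128 partitions.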

Related

How to save an Iterator to ES

I use partitionBy to split my RDD into multiple partitions, and then I want to write each partition to ES.
EsSpark.saveToEs needs an RDD, but after partitionBy the per-partition function only receives an Iterator. Is there a way to save the Iterator to ES, or to
convert an Iterator to an RDD? I use ES-Spark 5.2.2.
The code is below:
var entry = Array("vpn","linux","error")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
var resultRDD=stream.map( record => {
val json = parse(record.value())
val x = json.extract[vpnLogEntry]
if (!x.innerIP.equals("-")){
("vpn",x)
}else{
("linux",x)
}
})
resultRDD.foreachRDD { (rdd,durationTime) =>
val entryToIndexDis = rdd.context.broadcast(entry.zipWithIndex.toMap)
val indexToEntryDis = rdd.context.broadcast(entry.zipWithIndex.map(_.swap).toMap)
rdd.partitionBy(new Partitioner {
override def numPartitions: Int = entryToIndexDis.value.size
override def getPartition(key: Any): Int = {
entryToIndexDis.value.get(key.toString).get
}
}).mapPartitionsWithIndex((index, data) => {
val index_type = indexToEntryDis.value(index)
//here, I want to put vpn data into vpn/vpn of ES,
//and put linux data into linux/linux of ES.
//the variable of data is type of Iterator,
//so can not use EsSpark.saveToEs function
data
}, true).count()
}

Spark Streaming HBase error

I want to insert streaming data into HBase.
This is my code:
val tableName = "streamingz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
print("-----------------------------------------------------------------------------------------------------------")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
tableDesc.addFamily(new HColumnDescriptor("z2".getBytes()))
admin.createTable(tableDesc)
} else {
print("Table already exists!!--------------------------------------------------------------------------------------")
}
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set("fluxAstellia")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "10.32.201.90:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2).map(_.split(" ", -1)).foreachRDD(rdd => {
if (!rdd.partitions.isEmpty) {
val myTable = new HTable(conf, tableName)
rdd.map(rec => {
var put = new Put(rec._1.getBytes)
put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec._2))
myTable.put(put)
}).saveAsNewAPIHadoopDataset(conf)
myTable.flushCommits()
} else {
println("rdd is empty")
}
})
ssc.start()
ssc.awaitTermination()
}
}
I got this error:
:66: error: value _1 is not a member of Array[String]
var put = new Put(rec._1.getBytes)
I'm a beginner and can't work out how to fix this error. I also have a question:
where exactly should the table be created: outside the streaming process or inside it?
Thank you
Your error is on the line var put = new Put(rec._1.getBytes)
The _1, _2, ... accessors exist only on tuples (for example, the key/value pairs of a Map are tuples, with _1 as the key and _2 as the value).
rec is an Array[String] that you got by splitting each string in the stream on space characters, so its elements are accessed by index. If you want the first element, write var put = new Put(rec(0).getBytes). Likewise, on the next line write put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec(1)))
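To make the difference concrete, here is a minimal sketch in plain Scala (no Spark or HBase needed). The object name SplitAccess and the helper toPair are made up for illustration; only the split(" ", -1) call and the rec(0) / rec(1) indexing come from the code above.

```scala
object SplitAccess {
  // Split a line on single spaces; the limit -1 keeps trailing empty fields
  def fields(line: String): Array[String] = line.split(" ", -1)

  // Turn the first two fields into a (rowKey, name) tuple, if both are present
  def toPair(line: String): Option[(String, String)] = {
    val rec = fields(line)
    // Arrays are indexed with rec(i); rec._1 does not compile
    if (rec.length >= 2) Some((rec(0), rec(1))) else None
  }

  def main(args: Array[String]): Unit = {
    toPair("row1 alice") match {
      // Tuples are where _1 and _2 are available
      case Some(pair) => println(s"key=${pair._1} name=${pair._2}")
      case None       => println("line has fewer than two fields")
    }
  }
}
```

Checking rec.length before indexing also avoids an ArrayIndexOutOfBoundsException on lines with fewer fields than expected.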

How to perform multi threading or parallel processing in spark implemented in scala

Hi, I have a Spark Streaming program that reads events from Event Hubs and pushes them to topics. Processing each batch takes almost 10 times the batch interval.
When I try to implement multithreading, I don't see much difference in the processing time.
Is there any way to improve performance, either by processing in parallel or by starting some 1000 threads at a time and just pushing the messages?
class ThreadExample(msg:String) extends Thread{
override def run {
var test = new PushToTopicDriver(msg)
test.push()
// println(msg)
}
}
object HiveEventsDirectStream {
def b2s(a: Array[Byte]): String = new String(a)
def main(args: Array[String]): Unit = {
val eventhubnamespace = "namespace"
val progressdir = "/Event/DirectStream/"
val eventhubname_d = "namespacestream"
val ehParams = Map[String, String](
"eventhubs.policyname" -> "PolicyKeyName",
"eventhubs.policykey" -> "key",
"eventhubs.namespace" -> "namespace",
"eventhubs.name" -> "namespacestream",
"eventhubs.partition.count" -> "30",
"eventhubs.consumergroup" -> "$default",
"eventhubs.checkpoint.dir" -> "/EventCheckpoint_0.1",
"eventhubs.checkpoint.interval" -> "2"
)
println("testing spark")
val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").setMaster("local[4]").setAppName("Eventhubs_Test")
conf.registerKryoClasses(Array(classOf[PublishToTopic]))
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
val sc= new SparkContext(conf)
val hiveContext = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val pool:ExecutorService=Executors.newFixedThreadPool(30)
val ssc = new StreamingContext(sc, Seconds(2))
var dataString :RDD[String] =sc.emptyRDD
val stream=EventHubsUtils.createDirectStreams(ssc,eventhubnamespace,progressdir,Map(eventhubname_d -> ehParams))
val kv1 = stream.map(receivedRecord => (new String(receivedRecord.getBody))).persist()
kv1.foreachRDD(rdd_1 => rdd_1.foreachPartition(line => line.foreach(msg => { val t1 = new ThreadExample(msg); t1.start() })))
ssc.start()
ssc.awaitTermination()
}
}
Thanks,
Ankush Reddy.
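One thing that may be worth trying: the code above starts a new Thread for every message and never waits for it, so thread creation is unbounded and the batch can finish before the pushes do. A common alternative is a fixed-size pool per partition, with the partition waiting for all pushes to complete. The sketch below is plain Scala under stated assumptions: push is a hypothetical stand-in for PushToTopicDriver.push() (whose implementation is not shown), and BoundedPush, pushAll, and the one-minute timeout are made-up names and values.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedPush {
  // Hypothetical stand-in for PushToTopicDriver.push(); returns the message length
  def push(msg: String): Int = msg.length

  // Push every message using at most `threads` concurrent workers,
  // and block until all pushes have finished
  def pushAll(msgs: Iterator[String], threads: Int): Seq[Int] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // toList materializes the partition and submits one task per message
      val futures = msgs.map(m => Future(push(m))).toList
      Await.result(Future.sequence(futures), 1.minute)
    } finally {
      pool.shutdown()
    }
  }
}
```

Inside the streaming job this could be called as rdd_1.foreachPartition(part => BoundedPush.pushAll(part, 30)). Note that with setMaster("local[4]") only four partitions are processed at a time regardless of the pool size, so the number of cores and partitions may matter as much as the thread count.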

Convert a RDD into DataFrame after foreachRDD operation

I am processing logs using Spark Streaming. I parse each log line and convert it into a Java Map. The code is below.
Now I want to convert this Map into a DataFrame.
Any suggestions on how to achieve this?
val sparkConf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
val lines = ssc.textFileStream("hdfs://localhost:9000/test")
process(lines)
def process(lines: DStream[String]): Unit = {
  val maptorow = lines.foreachRDD(rdd => {
    rdd.map(line => getMap(line))
      .map(p =>
        Row(p.get("column1"),
          p.get("column2")))
  }) // how to get dataframe after this?
}
// parseLog is my own parser; its definition is not shown here
def getMap(logs: String): java.util.Map[String, String] = parseLog(logs)
Thanks
foreachRDD returns Unit, so there is nothing useful to store in maptorow. Do the conversion inside foreachRDD and handle each RDD there as a separate set of data.
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
lines.foreachRDD { rdd =>
  // toDF is available for RDDs of tuples or case classes, not for RDD[Row],
  // so map each parsed line to a tuple instead of a Row
  val newRDD = rdd.map(line => getMap(line))
    .map(p => (String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))
  val myDataFrame = newRDD.toDF("column1", "column2")
  // process myDataFrame as a DF
}

Kafka spark directStream can not get data

I'm using the Spark directStream API to read data from Kafka. My code is as follows:
// spark.streaming.kafka.maxRatePerPartition is a Spark setting, so it is set on SparkConf rather than in kafkaParams
val sparkConf = new SparkConf().setAppName("testdirectStreaming").set("spark.streaming.kafka.maxRatePerPartition", "100")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val kafkaParams = Map[String, String](
"auto.offset.reset" -> "smallest",
"metadata.broker.list" -> "10.0.0.11:9092"
)
//I set all of the 3 partitions fromOffset are 0
var fromOffsets:Map[TopicAndPartition, Long] = Map(TopicAndPartition("mytopic",0) -> 0)
fromOffsets+=(TopicAndPartition("mytopic",1) -> 0)
fromOffsets+=(TopicAndPartition("mytopic",2) -> 0)
val kafkaData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, MessageAndMetadata[String, String]](
ssc, kafkaParams, fromOffsets,(mmd: MessageAndMetadata[String, String]) => mmd)
var offsetRanges = Array[OffsetRange]()
kafkaData.transform { rdd =>
offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd
}.map {
_.message()
}.foreachRDD { rdd =>
for (o <- offsetRanges) {
println(s"---${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
rdd.foreachPartition{ partitionOfRecords =>
partitionOfRecords.foreach { line =>
println("===============value:"+line)
}
}
}
I'm sure there is data in the Kafka cluster, but my code does not receive any of it. Thanks in advance.
I found the reason: the old messages in Kafka had already been deleted because the retention period had expired. Setting fromOffset to 0 therefore caused an offset-out-of-range exception, which made Spark reset the offsets to the latest ones, so I did not receive any messages. The solution is to set a fromOffset that still exists on the broker, which avoids the exception.
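One way to pick a safe fromOffset is to clamp each requested offset into the range the broker still holds. The sketch below is plain Scala and shows only the clamping step. It assumes the earliest and latest available offsets per partition have already been looked up from the broker (that lookup is not shown). TopicPartitionKey is a hypothetical stand-in for kafka.common.TopicAndPartition, and OffsetClamp and clampFromOffsets are made-up names.

```scala
object OffsetClamp {
  // Hypothetical stand-in for kafka.common.TopicAndPartition
  final case class TopicPartitionKey(topic: String, partition: Int)

  // Move each requested offset into [earliest, latest] for its partition.
  // Partitions with no known bounds keep the requested offset unchanged.
  def clampFromOffsets(
      requested: Map[TopicPartitionKey, Long],
      earliest: Map[TopicPartitionKey, Long],
      latest: Map[TopicPartitionKey, Long]): Map[TopicPartitionKey, Long] =
    requested.map { case (tp, offset) =>
      val lo = earliest.getOrElse(tp, offset)
      val hi = latest.getOrElse(tp, offset)
      tp -> math.min(math.max(offset, lo), hi)
    }
}
```

With this, a requested offset of 0 on a partition whose oldest retained message is at offset 500 becomes 500, and the resulting map can be converted to the fromOffsets argument of createDirectStream.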

Resources