I use Spark Streaming to receive data from Kafka like this:
val conf = new SparkConf()
conf.setMaster("local[*]").setAppName("KafkaStreamExample")
  .setSparkHome("/home/kufu/spark/spark-1.5.2-bin-hadoop2.6")
  .setExecutorEnv("spark.executor.extraClassPath", "target/scala-2.11/sparkstreamexamples_2.11-1.0.jar")
val threadNum = 3
val ssc = new StreamingContext(conf, Seconds(2))
val topicMap = Map(consumeTopic -> 1)
val dataRDDs: IndexedSeq[InputDStream[(String, String)]] = approachType match {
  case KafkaStreamJob.ReceiverBasedApproach =>
    (1 to threadNum).map(_ =>
      KafkaUtils.createStream(ssc, zkOrBrokers, "testKafkaGroupId", topicMap))
  case KafkaStreamJob.DirectApproach =>
    (1 to threadNum).map(_ =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map[String, String]("metadata.broker.list" -> zkOrBrokers),
        Set[String](consumeTopic)))
}
// dataRDDs.foreach(_.foreachRDD(genProcessing(approachType)))
val dataRDD = ssc.union(dataRDDs)
dataRDD.foreachRDD(genProcessing(approachType))
ssc.start()
ssc.awaitTermination()
genProcessing generates the function that processes each RDD. Each call takes about 5 s because it sleeps for 5 s. The code looks like this:
def eachRDDProcessing(rdd: RDD[(String, String)]): Unit = {
  if (count > max) throw new Exception("Stop here")
  println("--------- num: " + count + " ---------")
  val batchNum = count
  val curTime = System.currentTimeMillis()
  Thread.sleep(5000)
  val family = approachType match {
    case KafkaStreamJob.DirectApproach => KafkaStreamJob.DirectFamily
    case KafkaStreamJob.ReceiverBasedApproach => KafkaStreamJob.NormalFamily
  }
  val families = KafkaStreamJob.DirectFamily :: KafkaStreamJob.NormalFamily :: Nil
  val time = System.currentTimeMillis().toString
  val messageCount = rdd.count()
  rdd.foreach(tuple => {
    val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable,
      KafkaStreamJob.zookeeper, families)
    hBaseConn.openOrCreateTable()
    val puts = new java.util.ArrayList[Put]()
    val strs = tuple._2.split(":")
    val row = strs(1) + ":" + strs(0) + ":" + time
    val put = new Put(Bytes.toBytes(row))
    put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
      Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
    puts.add(put)
    hBaseConn.puts(puts)
    hBaseConn.close()
  })
  count += 1
  println("--------- add " + messageCount + " messages ---------")
}
eachRDDProcessing
But Spark Streaming doesn't run this on multiple threads. Tasks were processed one by one, and each task took around 5 s. My machine has 8 cores, and Spark runs on a single node.
I don't think Spark Streaming will start multiple threads for this, especially on the driver. The point is that if you have multiple nodes, the work launched from genProcessing will run on different nodes.
Further, if you call rdd.foreachPartition(...), I suppose you should get better parallelism.
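For context: the function passed to foreachRDD runs on the driver, and by default Spark Streaming executes output operations one at a time, so the Thread.sleep(5000) in the question delays every batch no matter how many cores are available. The question's code also opens and closes an HBase connection for every record. Below is a minimal sketch of the foreachPartition idea, written as plain Scala so it can be tried without a cluster. `Sink`, `PartitionWriter`, `writePartition` and `batchSize` are hypothetical names introduced for illustration; the question's HBaseConnection would play the role of `Sink`.

```scala
// Hypothetical minimal sink interface (an assumption, not part of Spark or HBase).
trait Sink {
  def write(rows: Seq[String]): Unit
  def close(): Unit
}

object PartitionWriter {
  // Handles the records of ONE partition: opens a single sink, writes the
  // message values in batches, and always closes the sink afterwards.
  // Returns the number of rows written.
  def writePartition(tuples: Iterator[(String, String)],
                     openSink: () => Sink,
                     batchSize: Int = 500): Int = {
    val sink = openSink() // one connection per partition, not per record
    var written = 0
    try {
      tuples.map(_._2).grouped(batchSize).foreach { batch =>
        sink.write(batch)
        written += batch.size
      }
    } finally {
      sink.close()
    }
    written
  }
}
```

Inside the streaming job this would be called as `rdd.foreachPartition { tuples => PartitionWriter.writePartition(tuples, openSink); () }`, where `openSink` is a serializable function that creates the connection on the executor. Each partition is then handled by its own task, so with local[*] several partitions can be written concurrently.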
Related
I am reading data from a file and have reached a point where the data type is Iterator[Char]. Is there a way to transform an Iterator[Char] into an RDD[String], which I can then turn into a DataFrame/Dataset using a case class?
Below is the code:
val fileDir = "inputFileName"
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(171).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_char = remove_comp.map( _.toChar)
This returns convert_char: Iterator[Char] = non-empty iterator
Thanks
Not sure what you are trying to do, but this should answer your question:
val ic: Iterator[Char] = ???
val spark : SparkSession = ???
val rdd: RDD[String] = spark.sparkContext.parallelize(ic.map(_.toString).toSeq)
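Note that the snippet above produces one single-character string per element. If the intent is instead one string per 171-byte record (this is an assumption about the question, which does not say so), the bytes can be turned into strings before they are parallelized. The sketch below mirrors the question's steps (group into fixed-size records, overwrite positions 2 and 3 with a space) in plain Scala. `FixedWidthRecords` and `toRecords` are hypothetical names, and decoding as ISO-8859-1 is an assumed choice of charset.

```scala
import java.nio.charset.StandardCharsets

object FixedWidthRecords {
  // Splits `bytes` into records of `recordLen` bytes, blanks out positions 2 and 3
  // of each record (as the question does with update(2, 32) and update(3, 32)),
  // and decodes every record into one String.
  def toRecords(bytes: Array[Byte], recordLen: Int): Seq[String] =
    bytes.grouped(recordLen).map { rec =>
      val copy = rec.clone() // do not mutate the input array
      if (copy.length > 2) copy(2) = 32
      if (copy.length > 3) copy(3) = 32
      new String(copy, StandardCharsets.ISO_8859_1) // assumed single-byte encoding
    }.toVector
}
```

The result can then be passed to `spark.sparkContext.parallelize(FixedWidthRecords.toRecords(result, 171))` to obtain an RDD[String] with one element per record.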
def textfile = {
  val ssc = new StreamingContext(conf, Seconds(10))
  val lines = ssc.textFileStream("hdfs://master:9000/streaming/")
  val words = lines.flatMap(_.split("\\s"))
  val pairs = words.map(word => (word, 1))
  val wordCounts = pairs.reduceByKey(_ + _)
  wordCounts.print()
  ssc.start()
  ssc.awaitTermination()
}
The results do not show up
textFileStream only scans the new files after you start the streaming application. If you want to scan the existing files, you can use the following workaround:
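import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  directory,
  filter = (path: Path) => !path.getName().startsWith("."), // skip hidden files
  newFilesOnly = false).map(_._2.toString)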
I have a question regarding the mechanics of Spark's DataFrameReader. I would appreciate it if anybody could help. Let me explain the scenario.
I am creating a DataFrame from a DStream like this. This is the input data:
var config = new HashMap[String, String]()
config += ("zookeeper.connect" -> zookeeper)
config += ("partition.assignment.strategy" -> "roundrobin")
config += ("bootstrap.servers" -> broker)
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder")
config += ("group.id" -> "default")
val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, config.toMap, Set(topic)).map(_._2)
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
    val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
    val rddDF = sqlContext.read.json(rddJson)
    rddDF.registerTempTable("inputData")
    val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD", "A", numOfPartiton, lowerBound, upperBound)
Here is the code of ReadDataFrameHelper
def readDataFrameHelperFromDB(sqlContext: HiveContext, jdbcUrl: String, dbTableOrQuery: String,
    columnToPartition: String, numOfPartiton: Int, lowerBound: Int, highBound: Int): DataFrame = {
  val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
    columnName = columnToPartition,
    lowerBound = lowerBound,
    upperBound = highBound,
    numPartitions = numOfPartiton,
    connectionProperties = new java.util.Properties()
  )
  jdbcDF
}
Lastly, I am doing a join like this:
val joinedData = rddDF.join(dbDF, rddDF("ID") === dbDF("ID")
    && rddDF("CODE") === dbDF("CODE"), "left_outer")
  .drop(dbDF("code"))
  .drop(dbDF("id"))
  .drop(dbDF("number"))
  .drop(dbDF("key"))
  .drop(dbDF("loaddate"))
  .drop(dbDF("fid"))
joinedData.show()
My input DStream will have about 1,000 rows and the database table will contain millions of rows. So when I do this join, will Spark load all the rows from the database, or will it read only those rows whose code and id appear in the input DStream?
As specified by zero323, I have also confirmed that the table is read in full. I checked the DB session logs and saw that the whole dataset was being loaded.
Thanks zero323
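One commonly used way to avoid reading the whole table is to push the filter into the database yourself, by passing a subquery instead of a table name as the `table` argument of `read.jdbc`. The sketch below only builds that subquery string. `JdbcPushdown`, `subqueryFor` and the alias `filtered` are hypothetical names. It assumes the key column is numeric (so no quoting or escaping is needed), that the number of keys per batch is small enough for an IN list, and that the database accepts `AS` before a table alias, which some databases, such as Oracle, do not.

```scala
object JdbcPushdown {
  // Builds a derived-table expression that restricts `table` to the given keys.
  // Only numeric keys are supported here, so no string escaping is required.
  def subqueryFor(table: String,
                  column: String,
                  ids: Seq[Long],
                  alias: String = "filtered"): String = {
    require(ids.nonEmpty, "ids must not be empty")
    val inList = ids.distinct.mkString(", ")
    s"(SELECT * FROM $table WHERE $column IN ($inList)) AS $alias"
  }
}
```

In the job above, the keys could be collected from the small side, for example with `rddDF.select("ID").distinct().collect().map(_.getLong(0))` (assuming ID was inferred as a long), and the resulting string passed as `dbTableOrQuery` to `readDataFrameHelperFromDB` in place of "ABCD". The database then evaluates the filter, so only the matching rows are transferred to Spark.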
I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()
  // TODO
  // val result = st.join(ic).mapValues( )
  sc.stop()
}
Here is what I want to do:
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function that compares listST and listIC and returns a List[result], where each result contains both SubStTime and IcTemp information:
def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[result]
I don't know how to use mapValues, or some other way, to get my result.
Thanks
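// groupByKey yields Iterable values, so convert them to List before calling calcIcSt
val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }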
What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
Use the union operation to append to the lookup RDD and persist it after each union.
Save the state in a file or database and load it at the start of each batch.
From a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{StreamingContext, Seconds}
// assumes sc is an existing SparkContext, e.g. the one provided by spark-shell
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))

// checkpointing is mandatory
ssc.checkpoint("_checkpoints")

val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)

import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  // new state = length of the incoming value + previously stored sum (0 if none)
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()

// nothing is computed or printed until the streaming context is started
ssc.start()