How to join a DStream and a JDBC RDD with checkpointing enabled? - apache-spark

We have a Spark Streaming job with checkpointing enabled. It runs correctly the first time, but throws the exception below when restarted from a checkpoint.
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because
the values transformation and count action cannot be performed inside
of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
at org.apache.spark.rdd.RDD.union(RDD.scala:565)
at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
Please suggest any workaround for this issue.
Sample app below:
String URL = "jdbc:oracle:thin:" + USERNAME + "/" + PWD + "@//" + CONNECTION_STRING;
Map<String, String> options = ImmutableMap.of(
"driver", "oracle.jdbc.driver.OracleDriver",
"url", URL,
"dbtable", "READINGS_10K",
"fetchSize", "10000");
DataFrame OracleDB_DF = sqlContext.load("jdbc", options);
JavaPairRDD<String, Row> OracleDB_RDD = OracleDB_DF.toJavaRDD()
.mapToPair(x -> new Tuple2<>(x.getString(0), x));
Dstream.transformToPair(rdd ->
rdd.mapToPair(record ->
new Tuple2<>(record.getKey().toString(), record))
.join(OracleDB_RDD)) // <-- PairRDD.join inside DStream transformation
.print();
Spark version 1.6, running in YARN cluster mode.

Let me start with the question I'm sure you've already been asking yourself.
How big is OracleDB_RDD?
If it's small enough, it can act as a lookup (dimension) table and be broadcast first. That would make your solution not only work but also efficient.
(That's also why Spark SQL 2.0 makes this and similar questions largely obsolete: this is the sort of optimization the query optimizer applies for you.)
If it's large, you have to create the DataFrame inside the foreachRDD action (as described in "DataFrame and SQL Operations" in the Spark Streaming guide), or create your own DStream that returns an RDD for a join between DStreams (see ConstantInputDStream). Both variants build the RDD anew for every batch instead of restoring it from the checkpoint without a SparkContext, which is what triggers the exception above.
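A minimal sketch of the "create the DataFrame inside foreachRDD" option, written in Scala against the Spark 1.6 API. The object name `JdbcJoinSketch`, its method names, the `(String, String)` record type, and the assumption that the join key is the first (string) column of the table are illustrative and not taken from the question. The JDBC-backed RDD is rebuilt in each batch rather than captured in the DStream closure; the cost is that the table is re-read every batch.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.streaming.dstream.DStream

object JdbcJoinSketch {
  // Hypothetical helper: join one micro-batch with a keyed reference RDD.
  def joinBatch(batch: RDD[(String, String)],
                reference: RDD[(String, Row)]): RDD[(String, (String, Row))] =
    batch.join(reference)

  // jdbcOptions is assumed to hold the same keys as in the question
  // ("driver", "url", "dbtable", "fetchSize").
  def joinWithJdbc(stream: DStream[(String, String)],
                   jdbcOptions: Map[String, String]): Unit =
    stream.foreachRDD { batch =>
      // Lazily obtain the SQLContext from the batch's own SparkContext,
      // so nothing Spark-related has to be restored from the checkpoint.
      val sqlContext = SQLContext.getOrCreate(batch.sparkContext)
      val reference = sqlContext.read.format("jdbc").options(jdbcOptions).load()
        .rdd
        .map(row => (row.getString(0), row)) // assumption: key is column 0
      joinBatch(batch, reference).take(10).foreach(println)
    }
}
```

If the table rarely changes, re-reading it per batch may be wasteful; whether that matters depends on the table size and batch interval.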

Related

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame. In my use case I should submit one job (either invoking a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom-formatted date column and logging the number of partitions before and after, and that works as expected.
However, when I look at the total number of jobs, it is higher than the number of partitions. For some of the partitions there are two or three jobs!
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
partition => {
val partitionObjectList = new java.util.ArrayList[String]()
logger.info("partitionIndex = {}",TaskContext.getPartitionId());
val partitionCounter:AtomicLong = new AtomicLong(0)
val partitionSize:AtomicLong = new AtomicLong(0)
val paritionColumnName:AtomicReference[String] = new AtomicReference[String]();
// Iterate over the objects in the given partition
val updatedPartition = partition.map( record => {
import yearMonthQueryDF.sparkSession.implicits._
partitionCounter.incrementAndGet()
val recordSize: Long = record.getAs[String]("object_size").toLong
partitionSize.addAndGet(recordSize) // accumulate, otherwise the logged total size stays 0
partitionObjectList.add(record.getAs("object_key"))
paritionColumnName.set(record.getAs("partition_column_name"))
record
}
).toList
logger_ref.info("No.of Elements in Partition ["+paritionColumnName.get()+"] are =["+partitionCounter.get()+"] Total Size=["+partitionSize.get()+"]")
// Submit a job for the partition
// jobUtil.submitJob(paritionColumnName.get(),partitionObjectList,partitionSize.get())
updatedPartition.toIterator
}
)
Another thing that makes debugging harder is that the logging statements inside the mapPartitions() call are not found in the container error logs. Since they are executed on the worker nodes and not on the master node, I expected to find them in the container logs rather than in the master node logs. I still need to figure out why I only see stderr logs but not stdout logs on the containers.
Thanks
Sateesh

How to use Spark to write to HBase with multiple threads

I'm using Spark to write data to HBase, but during the write stage only one executor and one core are active.
Is something in my code preventing it from writing in parallel, and what should I do to make the write faster?
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss:SparkSession,tableList: List[String], df:DataFrame): Unit ={
val tableName = tableList(0)
val rowKeyName = tableList(4)
val rowKeyType = tableList(5)
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
// Write to HBase
val sc = ss.sparkContext
sc.hadoopConfiguration.addResource(hbaseConf)
val columns = df.columns
val result = df.rdd.mapPartitions(par=>{
par.map(row=>{
var rowkey:String =""
if("String".equals(rowKeyType)){
rowkey = row.getAs[String](rowKeyName)
}else if("Long".equals(rowKeyType)){
rowkey = row.getAs[Long](rowKeyName).toString
}
val put = new Put(Bytes.toBytes(rowkey))
for(name<-columns){
var value = row.get(row.fieldIndex(name))
if(value!=null){
put.addColumn(Bytes.toBytes("cf"),Bytes.toBytes(name),Bytes.toBytes(value.toString))
}
}
(new ImmutableBytesWritable,put)
})
})
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Result])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You cannot directly control how many parallel executions write to HBase.
You can, however, start multiple Spark jobs from a multithreaded client program.
For example, a shell script can trigger multiple spark-submit commands to induce parallelism. Each Spark job works on its own set of data, independently of the others, and pushes it into HBase.
This can also be done with the Spark Java/Scala SparkLauncher API, combined with the Java concurrency API (e.g. the Executor framework).
import org.apache.spark.launcher.SparkLauncher

// Set Spark properties; only the basic ones are shown here.
// They are overridden if the same properties are set in the main class.
val sparkLauncher = new SparkLauncher()
  .setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")
// Launch the Spark application.
val sparkLauncher1 = sparkLauncher.startApplication()
// Get the application ID (may be null until the application has been submitted).
val jobAppId = sparkLauncher1.getAppId
// Poll the job status (e.g. SUBMITTED, RUNNING) until it reaches a final state.
while (!sparkLauncher1.getState.isFinal) {
  println(sparkLauncher1.getState.toString)
  Thread.sleep(1000)
}
However, the challenge is to track each job for failure and to recover automatically. This can be tricky, especially when partial data has already been written to HBase, i.e. when a job fails before processing the complete set of data assigned to it. You may have to clean that data out of HBase automatically before retriggering the job.
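A rough sketch of how several launches could be run concurrently and checked for failure, using Scala futures on top of the SparkLauncher API. The object name `ParallelLaunchSketch`, the job labels used as map keys, and the rule "anything other than FINISHED needs a retry" are assumptions for illustration. It also assumes each handle eventually reports a final state; cleaning up partially written HBase data is not covered.

```scala
import java.util.concurrent.CountDownLatch
import scala.concurrent.{ExecutionContext, Future, blocking}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object ParallelLaunchSketch {
  // Start one application and block until its handle reports a final state.
  def launchAndWait(launcher: SparkLauncher): SparkAppHandle.State = {
    val done = new CountDownLatch(1)
    val handle = launcher.startApplication(new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit =
        if (h.getState.isFinal) done.countDown()
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })
    done.await()
    handle.getState
  }

  // Launch all jobs concurrently; the map keys are hypothetical job labels.
  def launchAll(jobs: Map[String, SparkLauncher])
               (implicit ec: ExecutionContext): Future[Map[String, SparkAppHandle.State]] =
    Future.traverse(jobs.toSeq) { case (name, launcher) =>
      Future(blocking(name -> launchAndWait(launcher)))
    }.map(_.toMap)

  // Assumption: every job that did not end in FINISHED is a retry candidate.
  def needsRetry(states: Map[String, SparkAppHandle.State]): Set[String] =
    states.collect {
      case (name, state) if state != SparkAppHandle.State.FINISHED => name
    }.toSet
}
```

The labels returned by `needsRetry` could then drive the clean-up and resubmission step described above.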

In Apache Spark, how to make a task to always execute on the same machine?

In its simplest form, RDD is merely a placeholder of chained computations that can be arbitrarily scheduled to be executed on any machine:
val src = sc.parallelize(0 to 1000)
val rdd = src.mapPartitions { itr =>
Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
val vs = rdd.collect()
println(vs.mkString)
}
/* yielding:
1230123012301230
0321032103210321
2130213021302130
*/
This behaviour can be overridden by persisting any of the upstream RDDs, so that the Spark scheduler minimises redundant computation:
val src = sc.parallelize(0 to 1000)
src.persist()
val rdd = src.mapPartitions { itr =>
Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
val vs = rdd.collect()
println(vs.mkString)
}
/* yield:
2013201320132013
2013201320132013
2013201320132013
each partition has a fixed executorID
*/
Now my problem is:
I don't like the vanilla caching mechanism (see this post: In Apache Spark, can I incrementally cache an RDD partition?) and have written my own caching mechanism (by implementing a new RDD). Since the new caching mechanism can only read existing values from local disk/memory, with multiple executors my cache for a partition is missed every time that partition is executed in a task on another machine.
So my question is:
How do I mimic Spark's RDD persistence implementation to ask the DAG scheduler to enforce or suggest locality-aware task scheduling, without actually calling the .persist() method (which would be unnecessary here)?

Multiple operations/aggregations on the same Dataframe/Dataset in Spark Structured Streaming

I use Spark 2.3.2.
I'm receiving data from Kafka. I must do multiple aggregations on the same data. Then all aggregations results will go to the same database (columns or tables may be changed). For example:
val kafkaSource = spark.readStream.format("kafka") ...
val agg1 = kafkaSource.groupBy().agg ...
val agg2 = kafkaSource.groupByKey().mapGroupsWithState() ...
val agg3 = kafkaSource.groupByKey().mapGroupsWithState() ...
But when I call writeStream for each aggregation result:
agg1.writeStream.foreach().start()
agg2.writeStream.foreach().start()
agg3.writeStream.foreach().start()
Spark reads the data independently for each writeStream. Is this approach efficient?
Can I do multiple aggregations with one writeStream? If so, is that approach efficient?
Every writeStream operation results in a new streaming query. Every streaming query reads from the source and executes the entire query plan. Unlike with DStreams, there is no cache/persist option available on the streaming Dataset.
In Spark 2.4, a new API, foreachBatch, was introduced to handle this kind of scenario more efficiently.
Inside foreachBatch, caching can be used to avoid multiple reads:
kafkaSource.writeStream.foreachBatch((df: DataFrame, id: Long) => {
df.persist()
val agg1 = df.groupBy().agg ...
val agg2 = df.groupByKey().mapGroupsWithState() ...
val agg3 = df.groupByKey().mapGroupsWithState() ...
df.unpersist()
}).start()
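A fuller sketch of the foreachBatch pattern, assuming Spark 2.4 or later. The object name `ForeachBatchSketch`, the columns `key` and `value`, the two aggregations, and the Parquet output paths are all hypothetical stand-ins for the elided parts above. Note that aggregations computed inside foreachBatch see only the current micro-batch, so this is not a drop-in replacement for aggregations that keep state across batches.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, max}
import org.apache.spark.sql.streaming.StreamingQuery

object ForeachBatchSketch {
  // Hypothetical per-batch aggregations over assumed columns "key" and "value".
  def countsPerKey(batch: DataFrame): DataFrame =
    batch.groupBy("key").agg(count("*").as("events"))

  def maxPerKey(batch: DataFrame): DataFrame =
    batch.groupBy("key").agg(max("value").as("max_value"))

  // Cache the micro-batch once, run both aggregations on it, then release it.
  def writeAggregates(batch: DataFrame, batchId: Long): Unit = {
    batch.persist()
    try {
      countsPerKey(batch).write.mode("append").parquet("/tmp/agg/counts") // assumed path
      maxPerKey(batch).write.mode("append").parquet("/tmp/agg/max")       // assumed path
    } finally {
      batch.unpersist()
    }
  }

  // A single streaming query reads the source once per micro-batch.
  def start(source: DataFrame): StreamingQuery = {
    val writer: (DataFrame, Long) => Unit = writeAggregates
    source.writeStream.foreachBatch(writer).start()
  }
}
```

Passing a value with an explicit `(DataFrame, Long) => Unit` type is a deliberate choice here, so that the Scala overload of `foreachBatch` is picked regardless of what the body's last expression returns.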

How to write DataFrame (built from RDD inside foreach) to Kafka?

I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
val activityDF = rdd
.toDF()
.selectExpr(
"timestamp_hour", "referrer", "action",
"prevPage", "page", "visitor", "product", "inputProps.topic as topic")
val producerRecord = new ProducerRecord(topicc, activityDF)
kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch;
found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame]
(which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]
required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String]
Error occurred in an application involving default arguments.
Call collect on activityDF to get the records themselves (an Array[Row] rather than a Dataset[Row]) and send those to Kafka.
Note that collect gives you a collection of records, so you will have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
val pr: ProducerRecord = // map a to pr
kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
// ...transform a to ProducerRecord
kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transforming the DataFrame (= Dataset[Row]) into a Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, all the data is aggregated at the driver, which then writes it out. (1) This can crash the driver if there is too much data, and (2) there is no parallelism in the write.
That's 100% correct. I wish I had said it myself :)
You may want to use the approach described in "Writing Stream Output to Kafka" instead.
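One way to avoid collect while keeping a plain KafkaProducer is to write from the executors, one producer per partition. The sketch below assumes string keys and values and a comma-separated encoding of each row; the object name `KafkaPartitionWriterSketch`, its parameters, and the encoding are illustrative choices, not something the question or the Kafka sink documentation prescribes.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{DataFrame, Row}

object KafkaPartitionWriterSketch {
  // Hypothetical encoding: one comma-separated line per row, no key.
  def toRecord(topic: String, row: Row): ProducerRecord[String, String] =
    new ProducerRecord[String, String](topic, row.mkString(","))

  // Runs on the executors: each partition opens its own producer,
  // so no data is pulled back to the driver.
  def write(df: DataFrame, topic: String, bootstrapServers: String): Unit =
    df.rdd.foreachPartition { rows =>
      val props = new Properties()
      props.put("bootstrap.servers", bootstrapServers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try rows.foreach(row => producer.send(toRecord(topic, row)))
      finally producer.close() // close() waits for pending sends to complete
    }
}
```

Inside the question's foreachRDD block this could be called as `KafkaPartitionWriterSketch.write(activityDF, topic, servers)`, where `topic` and `servers` are values you supply. Creating a producer per partition per batch has some overhead; a pooled or lazily created producer per executor is a common refinement.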
