Spark streaming query loads datasource twice - apache-spark

I have implemented my own structured streaming data source in Spark against a proprietary vendor messaging system. It uses V2 of the structured streaming API, implementing MicroBatchReadSupport and DataSourceRegister. I modeled it largely after some examples found here, and I also followed the advice given at this Stack Overflow post. At first everything seems to start up properly when I call load on the readStream. However, when I direct the query to a writeStream, Spark tries to instantiate another MicroBatchReadSupport. This fails fast because I have a check in the createMicroBatchReader method to see whether a schema was provided, and to throw an exception if not. And in the case of the second call to createMicroBatchReader, a schema isn't even provided, even though the initial query did provide one. My code to start the stream (closely following examples from the Spark documentation) looks like the following:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object Streamer {

  def main(args: Array[String]): Unit = {

    val schema = StructType(
      StructField("TradeId", LongType, nullable = true) ::
      StructField("Source", StringType, nullable = true) :: Nil
    )

    val spark = SparkSession
      .builder
      .getOrCreate()

    val ampsStream = spark.readStream
      .format("amps")
      .option("topic", "/test")
      .option("server", "SOME_URL")
      .schema(schema)
      .load()

    ampsStream.printSchema()

    val query = ampsStream.writeStream.format("console").start()

    query.awaitTermination()
  }
}
I've put breakpoints and debug statements in to test, and createMicroBatchReader gets called again right when I get to writeStream.start. As mentioned, the odd thing too is that the second time around the Optional that is passed into createMicroBatchReader is empty, whereas the first call properly has the schema. Any guidance would be greatly appreciated.
EDIT: I added some debugging statements and tested this out with the above-mentioned repo at https://github.com/hienluu/wikiedit-streaming, and I see the exact same issue when running WikiEditSourceV2Example.scala from that repo. Not sure if this is a bug, or if the author of the aforementioned repo and I are missing something.
EDIT 2: Adding the code for the amps streaming source
import java.util.Optional
import org.apache.spark.internal.Logging
import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types.StructType
class AmpsStreamingSource extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {
    println("AmpsStreamingSource.createMicroBatchReader was called")
    if (schema.isPresent) new AmpsMicroBatchReader(schema.get, options)
    else throw new IllegalArgumentException("Must provide a schema for amps stream source")
  }
}
and the signature of AmpsMicroBatchReader
class AmpsMicroBatchReader(schema: StructType, options: DataSourceOptions)
extends MicroBatchReader with MessageHandler with Logging
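For completeness, one workaround I am considering (just a sketch, assuming a hard-coded fallback schema is acceptable; the class name and default schema below are illustrative) is to stop throwing when the Optional is empty and fall back to a built-in schema instead:
import java.util.Optional

import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.types._

class AmpsStreamingSourceWithFallback extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister {

  // Illustrative fallback; mirrors the schema passed to readStream above.
  private val defaultSchema = StructType(
    StructField("TradeId", LongType, nullable = true) ::
    StructField("Source", StringType, nullable = true) :: Nil
  )

  override def shortName(): String = "amps"

  override def createMicroBatchReader(schema: Optional[StructType],
                                      checkpointLocation: String,
                                      options: DataSourceOptions): MicroBatchReader = {
    // Use the caller-provided schema when present, otherwise the fallback.
    val effectiveSchema = if (schema.isPresent) schema.get else defaultSchema
    new AmpsMicroBatchReader(effectiveSchema, options)
  }
}
That avoids the exception on the second call, but I would still like to understand why createMicroBatchReader is invoked twice and why the schema is missing the second time.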

Related

get spark error 'No Encoder found for java.io.Serializable' for case class w/ one attribute defined as 'Serializable'

When I run the code below in Failing Code I get the error message mentioned in the title, with a stack trace like the one shown below in Error and Stack Trace.
As you can see, I have tried to make an encoder available as an implicit, but perhaps I made some mistake along the way.
Side note: one approach that did work to some extent was to use a Kryo encoder (this is currently commented out). The downside is that when you do a show() on the Dataset you get a bunch of hex bytes as output, which is not very useful for debugging.
My next approach will be to try a Java bean instead of a case class.
I will post the results; hopefully it will work. But I'm wondering if there is some way to do this and retain the case class approach. Many thanks!
Error and Stack Trace
No Encoder found for java.io.Serializable
- field (class: "java.io.Serializable", name: "value")
- root class: "com.Item"
java.lang.UnsupportedOperationException: No Encoder found for java.io.Serializable
- field (class: "java.io.Serializable", name: "value")
- root class: "com.Item"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:591)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:73)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:577)
at scala.collection.immutable.List.map(List.scala:297)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562)
Failing Code
import com.typesafe.scalalogging.LazyLogging
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.{Encoder, SparkSession}
import org.testng.annotations.Test
case class Item(name: String, value: java.io.Serializable)
object TestSessionFactory {
  def getSession(master: String = "local"): SparkSession = {
    SparkSession.builder
      .master(master)
      .config("spark.driver.bindAddress", "127.0.0.1")
      .appName("test").getOrCreate()
  }
}

class CaseClassWithSerializableTest extends LazyLogging {

  @Test
  def testWrite(): Unit = {
    val sparkSession: SparkSession = TestSessionFactory.getSession()
    import sparkSession.implicits._
    implicit val theSerializableEncoder: Encoder[java.io.Serializable] = ExpressionEncoder()
    // implicit val myCaseClassEncoder: Encoder[Item] = org.apache.spark.sql.Encoders.kryo[Item]
    val innerValue: java.io.Serializable = new java.lang.String("innerValue")
    val innerValue2: java.io.Serializable = new java.lang.Integer(200)
    val list =
      List(
        Item("joe", innerValue),
        Item("joe", innerValue2)
      )
    val dataset = list.toDS().as[Item]
    dataset.printSchema()
    dataset.show()
  }
}
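For reference, the Kryo-based workaround mentioned above (which runs, but prints opaque hex bytes from show()) looks roughly like the sketch below; the object name is just for illustration, and Item is the case class defined above:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

object KryoEncoderVariant {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .config("spark.driver.bindAddress", "127.0.0.1")
      .appName("kryo-test")
      .getOrCreate()

    // Encode the whole case class with Kryo instead of deriving an ExpressionEncoder.
    implicit val itemEncoder: Encoder[Item] = Encoders.kryo[Item]

    val innerValue: java.io.Serializable = "innerValue"
    val innerValue2: java.io.Serializable = Integer.valueOf(200)

    val dataset = spark.createDataset(List(Item("joe", innerValue), Item("joe", innerValue2)))
    dataset.printSchema() // a single binary column
    dataset.show()        // rows display as hex bytes, as noted above

    spark.stop()
  }
}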

How to write integration tests for Sparks new Structured Streaming?

Trying to test Spark Structured Streams ...and failing... how can I test them properly?
I followed the general Spark testing question from here, and my closest try was [1], looking something like this:
import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode
class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {
describe("Structured Streaming") {
it("Read file from system") {
val schema = new StructType()
.add("station_id", IntegerType)
.add("name", StringType)
.add("lat", DoubleType)
.add("long", DoubleType)
.add("dockcount", IntegerType)
.add("landmark", StringType)
.add("installation", DateType)
val sourceDF = spark.readStream
.option("header", "true")
.schema(schema)
.csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
.coalesce(1)
val countSource = sourceDF.count()
val query = sourceDF.writeStream
.format("memory")
.queryName("Output")
.outputMode(OutputMode.Append())
.start()
.processAllAvailable()
assert(countSource === 70)
}
}
}
Sadly it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
I also found this issue at the spark-testing-base repo and wonder whether it is even possible to test Spark Structured Streaming at all.
I want to have integration tests, and maybe even use Kafka on top for testing checkpointing or specific corrupt-data scenarios. Can someone help me out?
Last but not least, I figured the version may also be a constraint - I currently develop against 2.1.0, which I need because of the Azure HDInsight deployment options. Self-hosting is an option if that is the blocker.
Did you solve this?
You are doing a count() on a streaming dataframe before starting the execution by calling start().
If you want a count, how about doing this?
sourceDF.writeStream
.format("memory")
.queryName("Output")
.outputMode(OutputMode.Append())
.start()
.processAllAvailable()
val results: java.util.List[Row] = spark.sql("select * from Output").collectAsList()
assert(results.size() === 70)
You can also use the StructuredStreamingBase trait from @holdenk's spark-testing-base library:
https://github.com/holdenk/spark-testing-base/blob/936c34b6d5530eb664e7a9f447ed640542398d7e/core/src/test/2.2/scala/com/holdenkarau/spark/testing/StructuredStreamingSampleTests.scala
Here's an example of how to use it:
class StructuredStreamingTests extends FunSuite with SharedSparkContext with StructuredStreamingBase {
override implicit def reuseContextIfPossible: Boolean = true
test("add 3") {
import spark.implicits._
val input = List(List(1), List(2, 3))
val expected = List(4, 5, 6)
def compute(input: Dataset[Int]): Dataset[Int] = {
input.map(elem => elem + 3)
}
testSimpleStreamEndState(spark, input, expected, "append", compute)
}}
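Another option (a sketch only, not part of spark-testing-base; note that MemoryStream lives in the internal org.apache.spark.sql.execution.streaming package) is to drive the query from an in-memory source and assert on the memory sink. The names below are illustrative:
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

object MemoryStreamSpecSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("memory-stream-test").getOrCreate()
    import spark.implicits._
    implicit val sqlContext: SQLContext = spark.sqlContext

    // Feed test data through an in-memory streaming source.
    val input = MemoryStream[Int]
    val transformed = input.toDS().map(_ + 3)

    val query = transformed.writeStream
      .format("memory")
      .queryName("Output")
      .outputMode("append")
      .start()

    input.addData(1, 2, 3)
    query.processAllAvailable()

    // Read the results back from the memory sink and check them.
    val result = spark.table("Output").as[Int].collect().sorted
    assert(result.sameElements(Array(4, 5, 6)))

    query.stop()
    spark.stop()
  }
}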

Serialization of transform function in checkpointing

I'm trying to understand Spark Streaming's RDD transformations and checkpointing in the context of serialization. Consider the following example Spark Streaming app:
private val helperObject = HelperObject()
private def createStreamingContext(): StreamingContext = {
val conf = new SparkConf()
.setAppName(Constants.SparkAppName)
.setIfMissing("spark.master", Constants.SparkMasterDefault)
implicit val streamingContext = new StreamingContext(
new SparkContext(conf),
Seconds(Constants.SparkStreamingBatchSizeDefault))
val myStream = StreamUtils.createStream()
myStream.transform(transformTest(_)).print()
streamingContext
}
def transformTest(rdd: RDD[String]): RDD[String] = {
rdd.map(str => helperObject.doSomething(str))
}
val ssc = StreamingContext.getOrCreate(Settings.progressDir,
createStreamingContext)
ssc.start()
while (true) {
helperObject.setData(...)
}
From what I've read in other SO posts, transformTest will be invoked on the driver program once for every batch after streaming starts. Assuming createStreamingContext is invoked (no checkpoint is available), I would expect that the instance of helperObject defined up top would be serialized out to workers once per batch, hence picking up the changes applied to it via helperObject.setData(...). Is this the case?
Now, if createStreamingContext is not invoked (a checkpoint is available), then I would expect that the instance of helperObject cannot possibly be picked up for each batch, since it can't have been captured if createStreamingContext is not executed. Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Is it possible to update helperObject throughout execution from the driver program when using checkpointing? If so, what's the best approach?
Will helperObject be serialized to each executor?
Ans: Yes.
val helperObject = Instantiate_SomeHow()
rdd.map{_.SomeFunctionUsing(helperObject)}
Spark Streaming must have serialized helperObject as part of the checkpoint, correct?
Ans: Yes.
If you wish to refresh your helperObject behaviour for each RDD operation, you can still do that by making your helperObject a bit smarter: instead of sending the helperObject directly, send a function with the signature () => helperObject_Class.
Since it is a function, it is serializable. This is a very common design pattern for shipping objects that are not serializable, e.g. a database connection object, or for your use case here.
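A minimal sketch of that factory-function idea (the class, object, and method names here are made up for illustration):
import org.apache.spark.rdd.RDD

object HelperFactorySketch {

  // Stand-in for something you do not want to ship around directly,
  // e.g. a wrapper holding a database connection.
  class HelperObject {
    def doSomething(s: String): String = s.toUpperCase
  }

  // The factory is a plain () => HelperObject function value, so the closure
  // is what gets serialized; each executor builds its own HelperObject.
  val createHelper: () => HelperObject = () => new HelperObject()

  def transformTest(rdd: RDD[String], helperFactory: () => HelperObject): RDD[String] =
    rdd.mapPartitions { iter =>
      val helper = helperFactory() // constructed on the executor, once per partition
      iter.map(str => helper.doSomething(str))
    }
}
In the streaming app above, you would pass the factory into the transform instead of closing over helperObject itself.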
A fuller, end-to-end example, adapted from the Kafka exactly-once-semantics-using-a-database approach, is given below:
package example
import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import scalikejdbc._
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkContext, SparkConf, TaskContext}
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{KafkaUtils, HasOffsetRanges, OffsetRange}
/** exactly-once semantics from kafka, by storing offsets in the same transaction as the results
Offsets and results will be stored per-batch, on the driver
*/
object TransactionalPerBatch {
def main(args: Array[String]): Unit = {
val conf = ConfigFactory.load
val kafkaParams = Map(
"metadata.broker.list" -> conf.getString("kafka.brokers")
)
val jdbcDriver = conf.getString("jdbc.driver")
val jdbcUrl = conf.getString("jdbc.url")
val jdbcUser = conf.getString("jdbc.user")
val jdbcPassword = conf.getString("jdbc.password")
val ssc = setupSsc(kafkaParams, jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)()
ssc.start()
ssc.awaitTermination()
}
def setupSsc(
kafkaParams: Map[String, String],
jdbcDriver: String,
jdbcUrl: String,
jdbcUser: String,
jdbcPassword: String
)(): StreamingContext = {
val ssc = new StreamingContext(new SparkConf, Seconds(60))
SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
// begin from the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
sql"select topic, part, off from txn_offsets".
map { resultSet =>
TopicAndPartition(resultSet.string(1), resultSet.int(2)) -> resultSet.long(3)
}.list.apply().toMap
}
val stream: InputDStream[(String,Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
ssc, kafkaParams, fromOffsets,
// we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
(mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
stream.foreachRDD { rdd =>
// Note this block is running on the driver
// Cast the rdd to an interface that lets us get an array of OffsetRange
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// simplest possible "metric", namely a count of messages per topic
// Notice the aggregation is done using spark methods, and results collected back to driver
val results = rdd.reduceByKey {
// This is the only block of code running on the executors.
// reduceByKey did a shuffle, but that's fine, we're not relying on anything special about partitioning here
_+_
}.collect
// Back to running on the driver
// localTx is transactional, if metric update or offset update fails, neither will be committed
DB.localTx { implicit session =>
// store metric results
results.foreach { pair =>
val (topic, metric) = pair
val metricRows = sql"""
update txn_data set metric = metric + ${metric}
where topic = ${topic}
""".update.apply()
if (metricRows != 1) {
throw new Exception(s"""
Got $metricRows rows affected instead of 1 when attempting to update metrics for $topic
""")
}
}
// store offsets
offsetRanges.foreach { osr =>
val offsetRows = sql"""
update txn_offsets set off = ${osr.untilOffset}
where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
""".update.apply()
if (offsetRows != 1) {
throw new Exception(s"""
Got $offsetRows rows affected instead of 1 when attempting to update offsets for
${osr.topic} ${osr.partition} ${osr.fromOffset} -> ${osr.untilOffset}
Was a partition repeated after a worker failure?
""")
}
}
}
}
ssc
}
}

Spark MiniCluster

Is it possible to create a "Spark MiniCluster" entirely programmatically to run small Spark apps from inside a Scala program? I do NOT want to start the Spark shell, but instead get a "MiniCluster" entirely fabricated in the Main of my program.
You can create an application and use the local master to start Spark in local mode:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object LocalApp {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "local-app", new SparkConf())
// Do whatever you need
sc.stop()
}
}
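If you are on Spark 2.x, a SparkSession-based sketch of the same idea (the object name is illustrative) would be:
import org.apache.spark.sql.SparkSession

object LocalSessionApp {
  def main(args: Array[String]): Unit = {
    // Same embedded local[*] master, using the Spark 2.x entry point.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("local-app")
      .getOrCreate()

    // Do whatever you need, e.g.:
    spark.range(0, 10).show()

    spark.stop()
  }
}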
You can do exactly the same thing with any supported language.

Spark: Read HBase in secured cluster

I have an easy task: I want to read HBase data in a Kerberos secured cluster.
So far I have tried two approaches:
sc.newAPIHadoopRDD(): here I don't know how to handle the Kerberos authentication
create an HBase connection from the HBase API: here I don't really know how to convert the result into RDDs
Furthermore, there seem to be some HBase-Spark connectors, but somehow I didn't really manage to find them as Maven artifacts, and/or they require a fixed structure for the result (whereas I just need the HBase Result object, since the columns in my data are not fixed).
Do you have any examples or tutorials?
I appreciate any help and hints.
Thanks in advance!
I assume that you are using Spark + Scala + HBase.
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.security.UserGroupInformation
object SparkWithMyTable {
def main(args: Array[String]) {
//Initiate spark context with spark master URL. You can modify the URL per your environment.
val sc = new SparkContext("spark://ip:port", "MyTableTest")
val tableName = "myTable"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "list of cluster ip's")
conf.set("hbase.zookeeper"+ ".property.clientPort","2181");
conf.set("hbase.master", "masterIP:60000");
conf.set("hadoop.security.authentication", "kerberos");
conf.set("hbase.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
UserGroupInformation.loginUserFromKeytab("user@---", keyTabPath);
// Add local HBase conf
// conf.addResource(new Path("file://hbase/hbase-0.94.17/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
// create my table with column family
val admin = new HBaseAdmin(conf)
if(!admin.isTableAvailable(tableName)) {
print("Creating MyTable")
val tableDesc = new HTableDescriptor(tableName)
tableDesc.addFamily(new HColumnDescriptor("cf1".getBytes()));
admin.createTable(tableDesc)
}else{
print("Table already exists!!")
val columnDesc = new HColumnDescriptor("cf1");
admin.disableTable(Bytes.toBytes(tableName));
admin.addColumn(tableName, columnDesc);
admin.enableTable(Bytes.toBytes(tableName));
}
//first put data into table
val myTable = new HTable(conf, tableName);
for (i <- 0 to 5) {
var p = new Put();
p = new Put(new String("row" + i).getBytes());
p.add("cf1".getBytes(), "column-1".getBytes(), new String(
"value " + i).getBytes());
myTable.put(p);
}
myTable.flushCommits();
//how to create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
//get the row count
val count = hBaseRDD.count()
print("HBase RDD count:"+count)
System.exit(0)
}
}
Maven Artifact
<dependency>
  <groupId>it.nerdammer.bigdata</groupId>
  <artifactId>spark-hbase-connector_2.10</artifactId>
  <!-- Version can be changed as per your Spark version; I am using Spark 1.6.x -->
  <version>1.0.3</version>
</dependency>
You can also have a look at:
Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples
scan-that-works-on-kerberos
HBaseScanRDDExample.scala
