Invoking a Java function after each micro-batch in Spark Streaming - apache-spark

The simple program below reads from a Kafka stream and writes to a CSV file every 5 minutes using Spark Structured Streaming. Is there a way I can invoke a Java function after each micro-batch in the "DRIVER PROGRAM" (not in the executors)?
I agree it's not good practice to call arbitrary code in the stream, but this is a special case where we have low-volume data. Please advise. Thanks.
public static void main(String[] args) throws Exception {
    if (args.length == 0)
        throw new Exception("Usage program configFilename");
    String configFilename = args[0];
    addShutdownHook();
    ConfigLoader.loadConfig(configFilename);
    sparkSession = SparkSession
            .builder()
            .appName(TestKafka.class.getName())
            .master(ConfigLoader.getValue("master")).getOrCreate();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel(ConfigLoader.getValue("logLevel"));
    SQLContext sqlCtx = sparkSession.sqlContext();
    System.out.println("Spark context established");
    DataStreamReader kafkaDataStreamReader = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", ConfigLoader.getValue("brokers"))
            .option("group.id", ConfigLoader.getValue("groupId"))
            .option("subscribe", ConfigLoader.getValue("topics"))
            .option("failOnDataLoss", false);
    Dataset<Row> rawDataSet = kafkaDataStreamReader.load();
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("rawEventView1");
    rawDataSet = rawDataSet.withColumn("rawEventValue", rawDataSet.col("value").cast("string"));
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("eventView1");
    sqlCtx.sql("select * from eventView1")
            .writeStream()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "~")
            .option("checkpointLocation", ConfigLoader.getValue("checkpointPath"))
            .option("path", ConfigLoader.getValue("recordsPath"))
            .outputMode(OutputMode.Append())
            .trigger(ProcessingTime.create(Integer.parseInt(ConfigLoader.getValue("kafkaProcessingTime")),
                    TimeUnit.SECONDS))
            .start()
            .awaitTermination();
}

You should be able to achieve this with something like this:
kafkaDataStreamReader.map{value -> mySideEffect(); value}
This will call the function mySideEffect every time a micro-batch is received from Kafka. However, I would not recommend doing this; a better approach would be to watch the folder where you persist the CSV files, or simply to check the web UI. Considering that a micro-batch happens every few seconds at most, you would be swamped by emails. If you want to make sure the streaming application is up, you can query the Spark REST API every few seconds and check that it's still running:
https://spark.apache.org/docs/latest/monitoring.html
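For that last option, a minimal Scala sketch of polling the monitoring REST API could look like the following; the host, the port (4040 is the default driver UI port), and the 10-second interval are assumptions to adjust for your deployment:

import scala.io.Source
import scala.util.Try

// Sketch only: poll the Spark monitoring REST API and report whether the driver
// still answers. Host, port and polling interval are assumptions for this example.
object StreamingHealthCheck {

  def isApplicationUp(host: String = "localhost", port: Int = 4040): Boolean =
    Try(Source.fromURL(s"http://$host:$port/api/v1/applications").mkString)
      .map(_.contains("\"id\""))   // any listed application means the driver is alive
      .getOrElse(false)            // connection refused / timeout => assume it is down

  def main(args: Array[String]): Unit = {
    while (true) {
      if (!isApplicationUp()) println("Streaming application appears to be down")
      Thread.sleep(10000)          // check every 10 seconds
    }
  }
}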

Related

Run a function after spark stream start() (process)

I'm running a Spark application which receives input every 3 minutes; after processing, the output may go to 2 different folders depending on the result of the processing.
In one case, I have to send an email after the stream has been processed in order to confirm that it has finished, but Spark ignores the function (my function sends an email) that I put between start() and awaitTermination() in order to send an email after each stream...
Code below:
val completeExternalFile = dfToWrite
  .coalesce(1)
  .writeStream
  .format("csv")
  .outputMode("append")
  .option("checkpointLocation", destFilePathFileNameMetadata + checkpointSuffix)
  .option("compression", "uncompressed")
  .option("header", "true")
  .option("path", destFilePathFileNameEnrichi)
  .trigger(Trigger.ProcessingTime(20.seconds))
  .start()

// -- here I call my function...

val completeExternalFile1 = completeExternalFile.awaitTermination()
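For context, start() does not block: it returns a StreamingQuery immediately and the stream runs in the background, so whatever sits between start() and awaitTermination() executes exactly once at startup rather than after each micro-batch. A minimal annotated sketch, reusing the names from the snippet above (sendEmail() is a placeholder for the asker's email function):

val query = completeExternalFile   // the StreamingQuery returned by start() above

// This point is reached as soon as the query has started; a hypothetical
// sendEmail() placed here therefore runs once, not once per micro-batch.
sendEmail()

query.awaitTermination()           // only this call blocks, until the query stops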

Spark context keeps stopping when trying to start a stream that is subscribed to an instance of cloud karafka

I'm trying to subscribe to a Kafka topic that's located in the cloud (CloudKarafka). I want to write my stream to the console to test whether I'm consuming the messages. However, when I start my writeStream it just keeps stopping my SparkContext. I'm not sure if the connection is the problem or my code is the problem.
I have consumed from this topic before with Apache Flink and then it was working fine. One thing I noticed is that when I was connecting with Flink instead of Spark, I would use the option("bootstrap.servers",...) instead of the mandatory ("kafka.bootstrap.servers",...); does this have something to do with it?
My service:
private static SparkSession spark;

public SparkService(SparkSession sparkSes) {
    spark = sparkSes;
}

public void ConsumeSpark() {
    Dataset<Row> dataset = spark
            .readStream()
            .format("kafka")
            .option("security.protocol", "SASL_SSL")
            .option("sasl.mechanism", "SCRAM-SHA-256")
            .option("sasl.jaas.config", "org.apache.kafka.common.security.scram.ScramLoginModule required username=name password=pw;")
            .option("group.id", "name-spark")
            .option("kafka.bootstrap.servers", brokers)
            .option("subscribe", "name-default")
            .load();
    dataset.writeStream().format("console").outputMode("append").start();
}
Main:
SparkService s = new SparkService( SparkSession
.builder()
.appName("pleasework")
.config("spark.master", "local[*]")
.getOrCreate());
I expect that my records, when they get consumed, just get printed to the console.
Instead, it stops my SparkContext.
Logs:
19/08/17 12:12:36 INFO MicroBatchExecution: Starting [id = 86b7262c-f316-461b-abcc-3fb8e639d597, runId = 4881453b-530a-4093-a535-7528e86243ab]. Use file:/C:/Users/Sam/IdeaProjects/spark/checkpoints to store the query checkpoint.
19/08/17 12:12:36 INFO SparkContext: Invoking stop() from shutdown hook
19/08/17 12:12:36 INFO MicroBatchExecution: Using MicroBatchReader [KafkaV2[Subscribe[name-default]]] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider@2ffca6f]
19/08/17 12:12:36 ERROR MicroBatchExecution: Query [id = 86b7262c-f316-461b-abcc-3fb8e639d597, runId = 4881453b-530a-4093-a535-7528e86243ab] terminated with error
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
Main.main(Main.java:9)
The currently active SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
Main.main(Main.java:9)
at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:100)
at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:91)
at org.apache.spark.sql.SparkSession.cloneSession(SparkSession.scala:256)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:268)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Thanks in advance for looking at this!
I needed to add "kafka." before all my options, otherwise they don't change the consumer configuration:
.option("kafka.security.protocol", "SASL_SSL")
I also hadn't written the await-termination call:
spark.streams().awaitAnyTermination();
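A minimal sketch of the corrected stream definition (shown in Scala; the option keys are the same from Java, and the brokers, credentials and topic below are placeholders):

// Sketch only: every Kafka consumer property is prefixed with "kafka." so that
// Spark passes it through to the underlying consumer. Spark's Kafka source manages
// the consumer group itself, so group.id is omitted here.
val brokers = "host1:9092,host2:9092"   // placeholder broker list

val dataset = spark
  .readStream
  .format("kafka")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-256")
  .option("kafka.sasl.jaas.config",
    "org.apache.kafka.common.security.scram.ScramLoginModule required username=name password=pw;")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", "name-default")
  .load()

dataset.writeStream.format("console").outputMode("append").start()
spark.streams.awaitAnyTermination()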

spark 2.2 struct streaming foreach writer jdbc sink lag

I'm in a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it works fine. When using the foreach JDBC writer, there is a huge delay that grows over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv")
  .option("checkpointLocation", "/user/root/chk")
  .option("path", "/user/root/testdir")
  .start()
If I change the last line
.writeStream.format("csv")...
into the JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:#(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach loop mechanics: it doesn't deal with rows in batch mode, say several thousand rows per batch. As an Oracle DBA, I have also fine-tuned the Oracle database side; mostly the database is just waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion will be much appreciated.
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is because using ForeachWriter will effectively close and re-open your JDBC connection on each run; that's how ForeachWriter works.
I would advise that, instead of using it, you write a custom Sink for JDBC where you control how the connection is opened and closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at for a possible approach to the implementation.
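A rough sketch of that custom-Sink idea, using Spark 2.x's internal Sink and StreamSinkProvider interfaces (these are not stable public APIs, and the class names and the collect-then-insert approach below are illustrative assumptions rather than the answer's actual code):

import java.sql.DriverManager

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Sketch only: one JDBC connection per micro-batch, with addBatch/executeBatch
// instead of one executeUpdate per row. Error handling, retries and exactly-once
// bookkeeping (tracking batchId) are omitted.
class JdbcBatchSink(url: String, user: String, pwd: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val rows = data.collect()                      // assumes modest micro-batch sizes
    val conn = DriverManager.getConnection(url, user, pwd)
    try {
      conn.setAutoCommit(false)
      val ps = conn.prepareStatement(
        "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)")
      rows.foreach { row =>
        (0 until 4).foreach(i => ps.setString(i + 1, row(i).toString))
        ps.addBatch()
      }
      ps.executeBatch()
      conn.commit()
    } finally {
      conn.close()
    }
  }
}

class JdbcBatchSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink =
    new JdbcBatchSink(parameters("url"), parameters("user"), parameters("pwd"))
}

Such a provider would then be selected with .writeStream.format(classOf[JdbcBatchSinkProvider].getName) plus the url/user/pwd options.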
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes the records into the database in batches.
I think in the next Spark release they might provide a JDBC sink with a parameter for setting the batch size.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement(
        "insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {}
        // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
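The insertIntoDB helper isn't shown in the post; a plausible sketch, assuming each record carries the four fields in the order of the insert statement's placeholders, would be:

import java.sql.PreparedStatement

// Hypothetical helper (not from the original post): bind the four fields and queue
// the row for the batched insert that ps.executeBatch() runs later. The record type
// (a 4-tuple of strings) and the field order are assumptions.
def insertIntoDB(ps: PreparedStatement, record: (String, String, String, String)): Unit = {
  ps.setString(1, record._1)   // ENTITYID
  ps.setString(2, record._2)   // CLIENTMAC
  ps.setString(3, record._3)   // STIME
  ps.setString(4, record._4)   // FLAG
  ps.addBatch()
}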
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my Foreach Sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the Foreach only opened and closed the connection one time, processing, for example, 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so Spark, I think, is using one partition. I don't know if this solution works the same with multiple Spark partitions, but my conclusion here is that Foreach processes the whole micro-batch at a time (row by row) in the process method and doesn't call open() and close() for every row, as a lot of people think.

Getting the error in Apache Spark: "No tasks have started yet"

I am a beginner at Apache Spark and I'm running it in standalone mode. The task dataframe.count() is hanging.
SparkConf conf = new SparkConf();
conf.set("spark.driver.allowMultipleContexts", "true");
conf.set("spark.executor.memory", "10g");
conf.set("spark.driver.maxResultSize", "10g");
conf.set("spark.driver.memory", "10g");
// Initialize SparkContext
DataFrame dt = // load data from Redshift
JavaRDD<String> rdd = sc.textFile(url);
JavaPairRDD<String, String> pairRdd = rdd.mapToPair(SparkFunctionsImpl
        .strToMap());
// dt.count()
// pairRdd => map => collectAsMap()
The Spark job hangs at count() and collectAsMap() and doesn't proceed from there.
It looks like rdd.collectAsMap() and dataframe.count() are executing in parallel, and Spark is hanging with neither task proceeding.

Flume+Spark - Storing DStream in HDFS

I have a Flume stream which I want to store in HDFS via Spark. Below is the Spark code that I am running:
object FlumePull {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }

    val batchInterval = Milliseconds(60000)
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)
    val stream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)

    stream.map(x => x + "!!!!")
      .saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")

    ssc.start()
    ssc.awaitTermination()
  }
}
When I start my Spark streaming job, it does store output in HDFS, but the output is something like this:
[root@sandbox ~]# hadoop fs -cat /user/root/spark/flume_Map_-1459450380000._Mapout/part-00000
org.apache.spark.streaming.flume.SparkFlumeEvent@1b9bd2c9!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@33fd3a48!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@35fd67a2!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@f9ed85f!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@58f4cfc!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@307373e!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@4ebbc8ff!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@a8905bb!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29d73d64!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@71ff85b1!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@3ea261ef!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@16cbb209!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@17157890!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29e41c7!!!!
It is storing the Flume event object instead of the data coming from Flume. How do I get the data out of it?
Thanks
You need to extract the underlying buffer from the SparkFlumeEvent and save that. For example, if your event body is a String:
stream.map(x => new String(x.event.getBody.array) + "!!!!")
.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")
