Flume+Spark - Storing DStream in HDFS - apache-spark

I have a Flume stream that I want to store in HDFS via Spark. Below is the Spark code I am running:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePull {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        "Usage: FlumePollingEventCount <host> <port>")
      System.exit(1)
    }
    // One batch per minute
    val batchInterval = Milliseconds(60000)
    val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
    val ssc = new StreamingContext(sparkConf, batchInterval)
    // Pull events from the Flume sink listening on localhost:9999
    val stream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)
    stream.map(x => x + "!!!!")
      .saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")
    ssc.start()
    ssc.awaitTermination()
  }
}
When I start my Spark Streaming job, it does store output in HDFS, but the output looks like this:
[root@sandbox ~]# hadoop fs -cat /user/root/spark/flume_Map_-1459450380000._Mapout/part-00000
org.apache.spark.streaming.flume.SparkFlumeEvent@1b9bd2c9!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@33fd3a48!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@35fd67a2!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@f9ed85f!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@58f4cfc!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@307373e!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@4ebbc8ff!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@a8905bb!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29d73d64!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@71ff85b1!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@3ea261ef!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@16cbb209!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@17157890!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent@29e41c7!!!!
It is storing the Flume event objects instead of the data coming from Flume. How do I get the data out of them?
Thanks

You need to extract the underlying buffer from the SparkFlumeEvent and save that. For example, if your event body is a String:
stream.map(x => new String(x.event.getBody.array) + "!!!!")
.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")
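One caveat with getBody.array: it returns the whole backing array of the ByteBuffer regardless of the buffer's position and limit, and new String(bytes) decodes with the platform default charset. Below is a minimal stdlib-only sketch of a more defensive decoder. It assumes the event body is UTF-8 text; the FlumeBody object and the bodyToString name are hypothetical helpers, not part of Spark or Flume.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object FlumeBody {
  /** Decodes the readable part of a ByteBuffer as UTF-8 (assumed encoding). */
  def bodyToString(body: ByteBuffer): String = {
    // duplicate() shares the bytes but has its own position and limit,
    // so the caller's buffer is left untouched
    val view = body.duplicate()
    val bytes = new Array[Byte](view.remaining())
    view.get(bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }
}
```

In the streaming job this could be used as stream.map(x => FlumeBody.bodyToString(x.event.getBody) + "!!!!").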

Related

How to pass configuration from driver to executors in Spark?

For example.
object App {
var confValue: String = ""
def main(args: Array[String]): Unit = {
// set conf by cmd args
confValue = args.head
// do some context init
val dataset: Dataset[Int] = ???
dataset.foreach { row =>
// get conf from executor
println(confValue)
}
}
}
I want to read the conf on the executors, but this does not work because confValue has only been modified on the driver.
I know I can pass confValue to the executors through a local variable, like this:
def main(args: Array[String]): Unit = {
// set conf by cmd args
val confValue = args(0)
// do some context init
val dataset: Dataset[Int] = ???
dataset.foreach { row =>
// get conf from executor
println(confValue)
}
}
But my Spark job is huge and has many functions, so I can't pass confValue everywhere as a local variable. For example:
def main(args: Array[String]): Unit = {
// set conf by cmd args
val confValue = args(0)
// do some context init
val dataset: Dataset[Int] = ???
dataset.foreach { row =>
doSomeLogic(row)
}
}
private def doSomeLogic(row: Int): Unit = {
// get conf from executor
println(confValue)
}
There are many functions like doSomeLogic, so I can't pass confValue to all of them.
Is there a way to pass confValue to every executor automatically?
Update 1
My Spark code looks something like this:
object App {
/** env flag, will be inited by cmd args, and be used in executors */
var env: String = ""
val spark: SparkSession = ???
import spark.implicits._
def main(args: Array[String]): Unit = {
// read env from args
env = args.head
var ds: Dataset[Int] = ???
ds = doLogic1(ds)
ds = doLogic2(ds)
doLogic3(ds)
}
private def doLogic1(ds: Dataset[Int]): Dataset[Int] = {
ds.map { row =>
// env will be used here
???
}
}
private def doLogic2(ds: Dataset[Int]): Dataset[Int] = {
ds.map { row =>
// env will be used here
???
}
}
private def doLogic3(ds: Dataset[Int]): Dataset[Int] = {
ds.map { row =>
// env will be used here
???
}
}
}
env is initialized in main and is used in some of the doLogicN functions. My Spark project is large and has many doLogicN functions, so passing the env flag to every one of them would mean changing too much code.
What is the simplest way to pass the env flag to all doLogicN functions?
The hard part is that env is used on the executors. If it were only used on the driver, I could share it everywhere through a global variable. That does not work on the executors, because the global variable is only initialized on the driver side.
You could broadcast the value to all the executors and then use it however you need. Also consider foreachPartition rather than foreach when you need per-partition setup (for example opening a connection once per partition); both run in parallel across partitions.
Below is sample code showing how to broadcast a value and use it:
import org.apache.spark.sql.Row
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

// Sample data
val df = Seq(("a","2020-01-16 08:55:50"),("b","2020-01-16 08:57:37"),("c","2020-01-16 09:00:13"),("d","2020-01-16 09:01:32"),("e","2020-01-16 09:03:32"),("f","2020-01-16 09:06:56")).toDF("ID","timestamp")
// Check how many partitions the DataFrame has
df.rdd.partitions.size
// Broadcast the value from the driver
val confValue = "Test"
val bdct_confvalue = spark.sparkContext.broadcast(confValue)
// Read the broadcast value on the executors.
// The explicit parameter type avoids an overload ambiguity on Scala 2.12 and later.
df.foreachPartition { (partition: Iterator[Row]) =>
  println("Confvalue partition =" + bdct_confvalue.value)
}
To see the printed value, look at the executor logs, not the driver logs: this println runs on the executors, so it does not appear in the driver output. For the same reason you will not see it in a notebook such as Jupyter or Databricks, because those show the driver logs in the UI.
I found a way to solve my question.
The conf can be set through the Spark configuration when submitting the job, for example spark.my.env=env_1.
It can then be read with SparkEnv.get.conf.get("spark.my.env"), which returns the same value on the driver and on the executors once the SparkContext has been initialized.

How to save Spark Stream data to file

I'm new to Spark and am struggling to save the result of a Spark stream to a file after the context time has elapsed. I want a query to run for 60 seconds, save all the input it reads in that time to a single file, and let me choose the file name for later processing.
Initially I thought the code below would be the way to go:
sc.socketTextStream("localhost", 12345)
.foreachRDD(rdd -> {
rdd.saveAsTextFile("./test");
});
However, after running it I found two problems. First, it saved a separate file for each batch of input it read instead of one file with all the values from the 60-second window. (Imagine random numbers arriving at that port at a random pace: if one number arrived in a given second, the file contained one number; if more arrived, the file contained all of them.) Second, I could not name the file, because the argument to saveAsTextFile is the output directory.
So I would like to ask whether there is a Spark-native solution, so that I don't have to solve it with "Java tricks" like this:
sc.socketTextStream("localhost", 12345)
.foreachRDD(rdd -> {
PrintWriter out = new PrintWriter("./logs/votes["+dtf.format(LocalDateTime.now().minusMinutes(2))+","+dtf.format(LocalDateTime.now())+"].txt");
List<String> l = rdd.collect();
for(String voto: l)
out.println(voto + " "+dtf.format(LocalDateTime.now()));
out.close();
});
I searched the Spark documentation for similar problems but was unable to find a solution :/
Thanks for your time :)
Below is a template for consuming socket stream data with the newer Spark APIs (Structured Streaming).
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
object ReadSocket {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    // Start reading from the socket
    val dfStream = spark.readStream
      .format("socket")
      .option("host", "127.0.0.1") // Replace with your socket host
      .option("port", "9090")
      .load()
    dfStream.writeStream
      .trigger(Trigger.ProcessingTime("1 minute")) // Fire a micro-batch every minute
      .outputMode(OutputMode.Append) // Each batch holds the data that arrived since the previous trigger
      .foreachBatch((ds, id) => {
        ds.foreach(row => { // Iterate over the batch
          // Put your file generation logic here
          println(row.getString(0)) // This is your record
        })
      }).start().awaitTermination()
  }
}
For an explanation, read the inline comments in the code.
Java version
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;
public class ReadSocketJ {
public static void main(String[] args) throws StreamingQueryException {
SparkSession spark = Constant.getSparkSess();
Dataset<Row> lines = spark
.readStream()
.format("socket")
.option("host", "127.0.0.1") // Replace with your socket host
.option("port", "9090")
.load();
lines.writeStream()
.trigger(Trigger.ProcessingTime("5 seconds"))
.foreachBatch((VoidFunction2<Dataset<Row>, Long>) (v1, v2) -> {
v1.as(Encoders.STRING())
.collectAsList().forEach(System.out::println);
}).start().awaitTermination();
}
}
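Both templates leave the file generation logic open. As a rough sketch of that step, the helper below writes all lines of one window to a single file whose name encodes the window bounds. It uses only the JDK, and it assumes the batch is small enough to be collected on the driver (as the rdd.collect() in the question does). The WindowFileWriter object, the "votes_" prefix, and the timestamp pattern are hypothetical choices, not Spark APIs.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object WindowFileWriter {
  // Timestamp pattern used in file names (an arbitrary, sortable choice)
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss")

  /** Builds a file name that encodes the start and end of the window. */
  def fileName(start: LocalDateTime, end: LocalDateTime): String =
    s"votes_${fmt.format(start)}_${fmt.format(end)}.txt"

  /** Writes every line of one window to a single file and returns its path. */
  def writeWindow(dir: Path, lines: Seq[String],
                  start: LocalDateTime, end: LocalDateTime): Path = {
    Files.createDirectories(dir)
    val target = dir.resolve(fileName(start, end))
    // One record per line, each terminated by a newline
    val content = lines.map(_ + "\n").mkString
    Files.write(target, content.getBytes(StandardCharsets.UTF_8))
    target
  }
}
```

Inside foreachBatch this would be called on the driver with the collected rows of the batch and the trigger interval as the window bounds. For large batches, collecting to the driver is not advisable.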

spark 2.2 struct streaming foreach writer jdbc sink lag

I'm working on a project that uses Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. Messages flow into Kafka at about 4000-6000 per second.
With HDFS as the sink it works fine. With a foreach JDBC writer, the delay grows larger and larger over time. I think the lag is caused by the foreach loop.
The JDBC sink class (a standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
val driver = "oracle.jdbc.driver.OracleDriver"
var connection: java.sql.Connection = _
var statement: java.sql.PreparedStatement = _
val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"
def open(partitionId: Long, version: Long): Boolean = {
Class.forName(driver)
connection = java.sql.DriverManager.getConnection(url, user, pwd)
connection.setAutoCommit(false)
statement = connection.prepareStatement(v_sql)
true
}
def process(value: org.apache.spark.sql.Row): Unit = {
statement.setString(1, value(0).toString)
statement.setString(2, value(1).toString)
statement.setString(3, value(2).toString)
statement.setString(4, value(3).toString)
statement.executeUpdate()
}
def close(errorOrNull: Throwable): Unit = {
connection.commit()
connection.close
}
}
The sink part:
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "namenode:9092").option("fetch.message.max.bytes", "50000000").option("kafka.max.partition.fetch.bytes", "50000000")
.option("subscribe", "rawdb.raw_data")
.option("startingOffsets", "latest")
.load()
.select($"value".as[Array[Byte]])
.map(avroDeserialize(_))
.filter(some logic).select(some logic)
.writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
If I change the last line
.writeStream.format("csv")...
to a JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by how the foreach loop works: it does not handle rows in batches of, say, several thousand. I am also an Oracle DBA and have fine-tuned the database side; the database is mostly waiting on idle events. I already try to avoid excessive commits by setting connection.setAutoCommit(false). Any suggestion would be much appreciated.
Although I don't have an actual profile of what takes the longest in your application, I would assume it is because ForeachWriter effectively closes and re-opens your JDBC connection on each run; that is how ForeachWriter works.
Instead of using it, I would advise writing a custom Sink for JDBC in which you control how the connection is opened and closed.
There is an open pull request to add a JDBC sink to Spark, which you can look at to see one possible approach to the implementation.
I solved the problem by writing the result to another Kafka topic and then writing a second program that reads from the new topic and writes to the database in batches.
I think a future Spark release might provide a JDBC sink with a parameter for the batch size.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database. I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
if (!rdd.isEmpty) {
rdd.foreachPartition(partitionRecords => {
//get a connection from connection pool
val conn = ConnManager.getManager.getConnection
val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
try {
conn.setAutoCommit(false)
partitionRecords.foreach(record => {
insertIntoDB(ps, record)
}
)
ps.executeBatch()
conn.commit()
} catch {
case e: Exception =>{}
// do some log
} finally {
ps.close()
conn.close()
}
})
}
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the database connection several times.
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink opened and closed the connection only once per micro-batch: it processed, for example, 200 queries, closed the connection when the micro-batch ended, and stayed closed until a new micro-batch arrived.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case reads from a Kafka topic with only one partition, so I think Spark is using a single partition. I don't know whether this works the same way with multiple Spark partitions, but my conclusion is that the foreach sink processes the whole micro-batch in one go (row by row) in the process method, and does not call open() and close() for every row as many people think.
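The ForeachWriter in the question calls executeUpdate() once per row, while the workaround above ends with executeBatch(). The batching idea can be sketched with plain JDBC, independent of Spark. This is only an illustration under assumptions: JdbcBatching, writeBatched, and the batchSize parameter are hypothetical names, every column is bound as a String, and the thread does not establish whether this alone would remove the lag in the original job.

```scala
import java.sql.PreparedStatement

object JdbcBatching {
  /**
   * Binds each row to the prepared statement and sends rows to the database
   * in groups of `batchSize` instead of one statement execution per row.
   * Returns the number of executeBatch() calls that were made.
   */
  def writeBatched(ps: PreparedStatement,
                   rows: Iterator[Seq[String]],
                   batchSize: Int): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var flushes = 0
    rows.grouped(batchSize).foreach { group =>
      group.foreach { row =>
        // JDBC parameter indexes start at 1
        row.zipWithIndex.foreach { case (value, i) => ps.setString(i + 1, value) }
        ps.addBatch()
      }
      ps.executeBatch() // one execution per group of rows
      flushes += 1
    }
    flushes
  }
}
```

The caller remains responsible for opening the connection, disabling auto-commit, and committing after the batches have been sent, as the surrounding code in the thread does.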

Spark streaming: HBase connection is closed when use hbaseMapPartitions

In my Spark Streaming application I use an HBaseContext to put some values into HBase, one put operation for each processed message.
If I use hbaseForeachPartition, everything is OK.
dStream
.hbaseForeachPartition(
hbaseContext,
(iterator, connection) => {
val table = connection.getTable("namespace:table")
// putHBase is external function in the same Scala object
val results = iterator.flatMap(packet => putHBaseAndOther(packet))
table.close()
results
}
)
With hbaseMapPartition, however, the connection to HBase is already closed.
dStream
.hbaseMapPartition(
hbaseContext,
(iterator, connection) => {
val table = connection.getTable("namespace:table")
// putHBase is external function in the same Scala object
val results = iterator.flatMap(packet => putHBaseAndOther(packet))
table.close()
results
}
)
Can someone explain why?
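This thread has no answer. One possible explanation, offered as an assumption rather than something the source confirms: Scala's Iterator.flatMap and Iterator.map are lazy, so the puts run only when the returned iterator is consumed. If the library closes the connection as soon as the user function returns, a map-style operation would consume the iterator after the close. The stdlib-only sketch below reproduces that pattern with a hypothetical FakeConnection; it does not use HBase, and LazyIteratorDemo and its methods are illustrative names.

```scala
object LazyIteratorDemo {
  /** Stand-in for a connection that rejects work after close(). */
  final class FakeConnection {
    private var open = true
    def close(): Unit = { open = false }
    def put(x: Int): Int = {
      if (!open) throw new IllegalStateException("connection is closed")
      x * 2
    }
  }

  // Lazy: nothing is put until the returned iterator is consumed
  def lazyPartition(it: Iterator[Int], conn: FakeConnection): Iterator[Int] =
    it.map(conn.put)

  // Eager: toList forces every put before the function returns
  def eagerPartition(it: Iterator[Int], conn: FakeConnection): Iterator[Int] =
    it.map(conn.put).toList.iterator

  /** Mimics a framework that closes the connection right after the user function returns. */
  def runThenClose(f: (Iterator[Int], FakeConnection) => Iterator[Int]): List[Int] = {
    val conn = new FakeConnection
    val out = f(Iterator(1, 2, 3), conn)
    conn.close()
    out.toList // the lazy variant only starts calling put here
  }
}
```

If this is indeed what happens, materializing the results (for example with toList) before closing the table and returning would avoid the error, at the cost of holding one partition's results in memory.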

spark streaming checkpoint recovery is very very slow

Goal: Read from Kinesis and store data in to S3 in Parquet format via spark streaming.
Situation:
The application runs fine initially, with batches of 1 hour and an average processing time of less than 30 minutes. Suppose the application crashes for some reason and we try to restart from the checkpoint: processing now takes forever and does not move forward.
We tested the same thing with a batch interval of 1 minute. Processing runs fine and each batch takes 1.2 minutes to finish, but when we recover from the checkpoint each batch takes about 15 minutes.
Notes:
we are using s3 for checkpoints
using 1 executor, with 19g mem & 3 cores per executor
Screenshots (not included here): the first run, before checkpoint recovery; and the attempt to recover from the checkpoint.
Config.scala
object Config {
val sparkConf = new SparkConf
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
val eventsS3Path = sc.hadoopConfiguration.get("eventsS3Path")
val useIAMInstanceRole = sc.hadoopConfiguration.getBoolean("useIAMInstanceRole",true)
val checkpointDirectory = sc.hadoopConfiguration.get("checkpointDirectory")
// sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
DateTimeZone.setDefault(DateTimeZone.forID("America/Los_Angeles"))
val numStreams = 2
def getSparkContext(): SparkContext = {
this.sc
}
def getSqlContext(): HiveContext = {
this.sqlContext
}
}
S3Basin.scala
object S3Basin {
def main(args: Array[String]): Unit = {
Kinesis.startStreaming(s3basinFunction _)
}
def s3basinFunction(streams : DStream[Array[Byte]]): Unit ={
streams.foreachRDD(jsonRDDRaw =>{
println(s"Old partitions ${jsonRDDRaw.partitions.length}")
val jsonRDD = jsonRDDRaw.coalesce(10,true)
println(s"New partitions ${jsonRDD.partitions.length}")
if(!jsonRDD.isEmpty()){
val sqlContext = SQLContext.getOrCreate(jsonRDD.context)
sqlContext.read.json(jsonRDD.map(f=>{
val str = new String(f)
if(str.startsWith("{\"message\"")){
str.substring(11,str.indexOf("@version")-2)
}
else{
str
}
})).registerTempTable("events")
sqlContext.sql(
"""
|select
|to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date,
|hour(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_hour,
|*
|from events
""".stripMargin).coalesce(1).write.mode(SaveMode.Append).partitionBy("event_date", "event_hour","verb").parquet(Config.eventsS3Path)
sqlContext.dropTempTable("events")
}
})
}
}
Kinesis.scala
object Kinesis{
def functionToCreateContext(streamFunc: (DStream[Array[Byte]]) => Unit): StreamingContext = {
val streamingContext = new StreamingContext(Config.sc, Minutes(Config.sc.hadoopConfiguration.getInt("kinesis.StreamingBatchDuration",1))) // new context
streamingContext.checkpoint(Config.checkpointDirectory) // set checkpoint directory
val sc = Config.getSparkContext
var awsCredentails : BasicAWSCredentials = null
val kinesisClient = if(Config.useIAMInstanceRole){
new AmazonKinesisClient()
}
else{
awsCredentails = new BasicAWSCredentials(sc.hadoopConfiguration.get("kinesis.awsAccessKeyId"),sc.hadoopConfiguration.get("kinesis.awsSecretAccessKey"))
new AmazonKinesisClient(awsCredentails)
}
val endpointUrl = sc.hadoopConfiguration.get("kinesis.endpointUrl")
val appName = sc.hadoopConfiguration.get("kinesis.appName")
val streamName = sc.hadoopConfiguration.get("kinesis.streamName")
kinesisClient.setEndpoint(endpointUrl)
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
val batchInterval = Minutes(sc.hadoopConfiguration.getInt("kinesis.StreamingBatchDuration",1))
// Kinesis checkpoint interval is the interval at which the DynamoDB is updated with information
// on sequence number of records that have been received. Same as batchInterval for this
// example.
val kinesisCheckpointInterval = batchInterval
// Get the region name from the endpoint URL to save Kinesis Client Library metadata in
// DynamoDB of the same region as the Kinesis stream
val regionName = sc.hadoopConfiguration.get("kinesis.regionName")
val kinesisStreams = (0 until Config.numStreams).map { i =>
println(s"creating stream for $i")
if(Config.useIAMInstanceRole){
KinesisUtils.createStream(streamingContext, appName, streamName, endpointUrl, regionName,
InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}else{
KinesisUtils.createStream(streamingContext, appName, streamName, endpointUrl, regionName,
InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2,awsCredentails.getAWSAccessKeyId,awsCredentails.getAWSSecretKey)
}
}
val unionStreams = streamingContext.union(kinesisStreams)
streamFunc(unionStreams)
streamingContext
}
def startStreaming(streamFunc: (DStream[Array[Byte]]) => Unit) = {
val sc = Config.getSparkContext
if(sc.defaultParallelism < Config.numStreams+1){
throw new Exception(s"Number of shards = ${Config.numStreams} , number of processor = ${sc.defaultParallelism}")
}
val streamingContext = StreamingContext.getOrCreate(Config.checkpointDirectory, () => functionToCreateContext(streamFunc))
// sys.ShutdownHookThread {
// println("Gracefully stopping Spark Streaming Application")
// streamingContext.stop(true, true)
// println("Application stopped greacefully")
// }
//
streamingContext.start()
streamingContext.awaitTermination()
}
}
DAG (screenshot not included here)
I raised a Jira issue: https://issues.apache.org/jira/browse/SPARK-19304
The problem is that each iteration reads more data than required and then discards it. This can be avoided by adding a limit to the getResults AWS call.
Fix: https://github.com/apache/spark/pull/16842
When a failed driver is restarted, the following occurs:
Recover computation – The checkpointed information is used to restart the driver, reconstruct the contexts and restart all the receivers.
Recover block metadata – The metadata of all the blocks that will be necessary to continue the processing will be recovered.
Re-generate incomplete jobs – For the batches with processing that has not completed due to the failure, the RDDs and corresponding jobs are regenerated using the recovered block metadata.
Read the block saved in the logs – When those jobs are executed, the block data is read directly from the write ahead logs. This recovers all the necessary data that were reliably saved to the logs.
Resend unacknowledged data – The buffered data that was not saved to the log at the time of failure will be sent again by the source, as it had not been acknowledged by the receiver.
Since all of these steps are performed on the driver, even your batch of 0 events takes a long time. This should happen only with the first batch; after that things should be back to normal.
Reference here.
I had similar issues before, with my application getting slower and slower.
Try releasing memory once you are done with an RDD by calling rdd.unpersist(): https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#unpersist(boolean)
Or set spark.streaming.backpressure.enabled to true:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
http://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements
Also check your locality settings; too much data may be moving around.

Resources