Concurrent execution in Spark Streaming - apache-spark

I have a Spark Streaming job to do some aggregations on an incoming Kafka Stream and save the result in Hive. However, I have about 5 Spark SQL to be run on the incoming data, which can be run concurrently as there is no dependency in transformations among these 5 and if possible, I would like to run them in concurrent fashion without waiting for the first SQL to end. They all go to separate Hive tables. For example :
// This is the Kafka inbound stream
// Code in Consumer
val stream = KafkaUtils.createDirectStream[..](...)
val metric1= Future {
computeFuture(stream, dataframe1, countIndex)
}
val metric2= Future {
computeFuture(stream, dataframe2, countIndex)
}
val metric3= Future {
computeFirstFuture(stream, dataframe3, countIndex)
}
val metric4= Future {
computeFirstFuture(stream, dataframe4, countIndex)
}
metric1.onFailure {
case e => logger.error(s"Future failed with an .... exception", e)
}
metric2.onFailure {
case e => logger.error(s"Future failed with an .... exception", e)
}
....and so on
On doing the above, the actions in Future are appearing sequential (from Spark url interface). How can I enforce concurrent execution? Using Spark 2.0, Scala 2.11.8. Do I need to create separate spark sessions using .newSession() ?

Related

slow performance of spark

I am new to Spark. I have a requirement where I need to integrate Spark with Web Service. Any request to a Web service has to be processed using Spark and send the response back to the client.
I have created a small dummy service in Vertx, which accepts request and processes it using Spark. I am using Spark in cluster mode (1 master, 2 slaves, 8 core, 32 Gb each, running on top of Yarn and Hdfs)
public class WebServer {
private static SparkSession spark;
private static void createSparkSession(String masterUrl) {
SparkContext context = new SparkContext(new SparkConf().setAppName("Web Load Test App").setMaster(masterUrl)
.set("spark.hadoop.fs.default.name", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.defaultFS", "hdfs://x.x.x.x:9000")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName())
.set("spark.eventLog.enabled", "true")
.set("spark.eventLog.dir", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.provider", "org.apache.spark.deploy.history.FsHistoryProvider")
.set("spark.history.fs.logDirectory", "hdfs://x.x.x.x:9000/spark-logs")
.set("spark.history.fs.update.interval", "10s")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//.set("spark.kryoserializer.buffer.max", "1g")
//.set("spark.kryoserializer.buffer", "512m")
);
spark = SparkSession.builder().sparkContext(context).getOrCreate();
}
public static void main(String[] args) {
Vertx vertx = Vertx.vertx();
String masterUrl = "local[2]";
if(args.length == 1) {
masterUrl = args[0];
}
System.out.println("Master url: " + masterUrl);
createSparkSession(masterUrl);
WebServer webServer = new WebServer();
webServer.start(vertx);
}
private void start(Vertx vertx) {
int port = 19090;
Router router = Router.router(vertx);
router.route("/word_count*").handler(BodyHandler.create());
router.post("/word_count").handler(this::calWordFrequency);
vertx.createHttpServer()
.requestHandler(router::accept)
.listen(port,
result -> {
if (result.succeeded()) {
System.out.println("Server started # " + port);
} else {
System.out.println("Server failed to start # " + port);
}
});
}
private void calWordFrequency(RoutingContext routingContext) {
WordCountRequest wordCountRequest = routingContext.getBodyAsJson().mapTo(WordCountRequest.class);
List<String> words = wordCountRequest.getWords();
Dataset<String> wordsDataset = spark.createDataset(words, Encoders.STRING());
Dataset<Row> wordCounts = wordsDataset.groupBy("value").count();
List<String> result = wordCounts.toJSON().collectAsList();
routingContext.response().setStatusCode(300).putHeader("content-type", "application/json; charset=utf-8")
.end(Json.encodePrettily(result));
}
}
When I make a post request with payload size of about 5kb, then it is taking around 6 seconds to complete the request and the response back. I feel it is very slow.
However, if I carry a simple example of reading file from Hbase and performing transformation and displaying result, it is very fast. I am able to processes a file of 8Gb file in 2mins.
Eg:
logFile="/spark-logs/single-word-line.less.txt"
master_node = 'spark://x.x.x.x:7077'
spark = SparkSession.builder.master(master_node).appName('Avi-load-test').getOrCreate()
log_data = spark.read.text(logFile)
word_count = (log_data.groupBy('value').count())
print(word_count.show())
What is the reason for my application to run so slow? Any pointers would be really helpful. Thank you in advance.
Spark processing is asynchronous, you are using it as part of a synchronous flow. You can do that but can't expect the processing to be finished. We have implemented similar use case - we have a REST service which triggers a spark job. The implementation does return spark job id(takes some time). And we have another end point to get job status using the Job Id, but we didn't implement a flow to return data from spark job through REST srvc and it is not recommended.
Our flow
REST API <-> Spark Job <-> HBase/Kafka
The REST endpoint triggers a Spark Job and the Spark job reads data from HBase and does the processing and write data back to HBase and Kafka. The REST API is called by a different and they receive the data that is processed through Kafka.
I think you need to rethink your architecture.

Safe to do parallel JDBC insert inside foreachPartition using 2 scala Futures?

I currently have a Spark streaming app which inserts into MySQL database. I'm following the pattern of
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
}
that is found in the spark streaming documentation. The thing is I'd like to modify it as such,
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val f1 = Future {
//insert data to database 1
val connection = createNewConnection(db1_config)
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
}
val f2 = Future {
//insert data to database 2
val connection = createNewConnection(db2_config)
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
//await for both f1 and f2 to finish
}
and actually, I found that this works but I want to make sure I'm not shooting myself in the foot. I tested this in my app using 1 executor and 1 core and found that both futures work in parallel when I'm using the global Execution Context. So I'm assuming that the global Execution Context gave 2 threads to work on. But where do those 2 threads come from? When I submitted my Spark job to YARN I specified 1 executor and 1 core. When I do LSCPU on that machine I find that I have 2 Threads per core. So am I using those 2 threads on that core or does the global execution context give me available threads from another core? I just want to know if this is a safe pattern and not "steal" threads from other applications.

Kafka Producer Recovery Strategy when consuming a parquet file

I have a use case, where I have to read parquet files and publish the records towards a kafka topic.
I read the parquet files using :
spark.read.schema(//path to parquet file )
Then I sort this dataframe based on the timestamp (use case specific requirements to preserve order)
Finally, I do the following :
binaryFiles below is the dataframe containing the sorted records from the parquet file
binaryFiles.coalesce(1).foreachPartition(partition => {
val producer = new KafkaProducer[String,Array[Byte]](properties)
partition.foreach(file => {
try {
var producerRecord = new ProducerRecord[String,Array[Byte]](targetTopic,file.getAs[Integer](2),file.getAs[String](0),file.getAs[Array[Byte]](1))
var metadata = producer.send(producerRecord, new Callback {
override def onCompletion(recordMetadata:RecordMetadata , e:Exception):Unit = {
if (e != null){
println ("Error while producing" + e);
producer.close()
}
}
});
producer.flush()
}
catch{
case unknown :Throwable => println("Exception obtained with record. Key : " + unknown)
}
})
producer.close()
println("Closing the producer for this partition")
})
While writing the failover strategy for this scenario, one scenario that I am trying to cater to is what if the node that runs the kafka producer goes down.
Now when the kafka producer is restarted it will again read the parquet file from start and start publishing all the records again to the same topic.
How can we overcome this and implement some sort of checkpointing that spark stream provides.
PS: I cannot use spark structured streaming as that does not preserve the order of the messages

spark 2.2 struct streaming foreach writer jdbc sink lag

i'm in a project using spark 2.2 struct streaming to read kafka msg into oracle database. the message flow into kafka is about 4000-6000 messages per second .
when using hdfs file system as sink destination ,it just works fine. when using foreach jdbc writer,it will have a huge delay over time . I think the lag is caused by foreach loop .
the jdbc sink class(stand alone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
val driver = "oracle.jdbc.driver.OracleDriver"
var connection: java.sql.Connection = _
var statement: java.sql.PreparedStatement = _
val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"
def open(partitionId: Long, version: Long): Boolean = {
Class.forName(driver)
connection = java.sql.DriverManager.getConnection(url, user, pwd)
connection.setAutoCommit(false)
statement = connection.prepareStatement(v_sql)
true
}
def process(value: org.apache.spark.sql.Row): Unit = {
statement.setString(1, value(0).toString)
statement.setString(2, value(1).toString)
statement.setString(3, value(2).toString)
statement.setString(4, value(3).toString)
statement.executeUpdate()
}
def close(errorOrNull: Throwable): Unit = {
connection.commit()
connection.close
}
}
the sink part :
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "namenode:9092").option("fetch.message.max.bytes", "50000000").option("kafka.max.partition.fetch.bytes", "50000000")
.option("subscribe", "rawdb.raw_data")
.option("startingOffsets", "latest")
.load()
.select($"value".as[Array[Byte]])
.map(avroDeserialize(_))
.filter(some logic).select(some logic)
.writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
if I change the last line
.writeStream.format("csv")...
into jdbc foreach sink as following:
val url = "jdbc:oracle:thin:#(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag show up.
I guess the problem most likely caused by foreach loop mechanics-it's not in batch mode deal with like several thousands row in a batch ,as an oracle DBA either, I have fine tuned oracle database side ,mostly the database is waiting for idle events . excessive commit is trying to be avoided by setting connection.setAutoCommit(false) already,any suggestion will be much appreciate.
Although I don't have an actual profile of whats taking the longest time in your application, I would assume it is due to the fact that using ForeachWriter will effectively close and re-open your JDBC connection on each run, because that's how ForeachWriter works.
I would advise that instead of using it, write a custom Sink for JDBC where you control how the connection is opened or closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
problem solved by injecting the result into another Kafka topic , then wrote another program read from the new topic write them into database on batches .
I think in next spark release,they might provide the jdbc sink and have some parameter setting batch size .
the main code is as following :
write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
read the topic and write to databases,i'm using c3p0 connection pool
lines.foreachRDD(rdd => {
if (!rdd.isEmpty) {
rdd.foreachPartition(partitionRecords => {
//get a connection from connection pool
val conn = ConnManager.getManager.getConnection
val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
try {
conn.setAutoCommit(false)
partitionRecords.foreach(record => {
insertIntoDB(ps, record)
}
)
ps.executeBatch()
conn.commit()
} catch {
case e: Exception =>{}
// do some log
} finally {
ps.close()
conn.close()
}
})
}
})
Have you tried using a trigger?
I notice when I didn't use a trigger my Foreach Sink open and close several times the connection to the database.
writeStream.foreach(writer).start()
But when I used a trigger, the Foreach only opened and closed the connection one time, processing for example 200 queries and when the micro-batch was ended it closed the connection until a new micro batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so Spark I think is using one partition. I dont know if this solution works the same with multiple Spark partitions but my conclusion here is the Foreach process all the micro-batch at a time (row by row) in the process method and doesn't call open() and close() for every row like a lot of people think.

Unable to Implement Sequential execution of Spark Functions

In our Spark Pipeline we read messages from kafka.
JavaPairDStream<byte[],byte[]> = messagesKafkaUtils.createStream(streamingContext, byte[].class, byte[].class,DefaultDecoder.class,DefaultDecoder.class,
configMap,topic,StorageLevel.MEMORY_ONLY_SER());
We transform these messages using a map function.
JavaDStream<ProcessedData> lines=messages.map(new Function<Tuple2<byte[],byte[]>, ProcessedData>()
{
public ProcessedData call(Tuple2<byte[],byte[]> tuple2)
{
}
});
//Here ProcessedData is my message bean class.
After this we save this message into Cassandra using foreachRDD function.And then we index the same message in ElasticSearch using foreachRDD function.What we require is that first the message gets stored in cassandra and it executes successfully then only it is indexed in ElasticSearch.To achieve this we require sequential execution of Cassandra and Elastic Search functions.
We are not able to generate a JavaDStream within the foreachRDD function of Cassandra to be given as input to ElasticSearch Function.
We can successfully execute the sequential execution of Cassandra and Elastic Search functions if we use map functions inside them.But then there is no Action in our Spark Pipeline and it is not executed.
Any help will be greatly appreciated.
One way to implement this sequencing would be to put the Cassandra insert and the ElasticSearch indexing within the same task.
Roughly something like this (*):
val kafkaDStream = ???
val processedData = kafkaDStream.map(elem => ProcessData(elem))
val cassandraConnector = CassandraConnector(sparkConf)
processData.forEachRDD{rdd =>
rdd.forEachPartition{partition =>
val elasClient = ??? elasticSearch client instance
partition.foreach{elem =>
cassandraConnector.withSessionDo(session =>
session.execute("INSERT ....")
}
elasClient.index(elem) // whatever the client method is called
}
}
}
We sacrifice the capability of batching operations (done internally by the Cassandra-spark connector for example) in order to implement sequencing.
(*) The structure of the Java version of this code is very similar, just more verbose.

Resources