Spark Streaming: HBase connection is closed when using hbaseMapPartitions - apache-spark

In my Spark Streaming application I use an HBaseContext to put some values into HBase, one put operation for each processed message.
If I use hbaseForeachPartition, everything is fine:
dStream
  .hbaseForeachPartition(
    hbaseContext,
    (iterator, connection) => {
      val table = connection.getTable(TableName.valueOf("namespace:table"))
      // putHBaseAndOther is an external function in the same Scala object
      val results = iterator.flatMap(packet => putHBaseAndOther(packet))
      table.close()
      results
    }
  )
With hbaseMapPartition, instead, the connection to HBase is closed:
dStream
  .hbaseMapPartition(
    hbaseContext,
    (iterator, connection) => {
      val table = connection.getTable(TableName.valueOf("namespace:table"))
      // putHBaseAndOther is an external function in the same Scala object
      val results = iterator.flatMap(packet => putHBaseAndOther(packet))
      table.close()
      results
    }
  )
Can someone explain why?
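One hedged observation rather than a verified answer: iterator.flatMap is lazy, so with hbaseMapPartition the returned iterator may only be consumed after the user function has returned and the HBaseContext has released the connection. A minimal sketch of a possible workaround, assuming that laziness is indeed the cause, is to materialize the results while the table is still open:
import org.apache.hadoop.hbase.TableName

dStream
  .hbaseMapPartition(
    hbaseContext,
    (iterator, connection) => {
      val table = connection.getTable(TableName.valueOf("namespace:table"))
      // Materialize the results eagerly so every putHBaseAndOther call runs
      // before the table and the underlying connection are closed.
      val results = iterator.flatMap(packet => putHBaseAndOther(packet)).toList
      table.close()
      results.iterator
    }
  )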

Related

Will Spark Executor kill Lettuce thread when I'm writing a dataframe into Redis

I'm using Lettuce to insert a big Spark DataFrame into Redis. Will the executor kill itself before the Lettuce client finishes inserting into Redis? The following is the code I'm using to insert data into Redis.
First, I create a RedisConnectionProvider with a connection pool.
class RedisConnectionProvider {
  var pool: GenericObjectPool[StatefulRedisClusterConnection[String, String]] = _

  def getPool(url: String): GenericObjectPool[StatefulRedisClusterConnection[String, String]] = {
    if (pool == null) {
      val clusterClient: RedisClusterClient = RedisClusterClient.create(RedisURI.create(url, 6379))
      clusterClient.getPartitions
      val supplier: Supplier[StatefulRedisClusterConnection[String, String]] = (() => clusterClient.connect()).asJava
      val config: GenericObjectPoolConfig[StatefulRedisClusterConnection[String, String]] =
        new GenericObjectPoolConfig[StatefulRedisClusterConnection[String, String]]
      config.setMaxTotal(30)
      pool = ConnectionPoolSupport.createGenericObjectPool(supplier, config)
    }
    pool
  }
}

object RedisConnectionProvider {
  val instance = new RedisConnectionProvider()
}
Then I get the pool within each partition of my RDD and insert data with a connection borrowed from the pool. RedisConnectionProvider.instance is held in a Scala companion object, so I guess each Spark executor will only create one pool object.
val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))
rdd.foreachPartition(partitionRecords => {
  val pool = RedisConnectionProvider.instance.getPool("127.0.0.1")
  val connection = pool.borrowObject
  val commands = connection.reactive()
  commands.setAutoFlushCommands(false)
  partitionRecords.grouped(500).foreach((group: Seq[Any]) => {
    Flux.fromIterable(group.map(s => s.toString))
      .flatMap(((s: String) => { commands.hset(s, value) }).asJava)
      .subscribe()
  })
})
This code works most of the time, but I'm not sure it is robust. Could anyone please help?
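Not an authoritative answer, but the usual concern with this pattern is that subscribe() returns immediately, so the foreachPartition task can complete (and the executor can move on or shut down) while HSETs are still in flight, and the borrowed connection is never returned to the pool. A hedged sketch of one way to wait for the replies before the task ends; value is kept from the snippet above, and auto-flush is toggled on the connection rather than on the reactive commands:
import scala.collection.JavaConverters._
import scala.compat.java8.FunctionConverters._
import reactor.core.publisher.Flux

rdd.foreachPartition(partitionRecords => {
  val pool = RedisConnectionProvider.instance.getPool("127.0.0.1")
  val connection = pool.borrowObject
  val commands = connection.reactive()
  connection.setAutoFlushCommands(false)
  try {
    partitionRecords.grouped(500).foreach((group: Seq[Any]) => {
      // Subscribe eagerly via toFuture: the HSETs are queued but not yet
      // written, because auto-flush is disabled.
      val batchDone = Flux.fromIterable(group.map(_.toString).asJava)
        .flatMap(((s: String) => commands.hset(s, value)).asJava)
        .then()   // completes only once every reply has arrived
        .toFuture
      connection.flushCommands() // actually send the queued commands to Redis
      batchDone.get()            // block this task until Redis has acknowledged the batch
    })
  } finally {
    pool.returnObject(connection) // hand the connection back to the pool
  }
})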

Spark 2.2 Structured Streaming foreach writer JDBC sink lag

I'm working on a project that uses Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4,000-6,000 messages per second.
When using the HDFS file system as the sink destination, it just works fine. When using the foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv")
  .option("checkpointLocation", "/user/root/chk")
  .option("path", "/user/root/testdir")
  .start()
If I change the last line
.writeStream.format("csv")...
into a JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user"
val pwd = "password"
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I suspect the problem is caused by the foreach mechanics: it does not work in batches of, say, several thousand rows. As an Oracle DBA I have also tuned the database side, and the database is mostly waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion would be much appreciated.
Although I don't have an actual profile of what is taking the longest time in your application, I would assume it is because using ForeachWriter effectively closes and re-opens the JDBC connection on each micro-batch; that's how ForeachWriter works.
I would advise that instead of using it, you write a custom Sink for JDBC where you control how the connection is opened and closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
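For reference, a rough, untested sketch of what such a sink could look like, using the question's table and bind order. JdbcBatchSink and JdbcBatchSinkProvider are hypothetical names, the Sink trait lives in an internal Spark package (org.apache.spark.sql.execution.streaming) and may change between 2.x versions, and the micro-batch is collected to the driver for simplicity, which should be fine at a few thousand rows per trigger:
import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// One connection and one executeBatch per micro-batch instead of one
// executeUpdate per row.
class JdbcBatchSink(url: String, user: String, pwd: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    Class.forName("oracle.jdbc.driver.OracleDriver")
    val conn = DriverManager.getConnection(url, user, pwd)
    conn.setAutoCommit(false)
    val ps = conn.prepareStatement(
      "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) " +
        "values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)")
    try {
      data.collect().foreach { row =>
        ps.setString(1, row.get(0).toString)
        ps.setString(2, row.get(1).toString)
        ps.setString(3, row.get(2).toString)
        ps.setString(4, row.get(3).toString)
        ps.addBatch() // queue the row; no round trip yet
      }
      ps.executeBatch() // single round trip for the whole micro-batch
      conn.commit()
    } finally {
      ps.close()
      conn.close()
    }
  }
}

// Lets the sink be wired in through .writeStream.format(...)
class JdbcBatchSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink =
    new JdbcBatchSink(parameters("url"), parameters("user"), parameters("pwd"))
}
It would then be used with something like .writeStream.format(classOf[JdbcBatchSinkProvider].getName).option("url", url).option("user", user).option("pwd", pwd).option("checkpointLocation", "/user/root/chk").start().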
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes to the database in batches.
I think in the next Spark release they might provide a JDBC sink with a batch-size setting.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "x.x.x.x:9092")
  .option("topic", "fastdbtest")
  .option("checkpointLocation", "/user/root/chk")
  .start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {} // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink only opened and closed the connection once, processing for example 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch arrived.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know whether this solution behaves the same with multiple Spark partitions, but my conclusion here is that foreach processes the whole micro-batch at a time (row by row) in the process method, and does not call open() and close() for every row like a lot of people think.
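Building on that: since open() and close() run once per partition of each micro-batch, the per-row executeUpdate() in the question's JDBCSink can be turned into a JDBC batch that is flushed once in close(). An untested sketch of such a variant (same table and bind order as the question):
class BatchingJDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.addBatch() // queue the row instead of a round trip per row
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      statement.executeBatch() // one round trip for the whole partition
      connection.commit()
    }
    connection.close()
  }
}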

Concurrent execution in Spark Streaming

I have a Spark Streaming job that does some aggregations on an incoming Kafka stream and saves the result in Hive. However, I have about 5 Spark SQL queries to run on the incoming data. They can run concurrently, since there are no dependencies among their transformations, and if possible I would like to run them concurrently rather than waiting for the first one to finish. They all go to separate Hive tables. For example:
// This is the Kafka inbound stream
// Code in Consumer
val stream = KafkaUtils.createDirectStream[..](...)

val metric1 = Future {
  computeFuture(stream, dataframe1, countIndex)
}
val metric2 = Future {
  computeFuture(stream, dataframe2, countIndex)
}
val metric3 = Future {
  computeFirstFuture(stream, dataframe3, countIndex)
}
val metric4 = Future {
  computeFirstFuture(stream, dataframe4, countIndex)
}

metric1.onFailure {
  case e => logger.error(s"Future failed with an .... exception", e)
}
metric2.onFailure {
  case e => logger.error(s"Future failed with an .... exception", e)
}
....and so on
With the above, the actions inside the Futures appear to run sequentially (as seen in the Spark web UI). How can I enforce concurrent execution? I'm using Spark 2.0 and Scala 2.11.8. Do I need to create separate Spark sessions using .newSession()?
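Not a definitive answer, but two things usually decide whether such Futures actually overlap: the ExecutionContext they run on and the Spark scheduler mode. A hedged sketch, reusing the question's computeFuture/stream/dataframe names and assuming FIFO scheduling on a shared thread pool is what serializes the jobs:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// A dedicated pool with one thread per metric, so all five jobs can be
// submitted to Spark at the same time.
implicit val metricsEc: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(5))

// Set before the SparkSession/StreamingContext is created, so concurrently
// submitted jobs share executors instead of queueing behind each other:
// sparkConf.set("spark.scheduler.mode", "FAIR")

val metric1 = Future { computeFuture(stream, dataframe1, countIndex) }
val metric2 = Future { computeFuture(stream, dataframe2, countIndex) }
// ... metric3, metric4 and the onFailure handlers as in the question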

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

I have Spark code where the code inside the call method queries a memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This call is made from inside the call method. This works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the Spark execution time is reduced?
Thank you.
You can use one connection per partition, like this:
rdd.foreachPartition { records =>
  val connection = DB.createConnection()
  // you can use your connection instance inside foreach
  records.foreach { r =>
    val externalData = connection.read(r.externalId)
    // do something with your data
  }
  DB.save(records)
  connection.close()
}
If you use Spark Streaming:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = DB.createConnection()
    // you can use your connection instance inside foreach
    records.foreach { r =>
      val externalData = connection.read(r.externalId)
      // do something with your data
    }
    DB.save(records)
    connection.close()
  }
}
See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
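The guide linked above additionally recommends a static, lazily initialized pool of connections per executor, so connections are reused across tasks and batches instead of being created once per partition. A hedged sketch of that pattern, assuming a memSQL (MySQL-wire-compatible) endpoint with the MySQL JDBC driver and Apache commons-dbcp2 on the classpath; host, credentials and the per-record work are placeholders:
import java.sql.Connection
import org.apache.commons.dbcp2.BasicDataSource

// Created lazily, once per executor JVM, and shared by all tasks on it.
object MemSqlPool {
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:mysql://memsql-host:3306/mydb") // placeholder host/database
    ds.setUsername("user")
    ds.setPassword("password")
    ds.setMaxTotal(16) // cap concurrent connections per executor
    ds
  }
  def getConnection: Connection = dataSource.getConnection
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = MemSqlPool.getConnection // borrow from the executor-local pool
    try {
      records.foreach { r =>
        // read from / write to memSQL with plain JDBC here,
        // e.g. one PreparedStatement prepared per partition
      }
    } finally {
      connection.close() // with a pooled DataSource this returns the connection to the pool
    }
  }
}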

Unable to Implement Sequential execution of Spark Functions

In our Spark pipeline we read messages from Kafka.
JavaPairDStream<byte[], byte[]> messages = KafkaUtils.createStream(streamingContext, byte[].class, byte[].class,
    DefaultDecoder.class, DefaultDecoder.class, configMap, topic, StorageLevel.MEMORY_ONLY_SER());
We transform these messages using a map function.
JavaDStream<ProcessedData> lines = messages.map(new Function<Tuple2<byte[], byte[]>, ProcessedData>() {
    public ProcessedData call(Tuple2<byte[], byte[]> tuple2) {
    }
});
// Here ProcessedData is my message bean class.
After this we save the message into Cassandra using a foreachRDD function, and then we index the same message in Elasticsearch using another foreachRDD function. What we require is that the message is first stored in Cassandra and that step completes successfully; only then should it be indexed in Elasticsearch. To achieve this we need sequential execution of the Cassandra and Elasticsearch functions.
We are not able to generate a JavaDStream inside the Cassandra foreachRDD function that could be passed as input to the Elasticsearch function.
We can achieve sequential execution of the Cassandra and Elasticsearch functions if we use map functions inside them, but then there is no action in our Spark pipeline and nothing is executed.
Any help will be greatly appreciated.
One way to implement this sequencing would be to put the Cassandra insert and the ElasticSearch indexing within the same task.
Roughly something like this (*):
val kafkaDStream = ???
val processedData = kafkaDStream.map(elem => ProcessData(elem))
val cassandraConnector = CassandraConnector(sparkConf)

processedData.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val elasClient = ??? // elasticSearch client instance
    partition.foreach { elem =>
      cassandraConnector.withSessionDo(session =>
        session.execute("INSERT ....")
      )
      elasClient.index(elem) // whatever the client method is called
    }
  }
}
We sacrifice the capability of batching operations (done internally by the Cassandra-spark connector for example) in order to implement sequencing.
(*) The structure of the Java version of this code is very similar, just more verbose.
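If per-partition ordering is enough (every Cassandra write in a partition finishes before any Elasticsearch indexing for that partition starts), some batching can be recovered. A hedged variation of the sketch above, keeping the same placeholder client:
processedData.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val elems = partition.toList // materialize once, reuse for both phases
    cassandraConnector.withSessionDo { session =>
      elems.foreach(elem => session.execute("INSERT ....")) // phase 1: Cassandra
    }
    val elasClient = ??? // elasticSearch client instance, as above
    elems.foreach(elem => elasClient.index(elem)) // phase 2: Elasticsearch
  }
}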
