In our Spark pipeline we read messages from Kafka.
JavaPairDStream<byte[],byte[]> messages = KafkaUtils.createStream(streamingContext, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class,
        configMap, topic, StorageLevel.MEMORY_ONLY_SER());
We transform these messages using a map function.
JavaDStream<ProcessedData> lines = messages.map(new Function<Tuple2<byte[],byte[]>, ProcessedData>()
{
    public ProcessedData call(Tuple2<byte[],byte[]> tuple2)
    {
    }
});
// Here ProcessedData is my message bean class.
After this we save the message into Cassandra using the foreachRDD function, and then we index the same message in ElasticSearch, again using foreachRDD. What we require is that the message is first stored in Cassandra, and only if that succeeds is it indexed in ElasticSearch. To achieve this we need the Cassandra and ElasticSearch functions to execute sequentially.
We are not able to generate a JavaDStream within the Cassandra foreachRDD function to pass as input to the ElasticSearch function.
We can get the Cassandra and ElasticSearch functions to run sequentially if we use map functions inside them, but then there is no action in our Spark pipeline, so it is never executed.
Any help will be greatly appreciated.
One way to implement this sequencing would be to put the Cassandra insert and the ElasticSearch indexing within the same task.
Roughly something like this (*):
val kafkaDStream = ???
val processedData = kafkaDStream.map(elem => ProcessData(elem))
val cassandraConnector = CassandraConnector(sparkConf)
processedData.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val elasClient = ??? // ElasticSearch client instance
    partition.foreach { elem =>
      cassandraConnector.withSessionDo { session =>
        session.execute("INSERT ....")
      }
      elasClient.index(elem) // whatever the client method is called
    }
  }
}
We sacrifice the capability of batching operations (done internally by the Cassandra-spark connector for example) in order to implement sequencing.
(*) The structure of the Java version of this code is very similar, just more verbose.
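The per-element ordering can also be sketched without Spark. In the snippet below, Store, Index and writeThenIndex are hypothetical names made up for illustration (they are not Cassandra or ElasticSearch APIs); the point is only that indexing is attempted after the insert has returned without throwing, and is skipped otherwise.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for the Cassandra session and the ElasticSearch client.
trait Store[A] { def insert(elem: A): Unit }
trait Index[A] { def index(elem: A): Unit }

// Handles one partition: an element is indexed only if its insert succeeded.
// Returns the number of elements whose insert failed.
def writeThenIndex[A](partition: Iterator[A], store: Store[A], search: Index[A]): Int = {
  var failures = 0
  partition.foreach { elem =>
    Try(store.insert(elem)) match {
      case Success(_) => search.index(elem)
      case Failure(_) => failures += 1 // skip indexing; log or retry here
    }
  }
  failures
}

// Small usage example with an insert that fails for one element.
val indexed = ListBuffer.empty[Int]
val flakyStore = new Store[Int] {
  def insert(elem: Int): Unit =
    if (elem == 2) throw new RuntimeException("insert failed")
}
val search = new Index[Int] {
  def index(elem: Int): Unit = { indexed += elem; () }
}
val failed = writeThenIndex(Iterator(1, 2, 3), flakyStore, search)
```

Inside a real job, a function shaped like writeThenIndex would be what runs in foreachPartition, with the real clients created inside the partition.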
Examples borrowed from the Internet, thanks to those with better insights.
The following can be found on various forums in relation to mapPartitions and map:
... Consider the case of Initializing a database. If we are using map() or
foreach(), the number of times we would need to initialize will be equal to
the no of elements in RDD. Whereas if we use mapPartitions(), the no of times
we would need to initialize would be equal to number of Partitions ...
Then there is this response:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /*creates a db connection per partition*/
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      })
    connection.close()
    newPartition
  })
So, having read discussions on various items pertaining to this, my questions are:
Whilst I can understand the performance improvement using mapPartitions in general, why would, according to the first snippet of text, the database connection be called every time for each element of an RDD using map? I can't seem to find the right reason.
The same thing does not happen with sc.textFile ... and reading into dataframes from jdbc connections. Or does it? I would be very surprised if this was so.
What am I missing...?
First of all, this code is not correct. While it looks like an adaptation of the established pattern for foreachPartition, it cannot be used with mapPartitions like this.
Remember that the function passed to mapPartitions takes an Iterator[_] and returns an Iterator[_], and Iterator.map is lazy, so this code closes the connection before it is actually used.
To use some form of resource that is initialized in mapPartitions, you'll have to design your code in a way that doesn't require explicit resource release.
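The laziness problem can be reproduced without Spark. The following is a minimal plain-Scala sketch: FakeConnection is a hypothetical stand-in for DbConnection, and lazyVersion only mimics the shape of the function passed to mapPartitions in the question.

```scala
import scala.util.Try

// Hypothetical stand-in for a database connection; not a real library class.
final class FakeConnection {
  private var open = true
  def lookup(id: Int): String = {
    if (!open) throw new IllegalStateException("connection already closed")
    s"row-$id"
  }
  def close(): Unit = { open = false }
}

// Same shape as the function passed to mapPartitions in the question.
def lazyVersion(partition: Iterator[Int]): Iterator[String] = {
  val connection = new FakeConnection
  val mapped = partition.map(id => connection.lookup(id)) // nothing is looked up yet
  connection.close()                                       // runs before any lookup
  mapped
}

// The lookups only run when toList pulls on the iterator, after close().
val result = Try(lazyVersion(Iterator(1, 2, 3)).toList)
```

Here result is a Failure wrapping the IllegalStateException, which is the same ordering problem the mapPartitions snippet has with a real connection.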
the first snippet of text, the database connection be called every time for each element of an RDD using map? I can't seem to find the right reason.
Without the snippet in question the answer must be generic: map or foreach are not designed to handle external state. With the API shown in your question you'd have to write:
rdd.map(record => readMatchingFromDB(record, new DbConnection))
which obviously creates a connection for each element.
It is not impossible to use, for example, a singleton connection pool, doing something similar to:
object Pool {
  lazy val pool = ???
}
rdd.map(record => readMatchingFromDB(record, Pool.pool.getConnection))
but it is not always easy to do it right (think about thread safety). And because connections and similar objects cannot, in general, be serialized, we cannot just use closures.
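As a rough illustration of the singleton idea, the plain-Scala sketch below uses a hypothetical ConnectionHolder object in which a String stands in for a real connection or pool. A Scala object is initialized once per JVM, and a lazy val is initialized on first access, so the initializer runs once no matter how many records go through the map.

```scala
// Hypothetical holder; a String stands in for a real connection or pool.
object ConnectionHolder {
  var initCount = 0 // only here to observe how often the initializer runs
  lazy val connection: String = {
    initCount += 1
    "connection"
  }
}

// Every element touches the holder, but the initializer runs only once.
val results = (1 to 5).map(i => s"${ConnectionHolder.connection}-$i")
```

This says nothing about whether the shared object is safe to use from several tasks at once; that part still depends on the actual connection or pool class.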
In contrast, the foreachPartition pattern is both explicit and simple.
It is of course possible to force eager execution to make things work, for example:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /*creates a db connection per partition*/
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      }).toList
    connection.close()
    newPartition.toIterator
  })
but it is of course risky and can actually decrease performance, since the whole partition is materialized in memory before anything is returned.
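A possible middle ground, which is not from the original answer and is offered only as a sketch, is to keep the iterator lazy and release the resource once the iterator reports that it is exhausted. closingIterator below is a hypothetical helper, and a counter stands in for connection.close().

```scala
// Wraps an iterator and runs `close` once, when the underlying iterator
// first reports that it has no more elements.
def closingIterator[A](underlying: Iterator[A])(close: () => Unit): Iterator[A] =
  new Iterator[A] {
    private var closed = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !closed) {
        closed = true
        close()
      }
      more
    }
    def next(): A = underlying.next()
  }

var closeCalls = 0
val it = closingIterator(Iterator(1, 2, 3).map(_ * 10))(() => closeCalls += 1)
val closedBeforeUse = closeCalls // still 0: nothing has been consumed yet
val out = it.toList              // consuming the iterator triggers close()
```

The caveat is that close is never called if the consumer stops early (for example after take(n)) or if a record throws, so this is not a complete replacement for try/finally.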
The same thing does not happen with sc.textFile ... and reading into dataframes from jdbc connections. Or does it?
Both operate using a much lower-level API, but of course resources are not initialized for each record.
In my opinion, connection should be kept out and created just once before map and closed post task completion.
val connection = new DbConnection /*creates a db connection per partition*/
val newRd = myRdd.mapPartitions(
  partition => {
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      })
    newPartition
  })
connection.close()
I'm on a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it works fine. When using a foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (stand-alone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092").option("fetch.message.max.bytes", "50000000").option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
If I change the last line
.writeStream.format("csv")...
into a JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user"
val pwd = "password"
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach mechanics: rows are processed one at a time rather than in batches of, say, several thousand rows. As I am also an Oracle DBA, I have fine-tuned the database side, and the database is mostly waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion will be much appreciated.
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is because ForeachWriter will effectively close and re-open your JDBC connection on each run, since that's how ForeachWriter works.
I would advise that, instead of using it, you write a custom Sink for JDBC where you control how the connection is opened or closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
Problem solved by writing the result into another Kafka topic, then writing another program that reads from the new topic and writes the records into the database in batches.
I think that in the next Spark release they might provide a JDBC sink with a parameter for setting the batch size.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {}
        // do some log
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
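The batching idea itself can be sketched independently of Spark and JDBC. BatchBuffer below is a hypothetical helper, not part of any library: it collects rows and hands them to a flush function every batchSize rows, plus once more on close for the remainder. In a JDBC setting the flush function would presumably be where addBatch, executeBatch and commit are called, but that wiring is an assumption and is not shown.

```scala
import scala.collection.mutable.ArrayBuffer

// Collects rows and passes them to `flush` in groups of at most `batchSize`.
final class BatchBuffer[A](batchSize: Int)(flush: Seq[A] => Unit) {
  require(batchSize > 0, "batchSize must be positive")
  private val pending = ArrayBuffer.empty[A]

  def add(row: A): Unit = {
    pending += row
    if (pending.size >= batchSize) flushPending()
  }

  // Flushes whatever is left; call once after the last row.
  def close(): Unit = if (pending.nonEmpty) flushPending()

  private def flushPending(): Unit = {
    flush(pending.toList)
    pending.clear()
  }
}

// Usage: record each flushed batch instead of writing to a database.
val flushed = ArrayBuffer.empty[Seq[Int]]
val buffer = new BatchBuffer[Int](batchSize = 3)(batch => { flushed += batch; () })
(1 to 7).foreach(buffer.add)
buffer.close()
```

With seven rows and a batch size of three, the flush function is called three times: twice with full batches and once with the single remaining row.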
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my Foreach sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the Foreach sink opened and closed the connection only once: it processed, for example, 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know if this solution works the same with multiple Spark partitions, but my conclusion is that Foreach processes the whole micro-batch at once (row by row) in the process method and doesn't call open() and close() for every row, as a lot of people think.
For the code below, does .count() return the value back to the driver or only to the executor?
JavaPairDStream<String, String> stream ...
stream.foreachRDD(rdd -> {
    long count = rdd.count();
    // some code to save count to Datastore
});
I know count() usually returns the value to the driver, but I'm not sure what happens when it's inside foreachRDD.
For other related questions in the future, is there an easy way to verify whether a code block executes on the driver or on an executor?
Operations that give access to an RDD, such as transform(rdd => ...) and foreachRDD(rdd => ...), execute in the context of the driver. The confusing twist is that operations on that RDD will execute on the executors in the cluster.
For example:
stream.foreachRDD(rdd -> {
    long count = rdd.count(); // the count is executed on the cluster, the result is brought back to the driver, like in core Spark
    RDD<> richer = rdd.map(elem => something(elem)) // executes distributed
    db.store(richer.top(10)) // executes in the driver
});
I have a Spark Streaming job that does some aggregations on an incoming Kafka stream and saves the results in Hive. However, I have about 5 Spark SQL queries to run on the incoming data. They can run concurrently, as there is no dependency among their transformations, and if possible I would like to run them concurrently without waiting for the first query to finish. They all write to separate Hive tables. For example:
// This is the Kafka inbound stream
// Code in Consumer
val stream = KafkaUtils.createDirectStream[..](...)
val metric1 = Future {
  computeFuture(stream, dataframe1, countIndex)
}
val metric2 = Future {
  computeFuture(stream, dataframe2, countIndex)
}
val metric3 = Future {
  computeFirstFuture(stream, dataframe3, countIndex)
}
val metric4 = Future {
  computeFirstFuture(stream, dataframe4, countIndex)
}
metric1.onFailure {
  case e => logger.error(s"Future failed with an .... exception", e)
}
metric2.onFailure {
  case e => logger.error(s"Future failed with an .... exception", e)
}
....and so on
With the above, the actions in the Futures appear to run sequentially (according to the Spark web UI). How can I enforce concurrent execution? I am using Spark 2.0 and Scala 2.11.8. Do I need to create separate Spark sessions using .newSession()?
I have Spark code where the code inside the call method reads from a table in a MemSQL database. My code opens a new connection object each time and closes it after the task is done. This works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the execution time is reduced?
Thank You.
You can use one connection per partition, like this:
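rdd.foreachPartition { records =>
  val connection = DB.createConnection()
  // records is an Iterator and can be traversed only once,
  // so materialize it before both reading from it and saving it
  val batch = records.toList
  // you can use your connection instance inside foreach
  batch.foreach { r =>
    val externalData = connection.read(r.externalId)
    // do something with your data
  }
  DB.save(batch)
  connection.close()
}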
If you use Spark Streaming:
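dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = DB.createConnection()
    // records is an Iterator and can be traversed only once,
    // so materialize it before both reading from it and saving it
    val batch = records.toList
    // you can use your connection instance inside foreach
    batch.foreach { r =>
      val externalData = connection.read(r.externalId)
      // do something with your data
    }
    DB.save(batch)
    connection.close()
  }
}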
See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams