SPARK Cost of Initializing Database Connection in map / mapPartitions context - apache-spark

Examples borrowed from Internet, thanks to those with better insights.
The following can be found on various forums in relation to mapPartitions and map:
... Consider the case of Initializing a database. If we are using map() or
foreach(), the number of times we would need to initialize will be equal to
the no of elements in RDD. Whereas if we use mapPartitions(), the no of times
we would need to initialize would be equal to number of Partitions ...
Then there is this response:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /*creates a db connection per partition*/
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      })
    connection.close()
    newPartition
  })
So, having read various discussions on this topic, my questions are:
While I can understand the performance improvement from using mapPartitions in general, why, according to the first snippet of text, would the database connection be initialized for every element of an RDD when using map? I can't seem to find the reason.
The same thing does not happen with sc.textFile ... or when reading into DataFrames from JDBC connections. Or does it? I would be very surprised if it did.
What am I missing...?

First of all, this code is not correct. While it looks like an adaptation of the established foreachPartition pattern, it cannot be used with mapPartitions like this.
Remember that mapPartitions takes an Iterator[_] and returns an Iterator[_], and Iterator.map is lazy, so this code closes the connection before it is actually used.
To use a resource that is initialized in mapPartitions, you'll have to design your code in a way that doesn't require explicit resource release.
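The laziness problem can be reproduced without Spark. The following is a minimal sketch in plain Scala; FakeConnection and process are hypothetical stand-ins for DbConnection and the function passed to mapPartitions, not part of any real API.

```scala
// Plain Scala, no Spark required: Iterator.map is lazy, just like the
// iterator that mapPartitions hands to the user function.
object LazyCloseDemo {
  // Hypothetical stand-in for a real database connection.
  final class FakeConnection {
    private var open = true
    def lookup(record: Int): String = {
      if (!open) throw new IllegalStateException("connection already closed")
      s"row-$record"
    }
    def close(): Unit = { open = false }
  }

  // Mirrors the body passed to mapPartitions in the question.
  def process(partition: Iterator[Int]): Iterator[String] = {
    val connection = new FakeConnection
    // Only builds a lazy view; no lookup has been executed yet.
    val mapped = partition.map(record => connection.lookup(record))
    connection.close() // runs before any lookup
    mapped
  }

  def main(args: Array[String]): Unit = {
    // The lookups run only when the iterator is consumed, after close().
    val result = scala.util.Try(process(Iterator(1, 2, 3)).toList)
    println(s"consuming the partition failed: ${result.isFailure}")
  }
}
```

Consuming the returned iterator is what triggers the lookups, and by then the connection is already closed.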
the first snippet of text, the database connection be called every time for each element of an RDD using map? I can't seem to find the right reason.
Without the snippet in question the answer must be generic: map and foreach are not designed to handle external state. With the API shown in your question you'd have to write:
rdd.map(record => readMatchingFromDB(record, new DbConnection))
which obviously creates a connection for each element.
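To make the difference in the number of initializations concrete, here is a small sketch in plain Scala. It assumes a hypothetical CountingConnection that only counts how often it is constructed, and it models an RDD as two in-memory "partitions"; nothing here is Spark API.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCountDemo {
  // Hypothetical connection that records every construction in `counter`.
  final class CountingConnection(counter: AtomicInteger) {
    counter.incrementAndGet()
    def lookup(record: Int): Int = record * 2
  }

  // Two "partitions" of three records each stand in for an RDD.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  // map-style: the function sees one record at a time,
  // so the only place to create the connection is per record.
  def perRecord(): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach(_.foreach(r => new CountingConnection(opened).lookup(r)))
    opened.get()
  }

  // mapPartitions-style: the function sees the whole partition iterator,
  // so one connection can serve every record in it.
  def perPartition(): Int = {
    val opened = new AtomicInteger(0)
    partitions.foreach { p =>
      val connection = new CountingConnection(opened)
      p.iterator.map(r => connection.lookup(r)).foreach(_ => ())
    }
    opened.get()
  }

  def main(args: Array[String]): Unit = {
    println(s"per record: ${perRecord()}, per partition: ${perPartition()}")
  }
}
```

With six records in two partitions, the first variant constructs six connections and the second constructs two.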
It is not impossible to use, for example, a singleton connection pool, doing something similar to:
object Pool {
  lazy val pool = ???
}
rdd.map(record => readMatchingFromDB(record, Pool.pool.getConnection))
but it is not always easy to do it right (think about thread safety). And because connections and similar objects cannot, in general, be serialized, we cannot just use closures.
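The singleton idea can be sketched in plain Scala as follows. FakePool and its getConnection are hypothetical placeholders for a real pool; the point is only that a lazy val inside an object is initialized once per JVM (so, in Spark, once per executor), on first use, and is never serialized with the closure.

```scala
import java.util.concurrent.atomic.AtomicInteger

object PoolDemo {
  // Counts how many pools have been constructed in this JVM.
  val created = new AtomicInteger(0)

  // Hypothetical pool; a real one would hand out pooled connections.
  final class FakePool {
    created.incrementAndGet()
    def getConnection: String = "connection"
  }

  // The singleton: initialized lazily, on first access, once per JVM.
  object Pool {
    lazy val pool: FakePool = new FakePool
  }

  // Stands in for rdd.map(record => readMatchingFromDB(record, Pool.pool.getConnection)).
  def run(records: Seq[Int]): Seq[String] =
    records.map(r => s"$r via ${Pool.pool.getConnection}")

  def main(args: Array[String]): Unit = {
    run(Seq(1, 2, 3))
    println(s"pools created: ${created.get()}")
  }
}
```

Scala makes the initialization of the lazy val itself thread-safe, but the pool and the connections it returns still have to be safe to use from several task threads at once.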
In contrast foreachPartition pattern is both explicit and simple.
It is of course possible to force eager execution to make things work, for example:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /*creates a db connection per partition*/
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      }).toList // forces every lookup to run while the connection is still open
    connection.close()
    newPartition.iterator
  })
but this is risky, because the whole partition has to fit in memory at once, and it can actually decrease performance.
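One alternative to materializing the partition is to keep the iterator lazy and release the resource when the iterator is exhausted. The sketch below is plain Scala under stated assumptions: CleanupIterator and FakeConnection are hypothetical helpers written for this example, not Spark classes.

```scala
object CloseOnExhaustDemo {
  // Hypothetical stand-in for a real database connection.
  final class FakeConnection {
    var closed = false
    def lookup(record: Int): String = {
      if (closed) throw new IllegalStateException("connection already closed")
      s"row-$record"
    }
    def close(): Unit = { closed = true }
  }

  // Wraps an iterator and runs `cleanup` once, the first time hasNext returns false.
  final class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit)
      extends Iterator[A] {
    private var done = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !done) { done = true; cleanup() }
      more
    }
    def next(): A = underlying.next()
  }

  // Body one could pass to mapPartitions: stays lazy, closes after the last record.
  def process(partition: Iterator[Int], connection: FakeConnection): Iterator[String] =
    new CleanupIterator(partition.map(r => connection.lookup(r)), () => connection.close())

  def main(args: Array[String]): Unit = {
    val connection = new FakeConnection
    val rows = process(Iterator(1, 2, 3), connection).toList
    println(s"rows: $rows, closed afterwards: ${connection.closed}")
  }
}
```

This only cleans up when the iterator is consumed to the end. If the consumer stops early or a lookup throws, the connection stays open; inside a real Spark task, registering the close with TaskContext.get().addTaskCompletionListener is a way to cover those cases.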
The same thing does not happen with sc.textFile ... and reading into dataframes from jdbc connections. Or does it?
Both operate using a much lower-level API, but of course resources are not initialized for each record.

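In my opinion, the connection should be kept outside and created just once before the map, then closed after the task completes.
val connection = new DbConnection /* created once, outside mapPartitions */
val newRd = myRdd.mapPartitions(
  partition => {
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      })
    newPartition
  })
connection.close()
Note, however, that the closure is serialized and shipped to the executors, so this only works if DbConnection is serializable, and because mapPartitions is lazy the connection must not be closed until an action has run.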

Related

Spark Structured Streaming redis sink perform not desirable

I've used Spark Structured Streaming to consume Kafka messages and save the data to Redis. I save the data with a Redis sink written by extending ForeachWriter[org.apache.spark.sql.Row]. The code runs well, but only a little more than 100 records are saved to Redis per second. Is there a better way to speed this up? The code below connects to and disconnects from the Redis server on every micro-batch; is there any way to connect just once and keep the connection open, to minimize the connection cost, which I suppose is the main cause of the time taken?
I tried to broadcast jedis, but neither Jedis nor JedisPool is serializable, so it didn't work.
My sink code is below:
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class StreamDataSink extends ForeachWriter[Row] {
  var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    if (null == jedis) {
      jedis = FPCRedisUtils.getPool.getResource
    }
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      jedis.select(Constants.REDIS_DATABASE_INDEX)
      if (jedis.exists("counter")) {
        jedis.incr("counter")
      } else {
        jedis.set("counter", 1.toString)
      }
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (null != jedis) {
      jedis.close()
      jedis.disconnect()
    }
  }
}
Any suggestions will be appreciated.
Don't do jedis.disconnect(). This will actually close the socket, forcing a new connection next time around. Use only jedis.close(), it will return the connection to the pool.
When you call INCR on a non-existing key, the key is created automatically, set to zero and then incremented, resulting in a new key with value 1.
This simplifies your if-else to simply jedis.incr("counter").
With this you have:
jedis.select(Constants.REDIS_DATABASE_INDEX)
jedis.incr("counter")
Review if you really need the SELECT. This is per connection and all connections default to DB 0. If all workloads sharing the same jedis pool are using DB 0, there is no need to call select.
If you do need both select and incr, then pipeline them:
val pipelined = jedis.pipelined()
pipelined.select(Constants.REDIS_DATABASE_INDEX)
pipelined.incr("counter")
pipelined.sync()
This will send the two commands in one network message, further improving your performance.

Does doing rdd.count() inside foreachRDD return results to Driver or Executor?

For the code below, does the .count() return the value back to the driver or only to the executor?
JavaPairDStream<String, String> stream = ...;
stream.foreachRDD(rdd -> {
    long count = rdd.count();
    // some code to save count to Datastore
});
I know count() usually returns the value to the driver, but I'm not sure what happens when it's inside foreachRDD.
For other related questions in the future, is there an easy way to verify whether a code block executes on the driver or on an executor?
Operations that give access to an RDD, such as transform(rdd => ...) and foreachRDD(rdd => ...), execute in the context of the driver. The confusing twist is that operations on that RDD execute on the executors in the cluster.
For example:
stream.foreachRDD(rdd -> {
    long count = rdd.count(); // the count is executed on the cluster, the result is brought back to the driver, like in core Spark
    var richer = rdd.map(elem -> something(elem)); // executes distributed
    db.store(richer.top(10)); // executes in the driver
});

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

I have Spark code in which the call method queries a memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the execution time is reduced?
Thank You.
You can use one connection per partition, like this:
rdd.foreachPartition { records =>
  val connection = DB.createConnection()
  try {
    // records is a one-pass Iterator, so do all per-record work inside this loop
    records.foreach { r =>
      val externalData = connection.read(r.externalId)
      // do something with your data, e.g. save it through the same connection
    }
  } finally {
    connection.close() // always release the connection, even if a record fails
  }
}
If you use Spark Streaming:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = DB.createConnection()
    try {
      // records is a one-pass Iterator, so do all per-record work inside this loop
      records.foreach { r =>
        val externalData = connection.read(r.externalId)
        // do something with your data, e.g. save it through the same connection
      }
    } finally {
      connection.close() // always release the connection, even if a record fails
    }
  }
}
See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

Node calling postgres function with temp tables causing "memory leak"

I have a node.js program calling a Postgres (Amazon RDS micro instance) function, get_jobs within a transaction, 18 times a second using the node-postgres package by brianc.
The node code is just an enhanced version of brianc's basic client pooling example, roughly like...
var pg = require('pg');
var conString = "postgres://username:password@server/database";

function getJobs(cb) {
    pg.connect(conString, function(err, client, done) {
        if (err) return console.error('error fetching client from pool', err);
        client.query("BEGIN;");
        client.query('select * from get_jobs()', [], function(err, result) {
            client.query("COMMIT;");
            done(); //call `done()` to release the client back to the pool
            if (err) console.error('error running query', err);
            cb(err, result);
        });
    });
}

function poll() {
    getJobs(function(err, jobs) {
        // process the jobs
    });
    setTimeout(poll, 55);
}

poll(); // start polling
So Postgres is getting:
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: BEGIN;
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: execute <unnamed>: select * from get_jobs();
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX@XXX:[5778]:LOG: statement: COMMIT;
... repeated every 55ms.
get_jobs is written with temp tables, something like this
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
    ...
) AS
$BODY$
DECLARE
    _nowstamp bigint;
BEGIN
    -- take the current unix server time in ms
    _nowstamp := (select extract(epoch from now()) * 1000)::bigint;

    -- 1. get the jobs that are due
    CREATE TEMP TABLE jobs ON COMMIT DROP AS
    select ...
    from really_big_table_1
    where job_time < _nowstamp;

    -- 2. get other stuff attached to those jobs
    CREATE TEMP TABLE jobs_extra ON COMMIT DROP AS
    select ...
    from really_big_table_2 r
    inner join jobs j on r.id = j.some_id;

    ALTER TABLE jobs_extra ADD PRIMARY KEY (id);

    -- 3. return the final result with a join to a third big table
    RETURN query (
        select je.id, ...
        from jobs_extra je
        left join really_big_table_3 r on je.id = r.id
        group by je.id
    );
END
$BODY$ LANGUAGE plpgsql VOLATILE;
I've used the temp table pattern because I know that jobs will always be a small extract of rows from really_big_table_1, in hopes that this will scale better than a single query with multiple joins and multiple where conditions. (I used this to great effect with SQL Server and I don't trust any query optimiser now, but please tell me if this is the wrong approach for Postgres!)
The query runs in 8ms on small tables (as measured from node), ample time to complete one job "poll" before the next one starts.
Problem: After about 3 hours of polling at this rate, the Postgres server runs out of memory and crashes.
What I tried already...
If I re-write the function without temp tables, Postgres doesn't run out of memory, but I use the temp table pattern a lot, so this isn't a solution.
If I stop the node program (which kills the 10 connections it uses to run the queries) the memory frees up. Merely making node wait a minute between polling sessions doesn't have the same effect, so there are obviously resources that the Postgres backend associated with the pooled connection is keeping.
If I run a VACUUM while polling is going on, it has no effect on memory consumption and the server continues on its way to death.
Reducing the polling frequency only changes the amount of time before the server dies.
Adding DISCARD ALL; after each COMMIT; has no effect.
Explicitly calling DROP TABLE jobs; DROP TABLE jobs_extra; after RETURN query () instead of ON COMMIT DROPs on the CREATE TABLEs. Server still crashes.
Per CFrei's suggestion, added pg.defaults.poolSize = 0 to the node code in an attempt to disable pooling. The server still crashed, but took much longer and swap went much higher (second spike) than all the previous tests which looked like the first spike below. I found out later that pg.defaults.poolSize = 0 may not disable pooling as expected.
On the basis of this: "Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze operations should be performed via session SQL commands.", I tried to run a VACUUM from the node server (as an attempt to make VACUUM an "in session" command). I couldn't actually get this test working. I have many objects in my database, and VACUUM, operating on all objects, was taking too long to execute on each job iteration. Restricting VACUUM just to the temp tables was impossible: (a) you can't run VACUUM in a transaction and (b) outside the transaction the temp tables don't exist. :P EDIT: Later on the Postgres IRC forum, a helpful chap explained that VACUUM isn't relevant for temp tables themselves, but can be useful to clean up the rows that temp tables cause to be created and deleted in pg_attribute. In any case, VACUUMing "in session" wasn't the answer.
DROP TABLE IF EXISTS ... before the CREATE TABLE, instead of ON COMMIT DROP. Server still dies.
CREATE TEMP TABLE (...) and insert into ... (select...) instead of CREATE TEMP TABLE ... AS, instead of ON COMMIT DROP. Server dies.
So is ON COMMIT DROP not releasing all the associated resources? What else could be holding memory? How do I release it?
I used this to great effect with SQL Server and I don't trust any query optimiser now
Then don't use them. You can still execute queries directly, as shown below.
but please tell me if this is the wrong approach for Postgres!
It is not a completely wrong approach, it's just a very awkward one, as you are trying to create something that's been implemented by others for a much easier use. As a result, you are making many mistakes that can lead to many problems, including memory leaks.
Compare to the simplicity of the exact same example that uses pg-promise:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);

function getJobs() {
    return db.tx(function (t) {
        return t.func('get_jobs');
    });
}

function poll() {
    getJobs()
        .then(function (jobs) {
            // process the jobs
        })
        .catch(function (error) {
            // error
        });
    setTimeout(poll, 55);
}

poll(); // start polling
It gets even simpler when using ES6 syntax:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);

function poll() {
    db.tx(t => t.func('get_jobs'))
        .then(jobs => {
            // process the jobs
        })
        .catch(error => {
            // error
        });
    setTimeout(poll, 55);
}

poll(); // start polling
The only thing that I didn't quite understand in your example is the use of a transaction to execute a single SELECT. This is not what transactions are generally for, as you are not changing any data. I assume you were trying to shrink a real piece of code that also changes some data.
In case you don't need a transaction, your code can be further reduced to:
var pgp = require('pg-promise')();
var conString = "postgres://username:password@server/database";
var db = pgp(conString);

function poll() {
    db.func('get_jobs')
        .then(jobs => {
            // process the jobs
        })
        .catch(error => {
            // error
        });
    setTimeout(poll, 55);
}

poll(); // start polling
UPDATE
However, it is dangerous to schedule the next poll without waiting for the previous request to finish, as overlapping requests may also create memory/connection issues.
A safe approach should be:
function poll() {
    db.tx(t => t.func('get_jobs'))
        .then(jobs => {
            // process the jobs
            setTimeout(poll, 55); // schedule the next poll only after this one has finished
        })
        .catch(error => {
            // error
            setTimeout(poll, 55); // keep polling after a failure
        });
}
Use CTEs to create partial result sets instead of temp tables.
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
    ...
) AS
$BODY$
DECLARE
    _nowstamp bigint;
BEGIN
    -- take the current unix server time in ms
    _nowstamp := (select extract(epoch from now()) * 1000)::bigint;

    RETURN query (
        -- 1. get the jobs that are due
        WITH jobs AS (
            select ...
            from really_big_table_1
            where job_time < _nowstamp
        -- 2. get other stuff attached to those jobs
        ), jobs_extra AS (
            select ...
            from really_big_table_2 r
            inner join jobs j on r.id = j.some_id
        )
        -- 3. return the final result with a join to a third big table
        select je.id, ...
        from jobs_extra je
        left join really_big_table_3 r on je.id = r.id
        group by je.id
    );
END
$BODY$ LANGUAGE plpgsql VOLATILE;
The planner evaluates each CTE block in sequence, which is what I wanted to achieve with temp tables. (This holds for PostgreSQL versions before 12; from version 12 on, a side-effect-free CTE that is referenced only once may be inlined unless it is declared AS MATERIALIZED.)
I know this doesn't directly solve the memory leak with temp tables (I'm pretty sure there's something wrong with Postgres' implementation of them, at least in the way it manifests on the RDS configuration).
However, the query works, it is planned the way I intended, memory usage has been stable after 3 days of running the job, and my server doesn't crash.
I didn't change the node code at all.

Unable to Implement Sequential execution of Spark Functions

In our Spark Pipeline we read messages from kafka.
JavaPairDStream<byte[], byte[]> messages = KafkaUtils.createStream(streamingContext, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class,
    configMap, topic, StorageLevel.MEMORY_ONLY_SER());
We transform these messages using a map function.
JavaDStream<ProcessedData> lines = messages.map(new Function<Tuple2<byte[], byte[]>, ProcessedData>() {
    public ProcessedData call(Tuple2<byte[], byte[]> tuple2) {
    }
});
// Here ProcessedData is my message bean class.
After this we save the message into Cassandra using a foreachRDD function, and then we index the same message in ElasticSearch using another foreachRDD function. We require that the message is first stored in Cassandra and, only if that succeeds, indexed in ElasticSearch. To achieve this we need sequential execution of the Cassandra and ElasticSearch functions.
We are not able to generate a JavaDStream within the Cassandra foreachRDD function to pass as input to the ElasticSearch function.
We can execute the Cassandra and ElasticSearch functions sequentially if we use map functions inside them, but then there is no action in our Spark pipeline and it is not executed.
Any help will be greatly appreciated.
One way to implement this sequencing would be to put the Cassandra insert and the ElasticSearch indexing within the same task.
Roughly something like this (*):
val kafkaDStream = ???
val processedData = kafkaDStream.map(elem => ProcessData(elem))
val cassandraConnector = CassandraConnector(sparkConf)
processedData.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val elasClient = ??? // ElasticSearch client instance
    partition.foreach { elem =>
      // runs first; an exception here prevents the indexing below
      cassandraConnector.withSessionDo { session =>
        session.execute("INSERT ....")
      }
      elasClient.index(elem) // whatever the client method is called
    }
  }
}
We sacrifice the capability of batching operations (done internally by the Cassandra-spark connector for example) in order to implement sequencing.
(*) The structure of the Java version of this code is very similar, just more verbose.
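The per-element sequencing itself can be illustrated without Spark, Cassandra or ElasticSearch. In the following plain Scala sketch, storeThenIndex is a hypothetical helper, and the store and index functions passed to it are stand-ins for the Cassandra insert and the ElasticSearch indexing call.

```scala
import scala.util.Try

object SequencedWriteDemo {
  // Runs `store` first and calls `index` only if `store` did not throw.
  // A failure in either step is returned as a Failure instead of being thrown.
  def storeThenIndex[A](elem: A, store: A => Unit, index: A => Unit): Try[Unit] =
    Try(store(elem)).map(_ => index(elem))

  def main(args: Array[String]): Unit = {
    val ok = storeThenIndex[String](
      "msg-1",
      e => println(s"stored $e"),
      e => println(s"indexed $e"))
    val failed = storeThenIndex[String](
      "msg-2",
      _ => throw new RuntimeException("insert failed"),
      e => println(s"indexed $e")) // never reached for msg-2
    println(s"first succeeded: ${ok.isSuccess}, second failed: ${failed.isFailure}")
  }
}
```

Returning a Try rather than letting the exception escape is a design choice for this sketch: it lets the caller decide per element whether to log, retry, or fail the whole task.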
