Spark Structured Streaming Redis sink performance is not desirable - apache-spark

I've used Spark Structured Streaming to consume Kafka messages and save the data to Redis. By extending ForeachWriter[org.apache.spark.sql.Row], I used a Redis sink to save the data. The code runs well, but only a little more than 100 records are saved to Redis per second. Is there a better way to speed this up? Code like the snippet below connects to and disconnects from the Redis server on every micro-batch. Is there any way to connect just once and keep the connection open, to minimize the connection cost, which I suppose is the main source of the time spent?
I tried broadcasting jedis, but neither Jedis nor JedisPool is serializable, so it didn't work.
My sink code is below:
class StreamDataSink extends ForeachWriter[org.apache.spark.sql.Row] {

  var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    if (null == jedis) {
      jedis = FPCRedisUtils.getPool.getResource
    }
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      jedis.select(Constants.REDIS_DATABASE_INDEX)
      if (jedis.exists("counter")) {
        jedis.incr("counter")
      } else {
        jedis.set("counter", 1.toString)
      }
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (null != jedis) {
      jedis.close()
      jedis.disconnect()
    }
  }
}
Any suggestions will be appreciated.

Don't call jedis.disconnect(). That actually closes the socket, forcing a new connection next time around. Use only jedis.close(); it returns the connection to the pool.
When you call INCR on a non-existent key, it is created automatically, defaults to zero, and is then incremented, resulting in a new key with value 1.
This simplifies your if-else to simply jedis.incr("counter").
With this you have:
jedis.select(Constants.REDIS_DATABASE_INDEX)
jedis.incr("counter")
Review if you really need the SELECT. This is per connection and all connections default to DB 0. If all workloads sharing the same jedis pool are using DB 0, there is no need to call select.
If you do need both select and incr, then pipeline them:
val pipelined = jedis.pipelined()
pipelined.select(Constants.REDIS_DATABASE_INDEX)
pipelined.incr("counter")
pipelined.sync()
This will send the two commands in one network message, further improving your performance.
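Putting these pieces together, the sink's hot path can be reduced to a single INCR per matching record. Below is a rough sketch of what that could look like; it keeps your FPCRedisUtils pool and Constants, and does the SELECT once per borrowed connection in open rather than once per record (adjust if the pool is shared with workloads on other DBs):
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class StreamDataSink extends ForeachWriter[Row] {

  private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // borrow one pooled connection per partition/epoch
    jedis = FPCRedisUtils.getPool.getResource
    // SELECT once here instead of once per record (drop it if you only use DB 0)
    jedis.select(Constants.REDIS_DATABASE_INDEX)
    true
  }

  override def process(record: Row): Unit = {
    // INCR creates the key with value 1 if it does not exist yet
    if (0 == record(3)) jedis.incr("counter")
  }

  override def close(errorOrNull: Throwable): Unit = {
    // close() returns the connection to the pool; no disconnect()
    if (jedis != null) jedis.close()
  }
}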

Related

Synchronize multiple requests to database in NestJS

In our NestJS application we are using TypeORM as the ORM to work with db tables, together with the typeorm-transactional-cls-hooked library.
We now have a problem with synchronization of requests which read and modify the database at the same time.
Sample:
@Transactional()
async doMagicAndIncreaseCount(id) {
  const { currentCount } = await this.fooRepository.findOne(id)
  // do some stuff where I receive a new count which I need to add to the current one, for instance 10
  const newCount = currentCount + 10
  await this.fooRepository.update(id, { currentCount: newCount })
}
When we execute this operation from the frontend multiple times concurrently, the final count is wrong. The first transaction reads currentCount and starts its computation; while it is computing, the second transaction starts and reads currentCount as well; the first transaction finishes and saves the new currentCount; then the second transaction finishes and overwrites the result of the first.
Our goal is to execute this operation on the foo table only once at a time; other requests should wait until it completes.
I tried setting the SERIALIZABLE isolation level like this:
@Transactional({ isolationLevel: IsolationLevel.SERIALIZABLE })
which ensures that only one request is executed at a time, but the other requests fail with an error. Can you please give me some advice on how to solve this?
I have never used TypeORM, and you haven't said which DB engine you are using.
In any case, to achieve this you need write locks.
The doMagicAndIncreaseCount pseudocode should be something like
BEGIN TRANSACTION
ACQUIRE WRITE LOCK ON id
READ id RECORD
do computation
SAVE RECORD
CLOSE TRANSACTION
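If it helps to see this outside of TypeORM, here is a minimal JDBC-style sketch in Scala of the same pseudocode, assuming PostgreSQL (as the tags suggest) and a table foo with columns id and current_count; all names are illustrative only:
import java.sql.Connection

def doMagicAndIncreaseCount(conn: Connection, id: Long, delta: Int): Unit = {
  conn.setAutoCommit(false) // BEGIN TRANSACTION
  try {
    // SELECT ... FOR UPDATE acquires a row-level write lock on id
    val select = conn.prepareStatement("SELECT current_count FROM foo WHERE id = ? FOR UPDATE")
    select.setLong(1, id)
    val rs = select.executeQuery()
    rs.next()
    val newCount = rs.getInt("current_count") + delta // do computation

    val update = conn.prepareStatement("UPDATE foo SET current_count = ? WHERE id = ?")
    update.setInt(1, newCount)
    update.setLong(2, id)
    update.executeUpdate() // SAVE RECORD

    conn.commit() // CLOSE TRANSACTION
  } catch {
    case e: Exception => conn.rollback(); throw e
  }
}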
Alternatively, you can use some operation which is natively atomic on the DB engine, e.g. the INCR operation in Redis.
Edit:
Reading the TypeORM find documentation, I can suggest something like:
this.fooRepository.findOne({
  where: { id },
  lock: { mode: "pessimistic_write" },
})
P.S. Looking at the tags of the question I would guess the used DB engine is PostgreSQL.

SPARK Cost of Initializing Database Connection in map / mapPartitions context

Examples borrowed from Internet, thanks to those with better insights.
The following can be found on various forums in relation to mapPartitions and map:
... Consider the case of Initializing a database. If we are using map() or
foreach(), the number of times we would need to initialize will be equal to
the no of elements in RDD. Whereas if we use mapPartitions(), the no of times
we would need to initialize would be equal to number of Partitions ...
Then there is this response:
val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection // creates a db connection per partition
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  })
  connection.close()
  newPartition
})
So, my questions are after having read discussions on various items pertaining to this:
Whilst I can understand the performance improvement of using mapPartitions in general, why would, according to the first snippet of text, the database connection be initialized for each element of an RDD when using map? I can't seem to find the right reason.
The same thing does not happen with sc.textFile ... and reading into DataFrames via JDBC connections. Or does it? I would be very surprised if it did.
What am I missing...?
First of all this code is not correct. While it looks like an adaptation of the established pattern for foreachPartition it cannot be used with mapPartitions like this.
Remember that mapPartitions takes an Iterator[_] and returns an Iterator[_], and Iterator.map is lazy, so this code closes the connection before it is actually used.
To use a resource that is initialized in mapPartitions, you'll have to design your code in a way that doesn't require explicit resource release.
why would, according to the first snippet of text, the database connection be initialized for each element of an RDD when using map? I can't seem to find the right reason.
Without the snippet in question the answer must be generic - map or foreach are not designed to handle external state. With the API shown in your question you'd have to:
rdd.map(record => readMatchingFromDB(record, new DbConnection))
which in an obvious way creates a connection for each element.
It is not impossible to use, for example, a singleton connection pool, doing something similar to:
object Pool {
  lazy val pool = ???
}
rdd.map(record => readMatchingFromDB(record, Pool.pool.getConnection))
but it is not always easy to do it right (think about thread safety). And because connections and similar objects cannot in general be serialized, we cannot just use closures.
In contrast, the foreachPartition pattern is both explicit and simple.
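For reference, the established foreachPartition pattern referred to above looks roughly like this (DbConnection and writeRecord are placeholders for whatever client and write call you actually use):
rdd.foreachPartition { partition =>
  val connection = new DbConnection // one connection per partition
  try {
    partition.foreach { record =>
      writeRecord(record, connection) // side effect only, nothing is returned
    }
  } finally {
    connection.close() // safe here: the iterator has been fully consumed above
  }
}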
It is of course possible to force eager execution to make things work, for example:
val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection // creates a db connection per partition
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // forces eager evaluation while the connection is still open
  connection.close()
  newPartition.toIterator
})
but it is of course risky and can actually decrease performance.
The same thing does not happen with sc.textFile ... and reading into DataFrames via JDBC connections. Or does it?
Both operate using much lower-level APIs, but of course resources are not initialized for each record.
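For example, a JDBC DataFrame read lets Spark open and close the connections internally, per partition rather than per record (spark is the active SparkSession; URL, table and credentials below are placeholders):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .load()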
In my opinion, the connection should be kept outside, created just once before mapPartitions, and closed after the task completes.
val connection = new DbConnection // creates a single db connection up front
val newRd = myRdd.mapPartitions(partition => {
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  })
  newPartition
})
connection.close()

Spark : How to make calls to database using foreachPartition

We have a Spark streaming job writing data to Amazon DynamoDB using foreachRDD, but it is very slow: our consumption rate is 10,000/sec, and writing 10,000 records takes 35 minutes. This is the code piece:
tempRequestsWithState is a DStream.
tempRequestsWithState.foreachRDD { rdd =>
  if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty())) {
    rdd.foreachPartition {
      case (topicsTableName, hashKeyTemp, attributeValueUpdate) => {
        val client = new AmazonDynamoDBClient()
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println("Error executing updateItem!\nTable ", se)
        }
      }
      case null =>
    }
  }
}
From research I learnt that using foreachPartition and creating a connection per partition will help, but I am not sure how to go about writing the code for it. I will greatly appreciate it if someone can help with this. Also, any other suggestion to speed up writing is greatly appreciated.
Each partition is handled by an executor (a JVM process), so inside it you can write code that initializes the db connection and writes to the database. In the code given, the body of the first case is where you write that code. As you get more partitions, with multiple executors writing to the db, this will be done in parallel.
rdd.foreachPartition { partition =>
  // initialize database cnx
  // write to db
  // close connection
}
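Concretely, for your DynamoDB job the skeleton could look roughly like the sketch below. It assumes, as in your pattern match, that the DStream carries (topicsTableName, hashKeyTemp, attributeValueUpdate) tuples; adjust to your actual record type:
tempRequestsWithState.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreachPartition { partition =>
      // one client per partition instead of one per record
      val client = new AmazonDynamoDBClient()
      partition.foreach { case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case e: Exception => System.err.println(s"Error executing updateItem on $topicsTableName: ${e.getMessage}")
        }
      }
      client.shutdown() // release the client once the partition is done
    }
  }
}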
It is better to use a single partition to write to the db, and a singleton to initialize the connection, in order to decrease the number of db connections; inside the foreachPartition function, write with batches to increase the number of rows inserted per call.
rdd.repartition(1).foreachPartition { partition =>
  // get singleton instance cnx
  // write with batches to db
  // close connection
}
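A hedged sketch of the batching idea: grouped is the standard Scala iterator method, while writeBatch stands in for whatever batch call your client offers (for DynamoDB, BatchWriteItem accepts at most 25 items per request and only supports puts/deletes):
rdd.repartition(1).foreachPartition { partition =>
  val client = new AmazonDynamoDBClient() // or the singleton suggested above
  partition.grouped(25).foreach { batch =>
    // one batch request per group instead of one request per record
    writeBatch(client, batch) // placeholder for your batch write call
  }
  client.shutdown()
}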

Getting Data from EventHub is delayed

I have an EventHub configured in Azure, and a consumer group for reading the data. It was working fine for some days. Suddenly I see a delay in the incoming data (around 3 days). I use a Windows Service on my server to consume the data. I have around 500 incoming messages per minute. Can anyone help me figure this out?
It might be that you are processing the items too slowly. Therefore the backlog of work grows and you lag further behind.
To get some insight in where you are in the event stream you can use code like this:
private void LogProgressRecord(PartitionContext context)
{
    if (namespaceManager == null)
        return;

    var currentSeqNo = context.Lease.SequenceNumber;
    var lastSeqNo = namespaceManager.GetEventHubPartition(
        context.EventHubPath, context.ConsumerGroupName, context.Lease.PartitionId).EndSequenceNumber;
    var delta = lastSeqNo - currentSeqNo;

    logWriter.Write(
        $"Last processed seqnr for partition {context.Lease.PartitionId}: {currentSeqNo} of {lastSeqNo} in consumergroup '{context.ConsumerGroupName}' (lag: {delta})",
        EventLevel.Informational);
}
The namespaceManager is built like this:
namespaceManager = NamespaceManager.CreateFromConnectionString("Endpoint=sb://xxx.servicebus.windows.net/;SharedAccessKeyName=yyy;SharedAccessKey=zzz");
I call this logging method in the CloseAsync method:
public Task CloseAsync(PartitionContext context, CloseReason reason)
{
    LogProgressRecord(context);
    return Task.CompletedTask;
}
logWriter is just some logging class I have used to write info to blob storage.
It now outputs messages like
Last processed seqnr for partition 3: 32780931 of 32823804 in consumergroup 'telemetry' (lag: 42873)
so when the lag is very high you could be processing events that have occurred a long time ago. In that case you need to scale up/out your processor.
If you notice a lag, you should measure how long it takes to process a given number of items. You can then try to optimize performance and see whether it improves. We did it like this:
public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> events)
{
    try
    {
        stopwatch.Restart();
        // process items here
        stopwatch.Stop();

        await CheckPointAsync(context);

        logWriter.Write(
            $"Processed {events.Count()} events in {stopwatch.ElapsedMilliseconds}ms using partition {context.Lease.PartitionId} in consumergroup {context.ConsumerGroupName}.",
            EventLevel.Informational);
    }
    catch (Exception ex)
    {
        // log and handle errors as appropriate for your pipeline
        logWriter.Write(ex.Message, EventLevel.Error);
    }
}

Changing State when Using Scala Concurrency

I have a function in my Controller that takes user input, and then, using an infinite loop, queries a database and sends the object returned from the database to a webpage. This all works fine, except that I needed to introduce concurrency in order to both run this logic and render the webpage.
The code is given by:
def getSearchResult = Action { request =>
  val search = request.queryString.get("searchInput").head.head
  val databaseSupport = new InteractWithDatabase(comm, db)
  val put = Future {
    while (true) {
      val data = databaseSupport.getFromDatabase(search)
      if (data.nonEmpty) {
        if (data.head.vendorId.equals(search)) {
          comm.communicator ! data.head
        }
      }
    }
  }
  Ok(views.html.singleElement.render)
}
The issue arises when I want to call this again, but with a different input. Because the first thread is in an infinite loop, it never ceases to run and is still running even when I start the second thread. Therefore, both objects are being sent to the webpage at the same time in two separate threads.
How can I stop the first thread once I call this function again? Or, is there a better implementation of this whole idea so that I could do it without using multithreading?
Note: I tried removing the concurrency from this function (as multithreading has been the thing giving me all of these problems) and instead moving it to the web socket itself, but this posed problems as the web socket is connected to a router, and everything connects to the web socket through the router.
Try an async action (Action.async) where you return a Future[Result] as the result, and make the database call inside that Future. E.g. (pseudocode):
def getSearchResult = Action.async { request =>
  val search = request.queryString.get("searchInput").head.head
  val databaseSupport = new InteractWithDatabase(comm, db)
  Future {
    val data = databaseSupport.getFromDatabase(search)
    if (data.nonEmpty) {
      if (data.head.vendorId.equals(search)) {
        comm.communicator ! data.head // A
      }
    }
    Ok(views.html.singleElement.render)
  }
}
It would be better if databaseSupport.getFromDatabase(search) returned a Future, but that is a story for another day. The tricky part is figuring out how to deal with the Actor at "A". Just remember that at the exit it must return a Future[Result].
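If you later make the database call non-blocking, the action can simply map over it. A sketch, where getFromDatabaseAsync is a hypothetical Future-returning variant of your method and an implicit ExecutionContext is in scope:
def getSearchResult = Action.async { request =>
  val search = request.queryString.get("searchInput").head.head
  val databaseSupport = new InteractWithDatabase(comm, db)
  databaseSupport.getFromDatabaseAsync(search).map { data => // hypothetical async variant
    data.headOption
      .filter(_.vendorId.equals(search))
      .foreach(record => comm.communicator ! record) // "A": still a fire-and-forget tell
    Ok(views.html.singleElement.render)
  }
}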
