I am trying to Spark to Oracle. If my connection fails, job is failing. Instead, I want to set some connection retry limit to ensure its trying to reconnect as per the limit and then fail the job if its not connecting.
Please suggest on how we could implement this.
Let's assume you are using PySpark. Recently I used this in my project so I know this works.
I have used retry PyPi project
retry 0.9.2
and its application passed through extensive testing process
I used a Python class to hold the retry related configurations.
class RetryConfig:
retry_count = 1
delay_interval = 1
backoff_multiplier = 1
I collected the application parameter from runtime configurations and set them as below:
RetryConfig.retry_count = <retry_count supplied from config>
RetryConfig.delay_interval = <delay_interval supplied from config>
RetryConfig.backoff_multiplier = <backoff_multiplier supplied from config>
Then applied the on the method call that connects the DB
#retry((Exception), tries=RetryConfig.retry_count, delay=RetryConfig.delay_interval, backoff=RetryConfig.backoff_multiplier)
def connect(connection_string):
print("trying")
obj = pyodbc.connect(connection_string)
return obj
Backoff will increase the delay by backoff multiplication factor with each retry - a quite common functional ask.
Cheers!!
Related
I have 3 node Spanner instance, and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
name STRING(MAX),
...,
model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to setup a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
"DELETE FROM predictions WHERE model_version <> #version",
params={"model_version": 104},
param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version),
dml_statements = [
delete_dml,
]
status, row_counts = transaction.batch_update(dml_statements)
database.run_in_transaction(delete_old_rows,
model_version=104,
timeout_secs=3600,
)
What's weird about this is the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or 2 of executing the above code, despite a timeout of one hour.
Anyways, I'm not too sure what to try next, or whether or not I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure if that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance
The reason that setting timeout_secs didn't help was because the argument is unfortunately not the timeout for the transaction. It's the retry timeout for the transaction so it's used to set the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for the Streaming RPC calls was set too low in the client libraries, being set to 120s for Streaming APIs (eg ExecuteStreamingSQL used by partitioned DML calls.)
This has been fixed in the client library source code, changing them to a 60 minute timout (which is the maximum), and will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect your database. (I do not know how to set custom timeouts in Python, sorry)
final RetrySettings retrySettings =
RetrySettings.newBuilder()
.setInitialRpcTimeout(Duration.ofMinutes(60L))
.setMaxRpcTimeout(Duration.ofMinutes(60L))
.setMaxAttempts(1)
.setTotalTimeout(Duration.ofMinutes(60L))
.build();
SpannerOptions.Builder builder =
SpannerOptions.newBuilder()
.setProjectId("[PROJECT]"));
builder
.getSpannerStubSettingsBuilder()
.applyToAllUnaryMethods(
new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
#Override
public Void apply(Builder<?, ?> input) {
input.setRetrySettings(retrySettings);
return null;
}
});
builder
.getSpannerStubSettingsBuilder()
.executeStreamingSqlSettings()
.setRetrySettings(retrySettings);
builder
.getSpannerStubSettingsBuilder()
.streamingReadSettings()
.setRetrySettings(retrySettings);
Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to pass the range of name as well so that limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause so that make sure each transaction touches a small amount of rows but first, you will need to know about the domain of name column values.
Last suggestion is try to avoid using '<>' if possible as it is generally pretty expensive to evaluate.
My python application uses twisted, and uses cassandra python driver under the hood. Cassandra python driver can use cassandra.io.twistedreactor.TwistedConnection as a connection class to use twisted as a way to query.
TwistedConnection class uses timer and reactor.callLater to check if a query task has timed out.
The problem is when I use cassandra ORM (cassandra.cqlengine.models.Model) to query.
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model
# ORM for user settings
class UserSettings(Model):
userid = columns.Text(primary_key=True)
settings = columns.Text()
# Function registered with autobahn/wamp
def worker():
userid = "96c5d462-cf7c-11e7-b567-b8e8563d0920"
def _query():
# This is a blocking call, internally calling twisted reactor
# to collect the query result
setting = model.UserSettings.objects(userid=userid).get()
return json.loads(setting.settings)
threads.deferToThread(_query)
When run in twisted.trial unit tests. The test that uses above code always fails with
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was
unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x10e0a2dd8 [9.98250699043274s] called=0 cancelled=0 TwistedLoop._on_loop_timer()
In the autobahn worker where this code used, however works fine.
The cassandra driver code for TwistedConnection, keeps on calling callLater, and I could not find a way to find if any of these calls are still pending, as these calls are hidden in the TwistedLoop class.
Questions:
Is this correct way of handling cassandra query (which in turn would call twisted reactor)
If yes, is there a way to address DelayedCall resulting from cassandra driver timeout (reactor.callLater).
Just my understanding:
Maybe you will need to call .filter function while filtering? as mentioned in docs setting = model.UserSettings.objects.filter(userid=userid).get()
Maybe work around by change response time in Cassandra conf yaml file?
I'm developing an scala + akka app as part as a bigger application. The purpose of the app is to call external services and SQL databases (using JDBC), do some processing, and return a parsed result, on a recurrent basis. The app uses akka cluster so that it can scale horizontally.
How it should work
I'm creating a **singleton actor* on the cluster who's responsible for sending instructions to a pool of instruction handlers actors. I'm receiving events from a Redis pub/sub channel that state which datasources should be refreshed and how often. This SourceScheduler actor stores in an internal Array the instruction along with the interval.
Then I'm using akka Scheduler to execute a tick function every second. This function filters the array to determine which instructions need to be executed, and sends messages to the instructions handlers pool. The routees in the pool execute the instructions and emit the results through Redis Pub/Sub
The issue
On my machine (Ryzen 7 + 16GB RAM + ArchLinux) everything runs fine and we're processing easily 2500 database calls/second. But once in production, I cannot get it to process more than ~400 requests/s.
The SourceScheduler doesn't tick every second, and messages get stuck in the mailbox. Also, the app uses more CPU resources, and way more RAM (1.3GB in production vs ~350MB on my machine)
The production app runs in a JRE-8 alpine-based Docker container on Rancher, on a MS Azure server.
I understand that singleton actors on clusters can be a bottleneck, but since it only forwards messages to other actors I don't see how it could block.
What I've tried
I use Tomcat JDBC as connection pool manager for SQL queries. I'm sure I don't leak any a connection for I log every connection that is borrowed from the pool and every connection that returns to it
Blocking operations like JDBC queries are all executed on a separate dispatcher, a fixed thread pool executer with 500 threads, so all other actors should run properly
I've also given the SourceScheduler actor a dedicated pinned dispatcher so it should run on it's own thread
I've tried running the app in cluster with 3 nodes, with no performance improvement. Since the SourceScheduler is a singleton, running multiple nodes does not resolve the issue
I've tried the app on my coworker's machine. Works like a charm. I'm only experiencing issues with the production server
I've tried upgrading the production server to the most powerful available on Azure (16 cores, 2.3ghz) with no noticeable change
As anyone ever experienced such differences between their local machine and the production server ?
EDIT SourceScheduler.scala
class SourceScheduler extends Actor with ActorLogging with Timers {
case object Tick
case object SchedulerReport
import context.dispatcher
val instructionHandlerPool = context.actorOf(
ClusterRouterGroup(
RoundRobinGroup(Nil),
ClusterRouterGroupSettings(
totalInstances = 10,
routeesPaths = List("/user/instructionHandler"),
allowLocalRoutees = true
)
).props(),
name = "instructionHandlerRouter")
var ticks: Int = 0
var refreshedSources: Int = 0
val maxTicks: Int = Int.MaxValue - 1
var scheduledSources = Array[(String, Int, String)]()
override def preStart(): Unit = {
log.info("Starting Scheduler")
}
def refreshSource(hash: String) = {
instructionHandlerPool ! Instruction(hash)
refreshedSources += 1
}
// Get sources that neeed to be refreshed
def getEligibleSources(sources: Seq[(String, Int, String)], tick: Int) = {
sources.groupBy(_._1).mapValues(_.toList.minBy(_._2)).values.filter(tick * 1000 % _._2 == 0).map(_._1)
}
def tick(): Unit = {
ticks += 1
log.debug("Scheduler TICK {}", ticks)
val eligibleSources = getEligibleSources(scheduledSources, ticks)
val chunks = eligibleSources.grouped(ConnectionPoolManager.connectionPoolSize).zipWithIndex.toList
log.debug("Scheduling {} sources in {} chunks", eligibleSources.size, chunks.size)
chunks.foreach({
case(sources, index) =>
after((index * 25 + 5) milliseconds, context.system.scheduler)(Future.successful {
sources.foreach(refreshSource)
})
})
if(ticks >= maxTicks) ticks = 0
}
timers.startPeriodicTimer("schedulerTickTimer", Tick, 990 milliseconds)
timers.startPeriodicTimer("schedulerReportTimer", SchedulerReport, 10 seconds)
def receive: Receive = {
case AttachSource(hash, interval, socketId) =>
scheduledSources.synchronized {
scheduledSources = scheduledSources :+ ((hash, interval, socketId))
}
case DetachSource(socketId) =>
scheduledSources.synchronized {
scheduledSources = scheduledSources.filterNot(_._3 == socketId)
}
case SchedulerReport =>
log.info("{} sources were scheduled since last report", refreshedSources)
refreshedSources = 0
case Tick => tick()
case _ =>
}
}
Each source has is determined by a hash containing all required data for the execution (like the host of the database for example), the refresh interval, and the unique id of the client that asked for it so we can stop refreshing when the client disconnects.
Each second, we check if the source needs to be refreshed by applying a modulo with the current value of the ticks counter.
We refresh sources in smaller chunks to avoid connection pool starvation
The problem is that under a small load (~300 rq/s) the tick function is no longer executed every second
It turns out the issue was with Rancher.
We did several tests and the app was running fine on the machine directly, and on docker, but not when using Rancher as the orchestrator. I'm not sure why but since it's not related to Akka I'm closing the issue.
Thanks everyone for your help.
Maybe the bottleneck is on the network latency? In your machine all components are running side by side and communication should have no latency but in the cluster, if you are making a high number of database calls from one machine to another the network latency may be noticeable.
Recently I have been trying to use Mongoengine and Flask with a Replica set. I can connect but, when the primary node changes, connection is lost and there is a break.
Here's a snippet where you can test the behavior. It is using the very useful http://flip-flop.mlab.com/ site to debug replica-set problems
from flask import Flask
from mongoengine import connect
from flask_mongoengine import MongoEngine
import os
db = MongoEngine()
app = Flask(__name__)
class TestDoc(db.Document):
texto = db.StringField()
class ProductionConfig:
def get_conn_data(self):
conn = {
'host':"mongodb://testdbuser:testdbpass#flip.mongolab.com:53117,flop.mongolab.com:54117/testdb?replicaSet=rs-flip-flop",
'replicaSet': 'rs-flip-flop'
}
return conn
import time
app.config['MONGODB_SETTINGS'] = ProductionConfig().get_conn_data()
db.init_app(app)
if __name__ == '__main__':
with app.test_client() as c:
while True:
time.sleep(1)
print(TestDoc.objects().count())
TestDoc(texto="1").save()
I get every time that the primary changes an error: pymongo.errors.AutoReconnect: connection closed
.
Many thanks! I have tried a couple of different pyMongo versions but without success. Any help will be really, really appreciated!
The issue here is that the election of a new primary is not instantaneous. From the docs:
It varies, but a replica set will generally select a new primary within a
minute.
For instance, it may take 10-30 seconds for the members of a replica
set to declare a primary inaccessible (see electionTimeoutMillis). One
of the remaining secondaries holds an election to elect itself as a
new primary. During the election, the cluster is unavailable for
writes.
The election itself may take another 10-30 seconds.
In the time between the primary going down and the replica being elected as the new primary then there is no connection that will accept writes (because they have to go to primaries).
However, there are some things you can do to your code to make it more resilient in these situations.
Firstly, you should set a read preference on the connection (more info here):
conn = {
'host':"mongodb://testdbuser:testdbpass#flip.mongolab.com:53117,flop.mongolab.com:54117/testdb",
'replicaSet': 'rs-flip-flop',
'read_preference': ReadPreference.SECONDARY_PREFERRED
}
This means that reads should be pretty robust during the election.
Unfortunately, short of wrapping all of your writes in try blocks your code will fall over if it is trying to write during the election.
This should be less of a problem than it seems in your question's example because (assuming you are doing your writes in a flask route) the webserver will throw a 500 error response. By the time you request the route again from flask the election should be complete and mongoengine will be writing to the new primary.
We use Slick (2.1.0) with Spray-io (1.3.3). Currently we are facing an issue because we use the same execution context for both the Spray HTTP API part and background running jobs accessing the same database. All database / blocking calls are wrapped in futures using the same scala.concurrent.ExecutionContext.global execution context.
When the background jobs start doing heavy work, they'll consume all available threads, which will lead to timeouts on the API side since their aren't any available threads to handle the API work.
The obvious solution would be to use different execution contexts for both parts with a total thread count not higher than the configured DB connection pool (HikariCP). (as partially suggested here https://www.playframework.com/documentation/2.1.0/ThreadPools#Many-specific-thread-pools) but how would such a setup work with Slick 3 where the execution context is tied to the DB configuration itself?
Slick3 comes with own execution context and number of threads are configurable.You can tweak all the connection pool setting for example(MySQL):
dev-dbconf={
dataSourceClass = "com.mysql.jdbc.jdbc2.optional.MysqlDataSource"
numThreads = 10 //for execution context
maxConnections = 10
minConnections = 5
connectionTimeout = 10000
initializationFailFast = false
properties {
user = "root"
password = "root"
databaseName = "db_name"
serverName = "localhost"
}
}
In this config you can change number of thread according your requirement.
I would like advise you never used "scala.concurrent.ExecutionContext.global" for IO. Because default ExecutionContext comes with fork-join thread pool which is not good for IO.You can create own thread pool for IO:
import scala.concurrent.ExecutionContext
import java.util.concurrent.Executors
object MyExecutionContext {
private val concorrency = Runtime.getRuntime.availableProcessors()
private val factor = 3 // get from configuration file
private val noOfThread = concorrency * factor
implicit val ioThreadPool: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(noOfThread))
}
// Use this execution context for IO instead of scala execution context.
import MyExecutionContext.ioThreadPool
Future{
// your blocking IO code
}
You can change noOfThread according to your requirement. It would be good if you set number of thread according number processors in your machine.
For more info, your can see Best Practices for Using Slick on Production
and Slick Doc.