MongoEngine ReplicaSet Connection fails on stepdown - python-3.x

Recently I have been trying to use MongoEngine and Flask with a replica set. I can connect, but when the primary node changes the connection is lost and the application breaks.
Here's a snippet where you can test the behavior. It uses the very useful http://flip-flop.mlab.com/ site for debugging replica-set problems:
from flask import Flask
from mongoengine import connect
from flask_mongoengine import MongoEngine
import os
import time

db = MongoEngine()
app = Flask(__name__)

class TestDoc(db.Document):
    texto = db.StringField()

class ProductionConfig:
    def get_conn_data(self):
        conn = {
            'host': "mongodb://testdbuser:testdbpass@flip.mongolab.com:53117,flop.mongolab.com:54117/testdb?replicaSet=rs-flip-flop",
            'replicaSet': 'rs-flip-flop'
        }
        return conn

app.config['MONGODB_SETTINGS'] = ProductionConfig().get_conn_data()
db.init_app(app)

if __name__ == '__main__':
    with app.test_client() as c:
        while True:
            time.sleep(1)
            print(TestDoc.objects().count())
            TestDoc(texto="1").save()
Every time the primary changes I get an error: pymongo.errors.AutoReconnect: connection closed.
I have tried a couple of different PyMongo versions, but without success. Any help will be really, really appreciated. Many thanks!

The issue here is that the election of a new primary is not instantaneous. From the docs:
It varies, but a replica set will generally select a new primary within a minute.
For instance, it may take 10-30 seconds for the members of a replica set to declare a primary inaccessible (see electionTimeoutMillis). One of the remaining secondaries holds an election to elect itself as the new primary. During the election, the cluster is unavailable for writes.
The election itself may take another 10-30 seconds.
In the time between the primary going down and a secondary being elected as the new primary, there is no connection that will accept writes (because writes have to go to the primary).
However, there are some things you can do to your code to make it more resilient in these situations.
Firstly, you should set a read preference on the connection (more info here):
from pymongo import ReadPreference

conn = {
    'host': "mongodb://testdbuser:testdbpass@flip.mongolab.com:53117,flop.mongolab.com:54117/testdb",
    'replicaSet': 'rs-flip-flop',
    'read_preference': ReadPreference.SECONDARY_PREFERRED
}
This means that reads should be pretty robust during the election.
Unfortunately, short of wrapping all of your writes in try blocks, your code will fall over if it tries to write during an election.
This should be less of a problem than your question's example suggests, because (assuming you are doing your writes in a Flask route) the web server will return a 500 error response. By the time you request the route again, the election should be complete and MongoEngine will be writing to the new primary.
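If you do want individual writes to survive a stepdown, a minimal sketch of such a retry wrapper might look like the following (the retry count and delay are arbitrary assumptions, not values from the answer; TestDoc is the model from the question):

import time
from pymongo.errors import AutoReconnect

def save_with_retry(doc, retries=5, delay=2):
    # Retry a MongoEngine save while the replica set has no primary.
    for _ in range(retries):
        try:
            return doc.save()
        except AutoReconnect:
            time.sleep(delay)  # no primary yet; wait for the election to finish
    raise AutoReconnect("no primary elected after %d attempts" % retries)

save_with_retry(TestDoc(texto="1"))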

Related

Unable to delete large number of rows from Spanner

I have a 3-node Spanner instance and a single table that contains around 4 billion rows. The DDL looks like this:
CREATE TABLE predictions (
    name STRING(MAX),
    ...,
    model_version INT64,
) PRIMARY KEY (name, model_version)
I'd like to set up a job to periodically remove some old rows from this table using the Python Spanner client. The query I'd like to run is:
DELETE FROM predictions WHERE model_version <> ?
According to the docs, it sounds like I would need to execute this as a Partitioned DML statement. I am using the Python Spanner client as follows, but am experiencing timeouts (504 Deadline Exceeded errors) due to the large number of rows in my table.
# this always throws a "504 Deadline Exceeded" error
database.execute_partitioned_dml(
    "DELETE FROM predictions WHERE model_version <> @model_version",
    params={"model_version": 104},
    param_types={"model_version": Type(code=INT64)},
)
My first intuition was to see if there was some sort of timeout I could increase, but I don't see any timeout parameters in the source :/
I did notice there was a run_in_transaction method in the Spanner lib that contains a timeout parameter, so I decided to deviate from the partitioned DML approach to see if using this method worked. Here's what I ran:
def delete_old_rows(transaction, model_version):
    delete_dml = "DELETE FROM predictions WHERE model_version <> {}".format(model_version),
    dml_statements = [
        delete_dml,
    ]
    status, row_counts = transaction.batch_update(dml_statements)

database.run_in_transaction(delete_old_rows,
                            model_version=104,
                            timeout_secs=3600,
                            )
What's weird is that the timeout_secs parameter appears to be ignored, because I still get a 504 Deadline Exceeded error within a minute or two of executing the above code, despite the one-hour timeout.
Anyway, I'm not sure what to try next, or whether I'm missing something obvious that would allow me to run a delete query in a timely fashion on this huge Spanner table. The model_version column has pretty low cardinality (generally 2-3 unique model_version values in the entire table), so I'm not sure whether that would factor into any recommendations. But if someone could offer some advice or suggestions, that would be awesome :) Thanks in advance.
The reason that setting timeout_secs didn't help is that the argument is unfortunately not the timeout for the transaction. It is the retry timeout for the transaction, used to set the deadline after which the transaction will stop being retried.
We will update the docs for run_in_transaction to explain this better.
The root cause was that the total timeout for streaming RPC calls was set too low in the client libraries: 120s for streaming APIs (e.g. ExecuteStreamingSQL, which is used by partitioned DML calls).
This has been fixed in the client library source code, changing it to a 60-minute timeout (which is the maximum), and it will be part of the next client library release.
As a workaround, in Java, you can configure the timeouts as part of the SpannerOptions when you connect to your database. (I do not know how to set custom timeouts in Python, sorry.)
final RetrySettings retrySettings =
    RetrySettings.newBuilder()
        .setInitialRpcTimeout(Duration.ofMinutes(60L))
        .setMaxRpcTimeout(Duration.ofMinutes(60L))
        .setMaxAttempts(1)
        .setTotalTimeout(Duration.ofMinutes(60L))
        .build();

SpannerOptions.Builder builder =
    SpannerOptions.newBuilder()
        .setProjectId("[PROJECT]");

builder
    .getSpannerStubSettingsBuilder()
    .applyToAllUnaryMethods(
        new ApiFunction<UnaryCallSettings.Builder<?, ?>, Void>() {
          @Override
          public Void apply(UnaryCallSettings.Builder<?, ?> input) {
            input.setRetrySettings(retrySettings);
            return null;
          }
        });

builder
    .getSpannerStubSettingsBuilder()
    .executeStreamingSqlSettings()
    .setRetrySettings(retrySettings);

builder
    .getSpannerStubSettingsBuilder()
    .streamingReadSettings()
    .setRetrySettings(retrySettings);

Spanner spanner = builder.build().getService();
The first suggestion is to try gcloud instead.
https://cloud.google.com/spanner/docs/modify-gcloud#modifying_data_using_dml
Another suggestion is to also pass a range on name, so as to limit the number of rows scanned. For example, you could add something like STARTS_WITH(name, 'a') to the WHERE clause to make sure each transaction touches a small number of rows; first, though, you will need to know the domain of values in the name column.
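As a rough sketch under assumptions (the instance and database names, the single-character prefixes, and the target model_version below are illustrative, not from the answer), the delete could be split into partitioned DML statements that each scan only one slice of the primary key:

from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

# One partitioned DML delete per name prefix, so each statement scans a
# bounded slice of the (name, model_version) primary key range.
for prefix in "abcdefghijklmnopqrstuvwxyz":
    row_count = database.execute_partitioned_dml(
        "DELETE FROM predictions "
        "WHERE STARTS_WITH(name, @prefix) AND model_version <> @version",
        params={"prefix": prefix, "version": 104},
        param_types={
            "prefix": spanner.param_types.STRING,
            "version": spanner.param_types.INT64,
        },
    )
    print(prefix, row_count)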
A last suggestion is to try to avoid using '<>' if possible, as it is generally pretty expensive to evaluate.

Oracle Connection retry using spark

I am trying to connect Spark to Oracle. If my connection fails, the job fails. Instead, I want to set a connection retry limit so that it tries to reconnect up to that limit and only fails the job if it still cannot connect.
Please suggest how we could implement this.
Let's assume you are using PySpark. I recently used this in my project, so I know it works.
I used the retry PyPI project (retry 0.9.2), and its application passed through an extensive testing process.
I used a Python class to hold the retry-related configurations:
class RetryConfig:
    retry_count = 1
    delay_interval = 1
    backoff_multiplier = 1
I collected the application parameters from the runtime configuration and set them as below:
RetryConfig.retry_count = <retry_count supplied from config>
RetryConfig.delay_interval = <delay_interval supplied from config>
RetryConfig.backoff_multiplier = <backoff_multiplier supplied from config>
Then I applied the decorator to the method that connects to the DB:
import pyodbc
from retry import retry

@retry(Exception, tries=RetryConfig.retry_count, delay=RetryConfig.delay_interval, backoff=RetryConfig.backoff_multiplier)
def connect(connection_string):
    print("trying")
    obj = pyodbc.connect(connection_string)
    return obj
Backoff increases the delay by the backoff multiplication factor with each retry, which is quite a common functional requirement.
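A hypothetical usage example (the ODBC connection string below is an assumption, not from the answer):

# With retry_count=3, delay_interval=1 and backoff_multiplier=2, the wait
# doubles after each failed attempt (1s, then 2s) before the final failure
# propagates as an exception.
conn = connect("DRIVER={Oracle ODBC Driver};DBQ=MYDB;UID=scott;PWD=tiger")
cursor = conn.cursor()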
Cheers!!

PYODBC Connection Not Closing

I'm using threading to execute multiple SQL queries simultaneously. I first append the connections to a list called connections as such:
import pyodbc

connections = []
num_connections = 100
for i in range(num_connections):
    connections.append(pyodbc.connect('connection_string'))
This works and so does threading multiple queries. However, if I run this process many times, I get the error
('HY000', '[HY000] [Oracle][ODBC][Ora]ORA-12519: TNS:no appropriate service handler found
(12519) (SQLDriverConnect); [HY000] [Oracle][ODBC][Ora]ORA-12519: TNS:no appropriate service handler found (12519)')
I'm fairly sure this is because the number of ODBC connections becomes too high. When I try closing them with
pyodbc.pooling = False
for i in connections:
    i.close()
    del i
del connections
the list connections is deleted. However, it doesn't appear to have closed any connections, because the next time I run pyodbc.connect('connection_string') I immediately get the same error. Any ideas on what this could be?

Using cassandra python driver with twisted

My Python application uses Twisted, and it uses the Cassandra Python driver under the hood. The Cassandra Python driver can use cassandra.io.twistedreactor.TwistedConnection as its connection class so that queries run through Twisted.
The TwistedConnection class uses a timer and reactor.callLater to check whether a query task has timed out.
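For reference, a minimal sketch of how that connection class is plugged in (the contact point and keyspace name are assumptions for illustration):

from cassandra.cluster import Cluster
from cassandra.io.twistedreactor import TwistedConnection

cluster = Cluster(
    contact_points=["127.0.0.1"],
    connection_class=TwistedConnection,  # run the driver's I/O on the Twisted reactor
)
session = cluster.connect("my_keyspace")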
The problem appears when I use the Cassandra ORM (cassandra.cqlengine.models.Model) to query.
import json

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model
from twisted.internet import threads

# ORM for user settings
class UserSettings(Model):
    userid = columns.Text(primary_key=True)
    settings = columns.Text()

# Function registered with autobahn/wamp
def worker():
    userid = "96c5d462-cf7c-11e7-b567-b8e8563d0920"

    def _query():
        # This is a blocking call, internally using the twisted reactor
        # to collect the query result
        setting = UserSettings.objects(userid=userid).get()
        return json.loads(setting.settings)

    threads.deferToThread(_query)
When run under twisted.trial unit tests, the test that uses the above code always fails with:
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x10e0a2dd8 [9.98250699043274s] called=0 cancelled=0 TwistedLoop._on_loop_timer()>
In the autobahn worker where this code is used, however, it works fine.
The Cassandra driver code for TwistedConnection keeps calling callLater, and I could not find a way to tell whether any of those calls are still pending, as they are hidden in the TwistedLoop class.
Questions:
Is this the correct way of handling a Cassandra query (which in turn calls the Twisted reactor)?
If yes, is there a way to address the DelayedCall left over from the Cassandra driver timeout (reactor.callLater)?
Just my understanding:
Maybe you need to call the .filter function when filtering, as mentioned in the docs: setting = UserSettings.objects.filter(userid=userid).get()
Maybe you can work around it by changing the response timeout in the Cassandra conf YAML file?

Fetch data from a large azure table and how to avoid timeout error?

I am trying to fetch data from a large Azure Table, and after a few hours I run into the following error:
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
The following is my code:
from azure.storage import TableService, Entity
from azure import *
import json
from datetime import datetime as dt
from datetime import timezone, timedelta

ts = TableService(account_name='dev', account_key='key')

i = 0
next_pk = None
next_rk = None
N = 10
date_N_days_ago = dt.now(timezone.utc) - timedelta(days=N)

while True:
    entities = ts.query_entities('Events', next_partition_key=next_pk, next_row_key=next_rk, top=1000)
    i += 1
    with open('blobdata', 'a') as fil:
        for entity in entities:
            if entity.Timestamp > date_N_days_ago:
                fil.write(str(entity.DetailsJSON) + '\n')
    with open('1k_data', 'a') as fil2:
        if i % 5000 == 0:
            fil2.write('{}|{}|{}|{}'.format(i, entity.PartitionKey, entity.Timestamp, entity.DetailsJSON + '\n'))
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Also, if someone has an idea of how to achieve this in a better fashion, please do tell, as the table is very large and the code takes too long to process.
This exception can happen in all sorts of network calls on occasion. It should be entirely transient. I would recommend simply catching the error, waiting a little bit, and trying again.
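As a minimal sketch (not code from the answer; the retry count and wait time are arbitrary assumptions), the paged query from the question could be wrapped like this:

import time

def query_with_retry(ts, next_pk, next_rk, retries=3, wait=5):
    # Retry a transient network failure with a short pause between attempts.
    last_error = None
    for _ in range(retries):
        try:
            return ts.query_entities('Events', next_partition_key=next_pk,
                                     next_row_key=next_rk, top=1000)
        except ConnectionRefusedError as err:
            last_error = err
            time.sleep(wait)
    raise last_error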
The Azure Storage Python library recently moved, and we will be doing a ton of improvements on it in the coming months, including built-in retry policies. So, in the future, the library itself will retry these sorts of errors for you.
In general if you want to make this faster you could try adding some multithreading to the processing of your entities. Even parallelizing writing to the two different files could really help.
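A loose sketch of that idea (an assumption, not code from the answer): hand each page of entities to a small thread pool so that filtering and file writes overlap with fetching the next page.

from concurrent.futures import ThreadPoolExecutor

def process_page(page, cutoff):
    # Filter one page of entities and append the matching rows to the file.
    rows = [str(e.DetailsJSON) + '\n' for e in page if e.Timestamp > cutoff]
    with open('blobdata', 'a') as fil:
        fil.writelines(rows)

pool = ThreadPoolExecutor(max_workers=4)
# Inside the paging loop, replace the inline writes with:
#     pool.submit(process_page, list(entities), date_N_days_ago)
# and call pool.shutdown(wait=True) once the loop finishes.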
