Using the Cassandra Python driver with Twisted - python-3.x

My Python application uses Twisted and relies on the Cassandra Python driver under the hood. The driver can use cassandra.io.twistedreactor.TwistedConnection as its connection class, so that queries run through the Twisted reactor.
The TwistedConnection class uses a timer and reactor.callLater to check whether a query has timed out.
The problem appears when I query through the Cassandra ORM (cassandra.cqlengine.models.Model):
import json

from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model
from twisted.internet import threads

# ORM for user settings
class UserSettings(Model):
    userid = columns.Text(primary_key=True)
    settings = columns.Text()

# Function registered with autobahn/wamp
def worker():
    userid = "96c5d462-cf7c-11e7-b567-b8e8563d0920"

    def _query():
        # This is a blocking call, internally calling the twisted reactor
        # to collect the query result
        setting = UserSettings.objects(userid=userid).get()
        return json.loads(setting.settings)

    threads.deferToThread(_query)
When run under twisted.trial unit tests, the test that uses the above code always fails with:
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x10e0a2dd8 [9.98250699043274s] called=0 cancelled=0 TwistedLoop._on_loop_timer()
In the autobahn worker where this code is used, however, it works fine.
The driver's TwistedConnection code keeps calling callLater, and I could not find a way to check whether any of these calls are still pending, because they are hidden inside the TwistedLoop class.
Questions:
Is this the correct way of handling a Cassandra query (which in turn uses the Twisted reactor)?
If yes, is there a way to deal with the DelayedCalls left behind by the driver's timeout handling (reactor.callLater)?

Just my understanding:
Maybe you need to call the .filter function when filtering, as mentioned in the docs: setting = UserSettings.objects.filter(userid=userid).get()
Maybe work around it by changing the response timeout in the Cassandra conf YAML file?
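If the DirtyReactorAggregateError only shows up under twisted.trial, one workaround I have seen is to cancel whatever DelayedCalls are still pending in tearDown. This is only a sketch built on standard Twisted APIs, not something specific to the Cassandra driver; the test class and the placeholder _query helper are made up for illustration:

from twisted.internet import defer, reactor, threads
from twisted.trial import unittest


def _query():
    # Placeholder for the blocking cqlengine query from the question.
    return {"theme": "dark"}


class UserSettingsQueryTest(unittest.TestCase):
    """Hypothetical test showing cleanup of leftover DelayedCalls."""

    def tearDown(self):
        # Cancel timers scheduled via reactor.callLater (for example the
        # driver's query-timeout timer) that never fired; otherwise trial
        # reports DirtyReactorAggregateError.
        for call in reactor.getDelayedCalls():
            if call.active():
                call.cancel()

    @defer.inlineCallbacks
    def test_query_user_settings(self):
        settings = yield threads.deferToThread(_query)
        self.assertIsInstance(settings, dict)

Note that this cancels every pending DelayedCall on the reactor, not just the driver's, so it is only appropriate as test-level cleanup.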

Related

Why is pyspark implemented such that exiting a session stops the underlying Spark context?

I just massively shot my foot by writing "pythonic" spark code like this:
# spark = ... getOrCreate()  # essentially provided by the environment (Databricks)
with spark.newSession() as session:
    session.catalog.setCurrentDatabase("foo_test")
    do_something_within_database_scope(session)

assert spark.currentDatabase() == "default"
And oh was I surprised that when executing this notebook cell, somehow the cluster terminated.
I read through this answer which tells me, that there can only be one spark context. That is fine. But why is exiting a session terminating the underlying context? Is there some requirement for this or is this just a design flaw in pyspark?
I also understand that the session's __exit__ call invokes context.stop() - I want to know why it is implemented like that!
I always think of a session as a user-initiated thing, like with databases or HTTP clients, which I can create and discard at will. If the session provides __enter__ and __exit__ then I try to use it from within a with context to make sure I clean up after I am done.
Is my understanding wrong, or alternatively why does pyspark deviate from that concept?
Edit: I tested this together with databricks-connect which comes with its own pyspark python module, but as pri pointed out below it seems to be implemented the same way in standard pyspark.
I looked at the code, and it calls the method below:
@since(2.0)
def __exit__(
    self,
    exc_type: Optional[Type[BaseException]],
    exc_val: Optional[BaseException],
    exc_tb: Optional[TracebackType],
) -> None:
    """
    Enable 'with SparkSession.builder.(...).getOrCreate() as session: app' syntax.

    Specifically stop the SparkSession on exit of the with block.
    """
    self.stop()
And the stop method is:
@since(2.0)
def stop(self) -> None:
    """Stop the underlying :class:`SparkContext`."""
    from pyspark.sql.context import SQLContext

    self._sc.stop()
    # We should clean the default session up. See SPARK-23228.
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None
    SQLContext._instantiatedContext = None
So I don't think you can stop just the SparkSession. Whenever a SparkSession is stopped (irrespective of how; here __exit__ is called when control leaves the with block), it kills the underlying SparkContext along with it.
Link to the relevant Apache Spark code below:
https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L1029
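If the goal is merely a temporarily scoped database, one workaround is to manage the scope yourself instead of relying on the session's __exit__. This is only a sketch built on the standard pyspark catalog API (do_something_within_database_scope is the hypothetical function from the question):

from contextlib import contextmanager

from pyspark.sql import SparkSession


@contextmanager
def database_scope(spark: SparkSession, database: str):
    """Temporarily switch the current database, restoring it on exit."""
    previous = spark.catalog.currentDatabase()
    spark.catalog.setCurrentDatabase(database)
    try:
        yield spark
    finally:
        spark.catalog.setCurrentDatabase(previous)


# spark = SparkSession.builder.getOrCreate()  # provided by the environment
# with database_scope(spark, "foo_test") as s:
#     do_something_within_database_scope(s)

This keeps the SparkContext alive because nothing ever calls stop().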

Oracle Connection retry using spark

I am trying to connect Spark to Oracle. If my connection fails, the job fails. Instead, I want to set a connection retry limit so that it tries to reconnect up to that limit and only fails the job if it still cannot connect.
Please suggest how we could implement this.
Let's assume you are using PySpark. I recently used this in my project, so I know it works.
I used the retry PyPI project (retry 0.9.2), and its application passed through an extensive testing process.
I used a Python class to hold the retry related configurations.
class RetryConfig:
    retry_count = 1
    delay_interval = 1
    backoff_multiplier = 1
I collected the application parameters from the runtime configuration and set them as below:
RetryConfig.retry_count = <retry_count supplied from config>
RetryConfig.delay_interval = <delay_interval supplied from config>
RetryConfig.backoff_multiplier = <backoff_multiplier supplied from config>
Then I applied the decorator to the method that connects to the DB:
from retry import retry
import pyodbc


@retry(Exception, tries=RetryConfig.retry_count, delay=RetryConfig.delay_interval,
       backoff=RetryConfig.backoff_multiplier)
def connect(connection_string):
    print("trying")
    obj = pyodbc.connect(connection_string)
    return obj
Backoff increases the delay by the backoff multiplication factor with each retry - quite a common functional requirement.
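The same decorator can wrap the Spark read itself. Below is only a sketch that assumes the RetryConfig values above have already been populated; the JDBC URL, credentials, table name and driver class are placeholders, not values from this question:

from pyspark.sql import SparkSession
from retry import retry

spark = SparkSession.builder.getOrCreate()


@retry(Exception, tries=RetryConfig.retry_count, delay=RetryConfig.delay_interval,
       backoff=RetryConfig.backoff_multiplier)
def read_oracle_table(table_name):
    # All connection details below are illustrative placeholders.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
            .option("dbtable", table_name)
            .option("user", "db_user")
            .option("password", "db_password")
            .option("driver", "oracle.jdbc.driver.OracleDriver")
            .load())

One caveat: the decorator arguments are evaluated when the function is defined, so RetryConfig has to be filled in from the runtime configuration before this definition runs.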
Cheers!!

pyspark - using MatrixFactorizationModel in RDD's map function

I have this model:
from pyspark.mllib.recommendation import ALS

model = ALS.trainImplicit(ratings,
                          rank,
                          seed=seed,
                          iterations=iterations,
                          lambda_=regularization_parameter,
                          alpha=alpha)
I have successfully used it to recommend users for all products with the simple approach:
recRDD = model.recommendUsersForProducts(number_recs)
Now if I just want to recommend to a set of items, I first load the target items:
target_items = sc.textFile(items_source)
And then map the recommendUsers() function like this:
recRDD = target_items.map(lambda x: model.recommendUsers(int(x), number_recs))
This fails after any action I try, with the following error:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I'm trying this locally, so I'm not sure whether this error persists in client or cluster mode. I have tried broadcasting the model, which only produces the same error at broadcast time instead.
Am I thinking straight? I could eventually just recommend for all and then filter, but I'm really trying to avoid recommending for every item due to the large number of them.
Thanks in advance!
I don't think there is a way to call recommendUsers from the workers, because it ultimately calls callJavaFunc, which needs the SparkContext as an argument. If target_items is sufficiently small, you could call recommendUsers in a loop on the driver (this would be the opposite extreme of predicting for all users and then filtering).
Alternatively, have you looked at predictAll? Roughly speaking, you could run predictions for all users for the target items, and then rank them yourself.
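To make the predictAll suggestion concrete, here is a rough sketch. It assumes ratings, sc, target_items, model and number_recs from the question (with ratings as an RDD of Rating objects) and has not been run against the asker's data:

# Collect the (presumably small) set of target item ids on the driver.
target_ids = target_items.map(lambda x: int(x)).collect()

# All users seen in the training data.
user_ids = ratings.map(lambda r: r.user).distinct()

# Build every (user, item) pair we want scored.
pairs = user_ids.cartesian(sc.parallelize(target_ids))

# Score the pairs on the cluster, then keep the top-N users per item.
top_users_per_item = (model.predictAll(pairs)
                      .map(lambda r: (r.product, (r.user, r.rating)))
                      .groupByKey()
                      .mapValues(lambda scores: sorted(scores,
                                                       key=lambda s: s[1],
                                                       reverse=True)[:number_recs]))

Since predictAll runs as an ordinary RDD job, it avoids calling recommendUsers inside a worker-side map.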

MongoEngine ReplicaSet Connection fails on stepdown

Recently I have been trying to use MongoEngine and Flask with a replica set. I can connect, but when the primary node changes the connection is lost and the application breaks.
Here is a snippet where you can test the behaviour. It uses the very useful http://flip-flop.mlab.com/ site to debug replica-set problems:
import os
import time

from flask import Flask
from mongoengine import connect
from flask_mongoengine import MongoEngine

db = MongoEngine()
app = Flask(__name__)


class TestDoc(db.Document):
    texto = db.StringField()


class ProductionConfig:
    def get_conn_data(self):
        conn = {
            'host': "mongodb://testdbuser:testdbpass@flip.mongolab.com:53117,flop.mongolab.com:54117/testdb?replicaSet=rs-flip-flop",
            'replicaSet': 'rs-flip-flop'
        }
        return conn


app.config['MONGODB_SETTINGS'] = ProductionConfig().get_conn_data()
db.init_app(app)

if __name__ == '__main__':
    with app.test_client() as c:
        while True:
            time.sleep(1)
            print(TestDoc.objects().count())
            TestDoc(texto="1").save()
Every time the primary changes I get an error: pymongo.errors.AutoReconnect: connection closed.
I have tried a couple of different PyMongo versions but without success. Many thanks, any help will be really appreciated!
The issue here is that the election of a new primary is not instantaneous. From the docs:
It varies, but a replica set will generally select a new primary within a minute. For instance, it may take 10-30 seconds for the members of a replica set to declare a primary inaccessible (see electionTimeoutMillis). One of the remaining secondaries holds an election to elect itself as a new primary. During the election, the cluster is unavailable for writes. The election itself may take another 10-30 seconds.
In the time between the primary going down and a replica being elected as the new primary, there is no connection that will accept writes (because writes have to go to the primary).
However, there are some things you can do to your code to make it more resilient in these situations.
Firstly, you should set a read preference on the connection (more info here):
from pymongo import ReadPreference

conn = {
    'host': "mongodb://testdbuser:testdbpass@flip.mongolab.com:53117,flop.mongolab.com:54117/testdb",
    'replicaSet': 'rs-flip-flop',
    'read_preference': ReadPreference.SECONDARY_PREFERRED
}
This means that reads should be pretty robust during the election.
Unfortunately, short of wrapping all of your writes in try blocks, your code will fall over if it tries to write during the election.
This should be less of a problem than your question's example suggests because (assuming you are doing your writes in a Flask route) the web server will return a 500 error response. By the time you request the route again from Flask, the election should be complete and MongoEngine will be writing to the new primary.
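If you do want individual writes to survive an election, here is a minimal sketch of the try-block approach mentioned above (the retry count and sleep interval are arbitrary illustrative choices, not values from this answer):

import time

from pymongo.errors import AutoReconnect


def save_with_retry(doc, retries=5, wait_seconds=2):
    # Keep retrying the save while a new primary is being elected.
    for attempt in range(retries):
        try:
            return doc.save()
        except AutoReconnect:
            if attempt == retries - 1:
                raise
            time.sleep(wait_seconds)


# save_with_retry(TestDoc(texto="1"))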

Best practice for Slick 2.1 / 3 execution context usage

We use Slick (2.1.0) with Spray-io (1.3.3). Currently we are facing an issue because we use the same execution context for both the Spray HTTP API part and background jobs accessing the same database. All database/blocking calls are wrapped in futures using the same scala.concurrent.ExecutionContext.global execution context.
When the background jobs start doing heavy work, they consume all available threads, which leads to timeouts on the API side since there aren't any threads left to handle the API work.
The obvious solution would be to use different execution contexts for the two parts, with a total thread count no higher than the configured DB connection pool (HikariCP), as partially suggested here: https://www.playframework.com/documentation/2.1.0/ThreadPools#Many-specific-thread-pools. But how would such a setup work with Slick 3, where the execution context is tied to the DB configuration itself?
Slick 3 comes with its own execution context, and the number of threads is configurable. You can tweak all the connection pool settings, for example (MySQL):
dev-dbconf = {
  dataSourceClass = "com.mysql.jdbc.jdbc2.optional.MysqlDataSource"
  numThreads = 10 // for execution context
  maxConnections = 10
  minConnections = 5
  connectionTimeout = 10000
  initializationFailFast = false
  properties {
    user = "root"
    password = "root"
    databaseName = "db_name"
    serverName = "localhost"
  }
}
In this config you can change the number of threads according to your requirements.
I would advise you never to use scala.concurrent.ExecutionContext.global for IO, because the default ExecutionContext uses a fork-join thread pool, which is not good for blocking IO. You can create your own thread pool for IO:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object MyExecutionContext {
  private val concurrency = Runtime.getRuntime.availableProcessors()
  private val factor = 3 // get from configuration file
  private val noOfThread = concurrency * factor
  implicit val ioThreadPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(noOfThread))
}

// Use this execution context for IO instead of the global Scala execution context.
import MyExecutionContext.ioThreadPool

Future {
  // your blocking IO code
}
You can change noOfThread according to your requirements. It is a good idea to set the number of threads according to the number of processors in your machine.
For more info, you can see Best Practices for Using Slick on Production and the Slick docs.
