Database threading with Scala futures (Play2)

I'm developing a Scala Play2 application that queries an OrientDB graph database. Until today I didn't bother with indexes and everything seemed to work fine, but now that I have enabled a couple I get this error:
play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution
exception[[ODatabaseException: Database instance is not set in current
thread. Assure to set it with:
ODatabaseRecordThreadLocal.INSTANCE.set(db);]]
From the documentation I understand that the database object is not thread-safe, but I'm uncertain how to proceed: my queries are picked up asynchronously by the Play2 executor pool, and I'm not sure whether it would be a good idea to mess around with thread locals. Will the driver block? Will the driver clobber its state if different threads from the pool handle the database connection? In any case, I would like some advice from someone who knows Orient's driver architecture better than I do :)

As suggested by the documentation and the error, the OrientDB driver uses thread locals to isolate thread-unsafe portions of itself.
The solution was to move away from the simplistic design I had at that point and run each query against an org.apache.tinkerpop.gremlin.orientdb.OrientGraph acquired through org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory#getTx(), with the acquisition and the query in the same block, and therefore on the same thread.
I didn't investigate whether a transaction is strictly required, although I don't think so.

Related

Is sharing one SQLite connection inside desktop app safe? Sharing one connection vs creating new connections for each query

I have found a similar question on Stack Overflow, but it is solely focused on performance, and the answer is pretty obvious: creating a new connection for each query means slower performance (how much slower? it depends).
I am more worried about the transaction isolation aspect. In the SQLite documentation I found that there is no isolation within a single connection. I am using the sqlite3 library in my Electron desktop app, and I was planning on sharing a single connection throughout the whole time my app is running (to make it a little faster), but now I am wondering if that is safe. If there is no isolation within a single connection, then is this scenario possible?:
Client triggers 2 unrelated processes
1.
db.serialize(() => {
  db.run("BEGIN");
  try {
    db.run("foo");
    db.run("bar");
    db.run("COMMIT");
  } catch (e) {
    db.run("ROLLBACK");
  }
});
2.
db.run("another foobar")
1. and 2. are run in parallel, so it is possible that 2. finishes somewhere between the "BEGIN" and the "COMMIT"/"ROLLBACK" of 1.
Does that mean it is possible for queries from 2. to be rolled back or committed by 1., even though they are entirely separate, or does 2. use some implicit transaction that prevents this?
I think it is possible, since there is no isolation within a single connection, but I might be missing something (I have never worked with SQLite or sqlite3, or I might have missed something more basic), so I would like to confirm whether this scenario is a real danger of using a single sqlite3 connection.
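For illustration, a minimal sketch of the pattern that sidesteps this interleaving: give each logical unit of work its own short-lived connection, so its transaction cannot capture statements from unrelated code. The database path and statements are placeholders invented for the example, not taken from the question:

const sqlite3 = require("sqlite3");

// Each unit of work opens its own connection, so its BEGIN/COMMIT
// cannot swallow statements issued by unrelated code paths.
function runIsolated(statements) {
  const db = new sqlite3.Database("app.db"); // hypothetical path
  db.serialize(() => {
    db.run("BEGIN");
    for (const sql of statements) db.run(sql);
    db.run("COMMIT", () => db.close()); // close once the transaction ends
  });
}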

Using a single QSqlDatabase connection in multiple Qt threads

I have a multi-threaded Qt application in which multiple threads access a single database. Am I required to create separate QSqlDatabase connections for performing SELECT / INSERT / UPDATE in each thread?
From the Qt documentation, I am unable to tell whether the following guideline is discouraging the approach I described above:
"A connection can only be used from within the thread that created it.
Moving connections between threads or creating queries from a
different thread is not supported."
I have tried using the same connection in multiple QThreads and everything works fine in practice, but I wanted to understand whether it is the correct thing to do.
FYI, I am using sqlite3 from within Qt (through the QtSql API), which I understand supports serialized mode by default: https://www.sqlite.org/threadsafe.html
The reason I want to use the same connection in multiple threads is that when I tried using different connections to the same database on multiple threads and performed SELECT / INSERT / UPDATE, I got "database is locked" errors quite frequently. However, when using the same connection in multiple threads, the issue disappeared completely.
Kindly guide on the same.
Regards,
Saurabh Gandhi
The documentation is not merely discouraging it, it flatly states that you must not do it (emphasis mine):
A connection can only be used from within the thread that created it.
So, no, you can't use one connection from multiple threads. It might happen to work, but it's not guaranteed to work, and you're invoking what amounts to undefined behavior. It's not guaranteed to crash either, mind you.
You need to either:
Serialize the access to the database on your end, or
Change the connection parameters so that a lock doesn't reject a query but blocks until the database becomes available. The "database locked" issue is most likely SQLITE_BUSY, which you will keep seeing with multiple connections until you give SQLite a busy timeout (Qt's SQLite driver exposes this as the QSQLITE_BUSY_TIMEOUT connect option). SQLite 3 can easily be used from multiple threads; it shouldn't require any effort on your end other than enabling multithreading, using separate connections, and setting that timeout.

Scaling and selecting unique records per worker

I'm part of a project where we have to deal with a lot of data in a stream. It will be passed to Mongo, and from there it needs to be processed by workers to see whether it should be persisted, amongst other things, or discarded.
We want to scale this horizontally. My question is: what methods are there for ensuring that each worker selects a unique record, one that isn't already being processed by another worker?
Is a central main worker required to hand out jobs to the sub-workers? If so, that central worker is the bottleneck and single point of failure, right?
Any ideas or suggestions welcome.
Thanks!
Josh
You can use findAndModify to both select and flag a document atomically, making sure that only one worker gets to process it. My experience is that this can be slow due to excessive database locking, but that experience is based on MongoDB 2.x so it may not be an issue anymore on 3.x.
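For reference, a minimal sketch of that pattern with the Node.js MongoDB driver, whose findOneAndUpdate() wraps the findAndModify command. The status/worker fields are invented for the example, and the exact shape of the return value differs slightly between driver versions:

// Atomically claim one unprocessed document: matching and flagging happen
// in a single operation, so two workers can never claim the same document.
async function claimJob(jobs, workerId) {
  return jobs.findOneAndUpdate(
    { status: "new" },
    { $set: { status: "processing", worker: workerId } },
    { returnDocument: "after" } // recent drivers return the updated document
  );
}

A worker then just loops on claimJob() and sleeps or backs off when nothing comes back.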
Also, with MongoDB it's difficult to "wait" for new jobs/documents (you can tail the oplog, but you'd have to do this from every worker and each one will wake up and perform the findAndModify() query, resulting in the aforementioned locking).
I think that ultimately you should consider using a proper messaging solution (write the data to MongoDB, write the _id to the broker, and have the workers subscribe to the message queue; if you configure things properly, only one worker will get each job). Well-known brokers are RabbitMQ and nsq.io, and with a bit of extra work you can even use Redis.
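As an illustration of that flow, a sketch of the worker side using amqplib with RabbitMQ; the queue name and URL are placeholders. prefetch(1) plus an explicit ack after processing is what makes "only one worker gets a job" hold, and lets an unfinished job be redelivered if a worker dies:

const amqp = require("amqplib");

// Consume one _id at a time; ack only after the document is processed,
// so a crashed worker's job goes back on the queue.
async function startWorker(processById) {
  const conn = await amqp.connect("amqp://localhost"); // placeholder URL
  const ch = await conn.createChannel();
  await ch.assertQueue("jobs", { durable: true });
  ch.prefetch(1); // at most one unacked job per worker
  await ch.consume("jobs", async (msg) => {
    await processById(msg.content.toString()); // look the _id up in MongoDB
    ch.ack(msg);
  });
}

The producer side is then a single ch.sendToQueue("jobs", Buffer.from(id), { persistent: true }) after the insert into MongoDB.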

worker queue for nodejs?

I am in the process of beginning to write a worker queue for node using node's cluster API and mongoose.
I noticed that a lot of libs exist that already do this but using redis and forking. Is there a good reason to fork versus using the cluster API?
Edit: and now I also found this: https://github.com/xk/node-threads-a-gogo -- too many options!
I would rather not add Redis to the mix since I already use Mongo. Also, my requirements are very loose: I would like persistence but could go without it for the first version.
Part two of the question:
What are the most stable/used nodejs worker queue libs out there today?
Wanted to follow up on this. My solution ended up being a roll-your-own cluster implementation in which some of my cluster workers are dedicated job workers (i.e., they only contain code for working on jobs).
I use agenda for job scheduling.
Cron-type jobs are scheduled by the cluster master. The rest of the jobs are created in the non-worker processes as they are needed (verification emails, etc.).
Before that I was using kue, but I dropped it because the rest of my app uses MongoDB and I didn't like having to run Redis just for job scheduling.
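For anyone weighing the same trade-off, a small sketch of what the agenda side of this setup can look like; the job names, schedule, and MongoDB URL are placeholders invented for the example:

const Agenda = require("agenda");

// agenda keeps its job state in MongoDB, so no Redis is required.
const agenda = new Agenda({ db: { address: "mongodb://localhost/agenda-jobs" } });

// a cron-type job, scheduled once by the cluster master
agenda.define("nightly cleanup", async () => {
  // ... prune expired sessions, etc. ...
});

// an on-demand job, created by the non-worker processes as needed
agenda.define("send verification email", async (job) => {
  const { email } = job.attrs.data;
  // ... send the email ...
});

(async () => {
  await agenda.start();
  await agenda.every("24 hours", "nightly cleanup");
})();

// elsewhere, e.g. in a signup handler:
// agenda.now("send verification email", { email: user.email });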
Have you tried https://github.com/rvagg/node-worker-farm?
It is very lightweight and doesn't require a separate server.
I personally am partial to cluster-master.
https://github.com/isaacs/cluster-master
The reason I like cluster-master is that it does very little besides adding the logic for forking your process, giving you the ability to manage the number of processes you're running, and a little bit of logging/recovery to boot! I find that overly bloated process-management libraries tend to be unstable, and sometimes even slow things down.
This library will be good for you if the following are true:
Your module is largely asynchronous
You don't have a huge amount of different types of events triggering
The events that fire have small amounts of work to do, but you have lots of similar events firing (things like web servers)
The reason for the above list is also the reason why threads-a-gogo may be good for you, for the opposite reasons. If there are a few spots in your code where there is a lot of work to do within your event loop, something like threads-a-gogo, which launches a "thread" specifically for that work, is awesome, because you aren't determining ahead of time how many workers to spawn, but rather spawning them to do work when needed. Note: this can also be bad if a lot of them can spawn; if you start launching too many processes, things can actually bog down. But I digress.
To summarize: if your module is largely asynchronous already, what you really want is a worker pool, to minimize the time your process is not listening for events and to maximize the amount of processor you can use. Unless you have a very busy synchronous call, a single Node event loop will have trouble taking advantage of even a single core of a processor. Under this circumstance, you are best off with cluster-master. What I recommend is doing a little benchmarking to see how much of a single core your program can use under the worst-case scenario. Let's say this is 33% of one core. If you have a quad-core machine, you then tell cluster-master to launch 12 workers (4 cores / 33% per worker).
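To make the sizing advice concrete, here is a sketch of such a worker pool using Node's built-in cluster module (cluster-master wraps the same primitive); the port and the worker count of 12 come straight from the example above, not from any measurement:

const cluster = require("cluster");
const http = require("http");

if (cluster.isMaster) {
  // 4 cores / 33% of a core per worker ≈ 12 workers
  for (let i = 0; i < 12; i++) cluster.fork();
  // basic recovery: replace any worker that dies
  cluster.on("exit", () => cluster.fork());
} else {
  // every worker binds the same port; the master hands out connections
  http.createServer((req, res) => res.end("ok")).listen(8000);
}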
Hope this helped!

Is there a way to share memory among workers/threads/something in Node.JS?

I have a Node app which accesses a static, large (>100M), complex, in-memory data structure, accepts queries, and then serves out little slices of that data to the client over HTTP.
Most queries can be answered in tenths of a second. Hurray for Node!
But, for certain queries, searching this data structure takes a few seconds. This sucks because everyone else has to wait.
To serve more clients efficiently, I would like to use some sort of parallelism.
But, because this data structure is so large, I would like to share it among the workers or threads or what have you, so I don't burn hundreds of megabytes. This would be perfectly safe, because the data structure is not going to be written to. A typical 'fork()' in any other language would do it.
However, as far as I can tell, all the standard ways of doing parallelism in Node explicitly make this impossible. For safety, they don't want you to share anything.
But is there a way?
Background:
It is impractical to put this data structure in a database, or use memcached, or anything like that.
WebWorker API libraries and similar only allow short serialized messages to be passed in and out of the workers.
Node's Cluster uses a call named 'fork', but it is not really a fork of the existing process, it is spawning a new one. So once again, no shared memory.
Probably the really correct answer would be to use filesystem-like access to shared memory, aka tmpfs, or mmap. There are some node libraries that make mount() and mmap() available for exactly something like this. Unfortunately then one has to implement complex data structure access on top of synchronous seeks and reads. My application uses arrays of arrays of dicts and so on. It would be nice to not have to reimplement all that.
I tried writing a C/C++ binding for shared-memory access from Node.js: https://github.com/supipd/node-shm
It is still a work in progress (but working for me) and may be useful; if you find a bug or have a suggestion, let me know.
Building with waf is old style (Node 0.6 and below); newer builds use gyp.
You should look at Node's cluster module (http://nodejs.org/api/cluster.html). It's not clear this will help without more details, but it runs multiple Node processes on the same machine using fork.
Actually Node does support spawning processes. I'm not sure how close Node's fork is to real fork, but you can try it:
http://nodejs.org/api/child_process.html#child_process_child_process_fork_modulepath_args_options
By the way: it is not true that Node is unsuited for that. It is as suited as any other language/web server. You can always fire multiple instances of your server on different ports and put a proxy in front.
If you need more memory, add more memory. :) It is as simple as that. You should also think about putting all of that data in a dedicated in-memory database like Redis or Memcached (or even Couchbase if you need complex queries). You won't have to worry about duplicating that data any more.
Most web applications spend the majority of their life waiting for network buffers and database reads. Node.js is designed to excel at this io bound work. If your work is truly bound by the CPU, you might be served better by another platform.
With that out of the way...
Use process.nextTick (perhaps even nested blocks) to make sure that expensive CPU work is properly asynchronous and not allowed to block your thread. This will make sure one client making expensive requests doesn't negatively impact all the others.
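A sketch of that chunking idea; note it uses setImmediate rather than process.nextTick for the re-scheduling, since nextTick callbacks run before pending I/O and a long nextTick chain can still starve other clients. The function name and chunk size are invented for the example:

// Walk a large array in slices, yielding to the event loop between
// slices so other requests keep getting served.
function forEachChunked(items, fn, done, chunkSize = 1000) {
  let i = 0;
  (function nextChunk() {
    const end = Math.min(i + chunkSize, items.length);
    for (; i < end; i++) fn(items[i]);
    if (i < items.length) setImmediate(nextChunk); // yield to pending I/O
    else done();
  })();
}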
Use node.js cluster to add a worker process for each CPU in the system. Worker processes can all bind to a single HTTP port and use Memcached or Redis to share memory state. Workers also have a messaging API that can be used to keep an in-process memory cache synchronized, however it has some consistency limitations.
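The messaging API mentioned here is the cluster module's IPC channel; a sketch of using it to keep per-worker caches in sync by broadcasting invalidations (the message shape is invented for the example):

const cluster = require("cluster");

if (cluster.isMaster) {
  // fan a cache invalidation out to every worker
  const broadcast = (msg) => {
    for (const id in cluster.workers) cluster.workers[id].send(msg);
  };
  // e.g. broadcast({ type: "invalidate", key: "user:42" });
} else {
  const cache = new Map(); // this worker's in-process cache
  process.on("message", (msg) => {
    if (msg.type === "invalidate") cache.delete(msg.key);
  });
}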
