I am using Slick (v3.1 currently). I create my DatabaseConfig from Typesafe Config, along with TableQuery values for my schemas.
Which of these things should I be sharing across:
Requests
Threads
It's not clear to me from reading the docs.
Here's my own attempt at answering.
I observe that the DatabaseConfig (or some part of it) owns a connection pool. So it seems pretty obvious that it's best to share this object across all requests and threads (treat it like a singleton).
For the TableQuery values, I have less insight. I haven't done extensive testing, but I don't see any particular reason to share these, either across requests or between threads.
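A minimal sketch of that arrangement (the config path "mydb" and the Users table are hypothetical): the DatabaseConfig is held as an application-wide singleton, while the TableQuery is just a cheap immutable value. Note that in Slick 3.1 the relevant imports are slick.backend.DatabaseConfig and slick.driver.JdbcProfile (they moved in 3.2):

import slick.backend.DatabaseConfig
import slick.driver.JdbcProfile

object Db {
  // One DatabaseConfig, and therefore one connection pool, for the whole app.
  val config = DatabaseConfig.forConfig[JdbcProfile]("mydb")
  val db = config.db // safe to use from any thread
}

import Db.config.driver.api._

class Users(tag: Tag) extends Table[(Long, String)](tag, "users") {
  def id = column[Long]("id", O.PrimaryKey)
  def name = column[String]("name")
  def * = (id, name)
}

// An immutable query description: sharing it is safe, and so is recreating it.
val users = TableQuery[Users]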
Consider a scenario: a web request makes N database requests. Suppose I know that all, or a majority, of those requests can be sent to db-readers. With Vitess's architecture, when multiple readers are set up, wouldn't those N db requests get distributed to different db-readers?
When different readers have different replication lag, it is possible that the N db requests return inconsistent results.
Does Vitess have a special way of handling this?
Or how should an application deal with such a situation?
Vitess now supports replica transactions. So, that's what I'd recommend you use if you want consistent reads from replicas. There's a longer answer below if you don't want to use transactions.
The general idea of a replica read is that it's a dirty read. Even if you hit the same replica, the data could have changed from the previous read.
The only difference is that time moves forward if you went back to the same replica.
In reality, this is not very different from cases where you read older data from a different replica. Essentially, you have to deal with the fact that two pieces of data you read are potentially inconsistent with each other.
In other words, if you wrote the application to tolerate inconsistency between two reads, that code would likely tolerate reads that go back in time also. But it all depends on the situation.
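To make the replica-transaction recommendation concrete, here is a hedged sketch over the MySQL protocol (host, keyspace, and queries are hypothetical). Vitess lets a client target replica tablets by suffixing the keyspace with "@replica", and running both reads inside one transaction is what buys the consistency discussed above:

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:mysql://vtgate-host:3306/commerce@replica", "user", "password")
conn.setAutoCommit(false) // start a replica transaction
try {
  val st = conn.createStatement()
  val rs1 = st.executeQuery("SELECT count(*) FROM orders WHERE customer_id = 42")
  rs1.next()
  val orderCount = rs1.getLong(1)
  // This second read sees a state consistent with the first one.
  val rs2 = st.executeQuery("SELECT name FROM customer WHERE id = 42")
  conn.commit() // read-only, so commit merely ends the transaction
} finally conn.close()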
I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast its specific variables to all nodes for subsequent map operations?
Thanks!
Broadcast variables are generally passed to workers, but I can tell you what I did in a similar case in Python.
If you know the total number of rows, you can create an RDD of that length and then run a map operation over it (which will be distributed to the workers). In the map, each worker runs a function to fetch some piece of data (you will need some scheme for making them each fetch different data).
Each worker retrieves its required data by making those calls. You can then do a collectAsMap() to get a dictionary and broadcast that to all workers.
However, keep in mind that every worker needs all the software dependencies for making client requests. You also need to keep socket usage in mind. I just did something similar querying an API and did not see a rise in sockets, although I was making regular HTTP requests. Not sure....
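A sketch in Scala of that pattern, assuming an in-scope SparkContext sc (fetchRows and the partition count are hypothetical). The point is that the fetching runs on the workers, but the broadcast argument is an ordinary Map, not an RDD:

// Replace the body with the real clustered-database query for slice p.
def fetchRows(p: Int): Seq[(String, String)] = Seq.empty

val partitions = 64
val pairs = sc.parallelize(0 until partitions, partitions)
  .flatMap(p => fetchRows(p)) // each worker loads its own slice

// Collect to the driver, then broadcast the plain Map to all workers.
val lookup = sc.broadcast(pairs.collectAsMap())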
OK, so the answer, it seems, is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.
I am currently working on an application that uses Neo4j as an embedded database.
I am wondering how to make sure that separate threads use separate transactions. Normally I would assign database operations to a transaction, but the code examples I found don't show how to ensure that write operations use separate transactions:
try (Transaction tx = graphDb.beginTx()) {
    Node node = graphDb.createNode();
    tx.success();
}
As graphDb is meant to be used as a thread-safe singleton, I really don't see how that is supposed to work... (e.g. for several users creating shopping lists in separate transactions.)
I would be grateful if someone could point out where I am misunderstanding the concept of transactions in Neo4j.
Best regards and many thanks in advance,
Oliver
The code you posted will run in separate transactions if executed by multiple threads, one transaction per thread.
The way this is achieved (and it's quite a common pattern) is by storing the transaction state in a ThreadLocal (read the Javadoc and things will become clear).
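As a sketch (graphDb is assumed to be the shared singleton, and the "owner" property is made up), two threads writing through the same GraphDatabaseService still get one transaction each:

import org.neo4j.graphdb.GraphDatabaseService

def createShoppingList(graphDb: GraphDatabaseService, owner: String): Unit = {
  val tx = graphDb.beginTx() // bound to the calling thread via ThreadLocal
  try {
    val list = graphDb.createNode()
    list.setProperty("owner", owner)
    tx.success()
  } finally tx.close()
}

// Two users, two threads, therefore two independent transactions.
new Thread(new Runnable { def run(): Unit = createShoppingList(graphDb, "alice") }).start()
new Thread(new Runnable { def run(): Unit = createShoppingList(graphDb, "bob") }).start()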
Neo4j Transaction Management
In order to fully maintain data integrity and ensure good transactional behavior, Neo4j supports the ACID properties:
Atomicity: if any part of a transaction fails, the database state is left unchanged.
Consistency: any transaction will leave the database in a consistent state.
Isolation: during a transaction, modified data cannot be accessed by other operations.
Durability: the DBMS can always recover the results of a committed transaction.
Specifically:
- All database operations that access the graph, indexes, or the schema must be performed in a transaction.
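That includes reads. A short sketch, again assuming an in-scope graphDb singleton (the node id and property name are hypothetical); without the surrounding transaction, the getProperty call would throw NotInTransactionException:

val tx = graphDb.beginTx()
try {
  val node = graphDb.getNodeById(0L)
  println(node.getProperty("owner"))
  tx.success()
} finally tx.close()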
Here are some useful links for understanding Neo4j transactions:
http://neo4j.com/docs/stable/rest-api-transactional.html
http://neo4j.com/docs/stable/query-transactions.html
http://comments.gmane.org/gmane.comp.db.neo4j.user/20442
I have a situation where I need to load documents from my app (millions of them) into SolrCloud, with ZooKeeper as the configuration synchronization service. I am stuck on performance issues due to the heavy flux of incoming documents. Say I have two Solr shards running, with two ZooKeeper hosts for each shard. My approach is something like this:
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.SmallestMailboxRouter
import org.apache.solr.client.solrj.impl.CloudSolrServer
import org.apache.solr.common.SolrInputDocument

case class LoadDoc(d: SolrInputDocument)

// Router created globally with 8 instances, based on black-box tests showing
// a single Solr instance can use 8 threads in parallel for loading.
val rtr = system.actorOf(Props(new SolrCloudActor(zkHost, core))
  .withRouter(SmallestMailboxRouter(nrOfInstances = 8)))
// ...
val doc: SolrInputDocument = new SolrInputDocument() // repeated millions of times, one per document
doc.addField("key", "value")
// ...
rtr ! LoadDoc(doc) // routed (not broadcast) to one of the 8 actors

class SolrCloudActor(zkHost: String, solrCoreName: String) extends Actor {
  val server: CloudSolrServer = new CloudSolrServer(zkHost)
  server.setDefaultCollection(solrCoreName)

  def receive: Receive = {
    case LoadDoc(d) => server.add(d)
  }
}
My concerns here:
Is this approach correct? It made sense when I had a single Solr instance and created a router of 8 HttpClient actor instances, rather than SolrCloud with ZooKeeper.
What is the optimal number of threads for peak loading when I have millions of documents queued? Is it numofshards x some_optimal_number, does the number of threads depend on a per-shard, per-core basis, or is it the average (numofshards x some_optimal_number + numberofcore)/numberofcore?
Do I even need to worry about parallelism? Can the single CloudSolrServer instance, which I initialize with all the comma-separated ZooKeeper hosts, take care of distributing the docs by itself?
If I am going in completely the wrong direction, please suggest a better way to improve performance.
The number of actors and the number of threads are not the same thing. Actors use threads from a pool as and when they have work to do.
The number of threads that can be running concurrently is limited to the pool size which (unless otherwise specifically configured) is dynamic, but typically matches the number of cores.
So the ideal number of pooled actors is roughly the same as the number of pooled threads.
The number of pooled threads, in an ideal world, is the number of cores.
But... we don't live in an ideal world. An ideal world has no blocking operations, no network or other IO latency, no other processes competing for resources on the machine, etc. etc.
In a non-ideal (a.k.a. real) world, the best number depends on your codebase and your specific environment. Only you and your profiler can answer that one.
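As a starting point rather than a definitive tuning, you could size the router from the actual core count and move the blocking server.add calls onto their own dispatcher. This reuses system, zkHost, core, and SolrCloudActor from your snippet; "solr-dispatcher" is a hypothetical dispatcher name you would have to define in application.conf:

val cores = Runtime.getRuntime.availableProcessors()
val rtr = system.actorOf(
  Props(new SolrCloudActor(zkHost, core))
    .withRouter(SmallestMailboxRouter(nrOfInstances = cores))
    .withDispatcher("solr-dispatcher")) // keeps blocking IO off the default pool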
I'm currently writing a Node app and thinking ahead about scaling. As I understand it, horizontal scaling is one of the easier ways to scale an application to handle more concurrent requests. My working copy currently uses MongoDB on the backend.
My question is this: I have a data structure that resembles a linked list and requires its order to be strictly maintained. My (imaginary) concern is that when there is a race to the database from multiple Node instances, the resolution of the linked list could be incorrect.
To give an example: imagine the server holds the list a->b. Instance 1 comes in with object c and instance 2 with object d. It is possible that both instances read a->b and decide to append their own objects to the list. Instance 1 will then believe the result is a->b->c while instance 2 thinks it is a->b->d, when the database actually holds a->b->c->d.
In general, this sounds like a job for optimistic locking; however, as I understand it, neither MongoDB nor Redis (the other database I am considering) does transactions in the SQL manner.
I therefore imagine the solution to be one of the below:
Implement my own transaction in MongoDB using flags. The client does a findAndModify on the lock variable and if successful, performs the operations. If unsuccessful, the client retries after a certain timeout.
Use Redis transactions and pubsub to achieve the same effect. I'm not exactly sure how to do this yet, but it sounds like it might be plausible.
Implement some sort of smart load balancing. If multiple clients are operating on the same item, route them to the same instance. Since JS is single-threaded, the problem would be solved. Unfortunately, I didn't find a straightforward way to do that.
I'm sure there is a better, more elegant way to achieve the above, and I would love to hear any solutions or suggestions. Thank you!
If I understood correctly, and the list is stored as a single document, you might be looking at row versioning: add a version property to the document; when you update, you increment (or otherwise change) the version and make the update conditional on it:
//update(condition, value)
update({version: whateverYouReceivedWhenYouDidFind}, newValue)
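A sketch of that conditional update with the MongoDB Java driver called from Scala (the collection, field names, and version value are hypothetical):

import com.mongodb.client.MongoClients
import com.mongodb.client.model.{Filters, Updates}

val coll = MongoClients.create("mongodb://localhost:27017")
  .getDatabase("app").getCollection("lists")

// Only succeeds if nobody bumped the version since we read version 7.
val result = coll.updateOne(
  Filters.and(Filters.eq("_id", "myList"), Filters.eq("version", 7L)),
  Updates.combine(Updates.push("items", "c"), Updates.inc("version", 1)))

if (result.getModifiedCount == 0) {
  // Lost the race: re-read the document and retry.
}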
Hope it helps.
Gus
You want MongoDB's findAndModify command, which guarantees an atomic modification while returning the newly modified document. Because the changes are serial and atomic, instance 1 will see a->b->c and instance 2 will see a->b->c->d.
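A sketch of that approach via the driver-level findOneAndUpdate (names are hypothetical; $push appends atomically, so concurrent appends serialize):

import com.mongodb.client.MongoClients
import com.mongodb.client.model.{Filters, Updates, FindOneAndUpdateOptions, ReturnDocument}

val coll = MongoClients.create("mongodb://localhost:27017")
  .getDatabase("app").getCollection("lists")

// Atomically append and get the document as it looks right after this append.
val afterAppend = coll.findOneAndUpdate(
  Filters.eq("_id", "myList"),
  Updates.push("items", "c"),
  new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER))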
Cheers
If all you are doing is adding new elements to the list, you could use a Redis list and include the time in every value you add. The list may come back unsorted from Redis, but it should be quick to sort once retrieved.
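A sketch with the Jedis client (the key name and the timestamp-pipe encoding are hypothetical): embed the time in each value so the order can be recovered after retrieval:

import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

val jedis = new Jedis("localhost", 6379)

// RPUSH appends in arrival order; the timestamp makes order recoverable anyway.
jedis.rpush("myList", s"${System.currentTimeMillis()}|c")

val sorted = jedis.lrange("myList", 0, -1).asScala
  .sortBy(_.split('|').head.toLong)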