Parallelization of many Map/Reduce jobs in MongoDB (performance, multithreading)

I have to execute an operation that launches a lot of Map/Reduce jobs (~400), but every Map/Reduce runs on a different collection, so there should not be any concurrent writes.
To improve the performance of this operation I parallelized it by creating one thread on the application side (I use the Java driver) for each Map/Reduce (note that I don't use sharding).
But when I compared the results, the parallel run came out worse than the sequential (single-threaded) one.
To be more precise: 341 seconds for the sequential execution versus 904 seconds for the parallel one.
So instead of a better execution time, it is almost three times longer.
Does anyone know why MongoDB doesn't like parallel Map/Reduce jobs?
I found an article about it (link), but now that MongoDB uses the V8 engine I thought it would be fine.

First, run the Map/Reduce jobs against different databases; there is a per-database lock (as of version 2.6).
Second, you may need more RAM and faster disk I/O; that is often the bottleneck.
Here is an example of how to use multiple cores: http://edgystuff.tumblr.com/post/54709368492/how-to-speed-up-mongodb-map-reduce-by-20x
"The issue is that there is too much lock contention between the threads. MR is not very altruistic when locking (it yields every 1000 reads), and since MR jobs do a lot of writing too, threads end up waiting on each other. Since MongoDB has individual locks per database [...]"

What does Spark do if a node running a .foreach() fails?

We have a large RDD with millions of rows. Each row needs to be processed with a third-party optimizer that is licensed (Gurobi). We have a limited number of licenses.
We have been calling the optimizer in the Spark .map() function. The problem is that Spark will run many more mappers than it needs and throw away the results. This causes a problem with license exhaustion.
We're looking at calling Gurobi inside the Spark .foreach() method. This works, but we have two problems:
Getting the data back from the optimizer into another RDD. Our tentative plan for this is to write the results into a database (e.g. MongoDB or DynamoDB).
What happens if the node on which the .foreach() method is running dies? Spark guarantees that each foreach only runs once. Does it detect the failure and restart the work elsewhere? Or does something else happen?
In general, if a task executed with foreachPartition dies, the whole job dies.
This means that, if no additional steps are taken to ensure correctness, partial results might already have been acknowledged by an external system, leading to an inconsistent state.
Considering the limited number of licenses, map or foreachPartition shouldn't make any difference. Without going into whether using Spark makes any sense in this case, the best way to resolve it is to limit the number of executor cores to the number of licenses you own.
If the goal here is just to limit the number of concurrent calls to X, I would repartition the RDD to x partitions and then run a partition-level operation, as in the sketch below. I think that should keep you from exhausting the licenses.
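Both suggestions are easy to sketch. Assuming Spark's Java API and a hypothetical limit of 8 licenses (the executor settings, input path, and solver call are made up for illustration):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LicenseBoundedRun {
    public static void main(String[] args) {
        // Option 1: cap total task concurrency at the number of licenses.
        // (spark.executor.instances applies to YARN/Kubernetes deployments.)
        SparkConf conf = new SparkConf()
                .setAppName("gurobi-bounded")
                .set("spark.executor.cores", "2")       // cores per executor
                .set("spark.executor.instances", "4");  // 4 * 2 = 8 concurrent tasks
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> rows = sc.textFile("hdfs:///input/rows"); // placeholder input

        // Option 2: bound concurrency by the partition count instead,
        // one partition per license, processed with a partition-level operation.
        int licenses = 8;
        rows.repartition(licenses).foreachPartition(iterator -> {
            // Acquire one solver/license per partition, stream the rows through
            // it, and write results to an external store (e.g. MongoDB), since
            // foreachPartition does not return a new RDD.
            while (iterator.hasNext()) {
                String row = iterator.next();
                // solveWithGurobi(row); // hypothetical call into the licensed solver
            }
        });

        sc.stop();
    }
}
```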

Node.js and SQLite: performing long queries

I have to perform two queries: query A is long (20 seconds) and query B is fast (1 second).
I want to guarantee that query B still runs fast even while query A is running.
How can I achieve this behaviour?
It may not be easy to do because of how SQLite does locking.
From the official Appropriate Uses For SQLite documentation:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
[...]
SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
As the SQLite documentation itself suggests, it may not be the best fit when you have so much data that a single query takes this long.
There is no easy way to fix that, other than moving to a client/server RDBMS like PostgreSQL.
And since you didn't include the queries that take so long, it's impossible to say much more than that. Maybe your queries could be optimized, but we can't tell.
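For illustration only, and sketched with the sqlite-jdbc driver rather than the Node.js bindings the question uses: giving each query its own connection at least keeps query B from being queued behind query A inside a single connection, and, per the documentation quoted above, two read-only queries can run at the same time. The moment either query writes, the single-writer lock applies again. The file name and the stand-in queries are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TwoConnections {
    public static void main(String[] args) throws InterruptedException {
        // Requires the sqlite-jdbc driver on the classpath; app.db is a placeholder.
        String url = "jdbc:sqlite:app.db";

        // One connection per query, so the fast query is not serialized
        // behind the slow one at the connection level.
        Thread slow = new Thread(() -> runQuery(url, "SELECT 1 /* stand-in for the 20-second query A */"));
        Thread fast = new Thread(() -> runQuery(url, "SELECT 1 /* stand-in for the 1-second query B */"));
        slow.start();
        fast.start();
        slow.join();
        fast.join();
    }

    private static void runQuery(String url, String sql) {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) { /* consume rows */ }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```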

Slow query blocks fast queries until the slow query completes: MongoDB with the Node.js driver

I'm running a slow query/aggregation that takes 3+ seconds, and then other queries get blocked until the slow query completes.
After a slow query is issued, only as many fast queries execute as there are connections in the pool, and then all further operations are blocked until the slow query finishes. After that, fast queries execute normally.
I'm using MongoDB 2.6.7 and mongodb NodeJs driver 1.4.30.
Slow aggregation pipeline:
[{"$unwind": "$status_history"},{"$sort": {"_id": -1}},{"$limit": 100}]
I'm running the above query on a collection of 10k documents; after $unwind this becomes 200k documents, on which $sort then operates. This takes about 5-10 seconds.
After this, simple queries that usually execute in 100-500 ms take 3-10 seconds.
This is what I got from MongoDB support:
poolSize: allows you to control how many TCP connections are opened in parallel. The default value is 5, but you can set it as high as you want. The driver will use a round-robin strategy to dispatch and read from the TCP connections.
If the connection is busy running a slow operation, this could block a subsequent operation.
There are two possible work-arounds:
Increase the connection pool size
Using a larger connection pool when running longer operations reduces the likelihood/frequency of this happening if you have no control over the types of queries being run. It is still possible for an operation to be blocked, and that possibility increases as more long-running operations come in.
With a bigger pool to round-robin through, it is more likely that a long-running operation has completed on a connection by the time that connection is selected from the pool again.
Divert the possibly slower operations to another connection pool
You can create a separate connection pool (that is, another result from MongoClient.connect()) for the slower operations and leave the production pool to serve the fast queries. This ensures that slow/blocking aggregation queries do not freeze the pool used for other operations.
Slower operations are of course hard to determine beforehand, but examples would be queries using a sort, a large limit, a skip, or an aggregation or map-reduce operation.
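The question is about the Node.js driver, but the same split is easy to see with the MongoDB Java driver; the pool sizes, host, and database name below are arbitrary assumptions:

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;

public class SplitPools {
    public static void main(String[] args) {
        ServerAddress server = new ServerAddress("localhost", 27017);

        // Main pool, sized up so that one slow operation holding a connection
        // is less likely to be the next connection handed to a fast query.
        MongoClient fastClient = new MongoClient(server,
                MongoClientOptions.builder().connectionsPerHost(50).build());

        // Separate, smaller pool reserved for known-slow operations
        // (aggregations with $unwind/$sort, map-reduce, large limits/skips),
        // so they never occupy the connections serving fast queries.
        MongoClient slowClient = new MongoClient(server,
                MongoClientOptions.builder().connectionsPerHost(5).build());

        // fastClient.getDatabase("app") ... for routine queries
        // slowClient.getDatabase("app") ... for the slow aggregation pipeline
    }
}
```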
You need to optimize your queries with proper indexes, because MongoDB also has database-level locking, so other queries will be blocked for some time.
Please check here.
Breaking your big scan into multiple small scans and joining the result sets also helps in most cases.
Also, if possible, set up sharding with replica sets so you can distribute your queries.

How to exhaust a machine's resources with RethinkDB?

I'm asking this question because I would like to understand how I can run RethinkDB better: what kind of hardware and filesystem it should be running on, and which other system configurations maximize its throughput.
I'm trying to fill a table as fast as I can with documents of the form {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one Node.js process doing this and I can insert on average 10k documents per second, which is pretty good.
But while this is happening, RethinkDB is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the Node.js process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another Node.js process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of activity there, but not what I expected. I configured two SSDs in RAID0, and a quick dd test says I can read and write at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID0, with an ext4 filesystem (tomorrow I'll try mounting it with noatime to see if that helps). The RethinkDB configuration is all defaults, and the inserts go to a table that has only one shard and one replica. Please feel free to ask about anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not batching writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
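As a concrete starting point for the sharding suggestion, here is a sketch using the RethinkDB Java driver and the table reconfigure command (available in newer RethinkDB versions; the Node.js client used in the question exposes the same call). Host, port, and table names are assumptions:

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

public class ShardedBenchmark {
    public static void main(String[] args) {
        RethinkDB r = RethinkDB.r;
        Connection conn = r.connection().hostname("localhost").port(28015).connect();

        // Split the single-shard table into 32 shards on one machine so the
        // inserts hit 32 independent btrees instead of contending on one.
        r.db("test").table("bench")
         .reconfigure()
         .optArg("shards", 32)
         .optArg("replicas", 1)
         .run(conn);

        conn.close();
    }
}
```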
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
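A sketch that combines the first two suggestions, using the RethinkDB Java driver for illustration; the client count, batch size, host, and table name are assumptions, and the same shape works from the Node.js client:

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInsert {
    public static void main(String[] args) throws InterruptedException {
        RethinkDB r = RethinkDB.r;
        int clients = 16;      // 16-32 concurrent writers, as suggested above
        int batchSize = 1000;  // bigger batches than 200 for such small documents

        ExecutorService pool = Executors.newFixedThreadPool(clients);
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                // One connection per writer thread.
                Connection conn = r.connection().hostname("localhost").port(28015).connect();
                List<Object> batch = new ArrayList<>();
                for (int n = 0; n < batchSize; n++) {
                    batch.add(r.hashMap("n", n).with("rand", Math.random()));
                }
                // Same soft durability the original benchmark already uses.
                r.table("bench").insert(batch)
                 .optArg("durability", "soft")
                 .run(conn);
                conn.close();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
```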

Is it good to use Threads in CLR SQL?

There are 100,000 UPDATE statements stored in a SQL table named EexecuteQueue.
Below are the steps I am planning to take.
Identify the number of logical processors on the database server.
Split the queries in the EexecuteQueue table into (logical processors - 2) groups and execute each group in a different thread.
My assumption is that instead of executing 100,000 UPDATE statements sequentially, the threads will execute 25,000 UPDATE statements each in parallel (if we have 4 threads).
My Question
Is my assumption correct?
Is it good to use Threads in CLR SQL?
Thanks in advance.
My assumption is that instead of executing 100,000 UPDATE statements sequentially, the threads will execute 25,000 UPDATE statements each in parallel (if we have 4 threads). Is my assumption correct?
Yes, but it is completely irrelevant. Doing 25k operations on each of 4 threads by no means implies it will be faster than doing 100k operations on a single thread. Such an assumption is, at best, naive. You need to identify your bottlenecks and address them according to your findings. Read How to analyse SQL Server performance.
Is it good to use Threads in CLR SQL?
No.
To speed up batch updates, use set-based operations, reduce the number of round trips, and batch your commits.
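The original context is SQL Server CLR, but the set-based idea is independent of the host language. Below is a hedged JDBC sketch that assumes the queued work can be reshaped into a staging table of keys and new values (the hypothetical dbo.UpdateQueue; the other table and column names are also made up) instead of 100,000 statement strings:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SetBasedUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; requires the mssql-jdbc driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=app;integratedSecurity=true")) {
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                // One set-based statement replaces 100,000 row-by-row updates:
                // a single round trip, one commit, and SQL Server is free to
                // pick a parallel plan on its own if that helps.
                stmt.executeUpdate(
                    "UPDATE t " +
                    "SET    t.Status = q.NewStatus " +     // hypothetical columns
                    "FROM   dbo.TargetTable AS t " +       // hypothetical target table
                    "JOIN   dbo.UpdateQueue AS q " +       // hypothetical reshaped queue
                    "  ON   q.TargetId = t.Id");
            }
            conn.commit();
        }
    }
}
```

If the work really must stay as 100,000 individual UPDATE statements, JDBC's addBatch/executeBatch with a commit every few thousand statements at least reduces round trips and commit overhead, which is what "reduce the number of round trips, and batch your commits" means in practice.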
