I have a MongoDB database with thousands of records, each holding a very long vector.
I am looking for correlations between an input vector and the vectors in my MongoDB data set, using a certain algorithm.
Pseudocode:
function find_best_correlation(input_vector):
    max_correlation = 0
    return_vector = []
    foreach reference_vector in dataset:
        correlation = calculateCorrelation(input_vector, reference_vector)
        if correlation > max_correlation then:
            max_correlation = correlation
            return_vector = reference_vector
    return return_vector
This is a very good candidate for the map-reduce pattern, as I don't care about the order in which the calculations are run.
The issue is that my database is on one node.
I would like to run many mappings simultaneously (I have an 8-core machine).
From what I understand, MongoDB only uses one thread of execution per node, so in practice my data set is being processed serially.
Is this correct?
If so, can I configure the number of processes/threads per map-reduce run?
If I run multiple map-reduce jobs in parallel threads and then aggregate the results, will I get a substantial performance increase (has anybody tried this)?
If not, can I have multiple replicas of my DB on the same node and "trick" MongoDB into running on two replicas?
Thanks!
Map-reduce in MongoDB uses SpiderMonkey, a single-threaded JavaScript engine, so it is not possible to configure multiple processes (and there are no "tricks"). There is a JIRA ticket to use a multi-threaded JS engine, which you can follow here:
https://jira.mongodb.org/browse/SERVER-2407
If possible, I would consider looking into the new aggregation framework (available in MongoDB version 2.2), which is written in C++ instead of JavaScript and may offer performance improvements:
http://docs.mongodb.org/manual/applications/aggregation/
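For illustration, here is a minimal sketch of calling the aggregation framework from a recent Java driver (3.7+ API). The connection string, the "vectors" collection and the precomputed "score" field are all hypothetical; this only shows that the pipeline is evaluated server-side, it is not a correlation implementation:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.Arrays;

public class AggregationSketch {
    public static void main(String[] args) {
        // Hypothetical connection string, database and collection names.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> vectors = client.getDatabase("mydb").getCollection("vectors");

            // A trivial pipeline: group all documents and take the maximum of a
            // (hypothetical) precomputed "score" field. The pipeline runs on the
            // server inside the aggregation framework, not in the JS engine.
            Document best = vectors.aggregate(Arrays.asList(
                    Aggregates.group(null, Accumulators.max("maxScore", "$score"))
            )).first();

            System.out.println(best);
        }
    }
}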
Related
I am trying to do CRUD operations on a very large MongoDB data set, around 20 GB, and there can be multiple versions of this data. Can anyone guide me on how to handle CRUD operations on data of this size and how to maintain previous versions of the data in MongoDB?
I am using NodeJS as the backend, and I can also use any other database if required.
MongoDB is a reliable database; I am using it to process 10-11 billion records every single day, and NodeJS should also be fine as long as you handle the files as streams of data.
Things you should do to optimize:
Indexing - this will be the biggest part. If you want faster queries you should look into indexing in MongoDB: every query needs a matching index, or you are going to have a tough time with query performance (see the sketch after this list).
Sharding and replication - these will help you organise the data and increase query speed. Replication lets you separate your reads and writes (there are downsides to replication; you can read about them in the MongoDB documentation).
These are the main things you need to consider; there are many more, but this should get you started... ;) If you need any help, please do let me know.
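To make the indexing point concrete, here is a small sketch with the MongoDB Java driver (the NodeJS driver has equivalent createIndex/find calls). The collection and field names (docKey, version) are made up; index whatever fields your real queries filter and sort on:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class IndexSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs = client.getDatabase("mydb").getCollection("documents");

            // Hypothetical fields: index the fields your queries filter on,
            // e.g. a document key plus a version number for keeping old versions.
            docs.createIndex(Indexes.compoundIndex(
                    Indexes.ascending("docKey"),
                    Indexes.descending("version")));

            // This query can now be served by the index instead of a collection scan.
            Document latest = docs.find(Filters.eq("docKey", "invoice-42"))
                    .sort(Sorts.descending("version"))
                    .first();
            System.out.println(latest);
        }
    }
}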
I am using Hazelcast 3.6.1, set up as server/client. A Map lives on the server (a single node) and holds about 4 GB of data. My program creates a client and then needs to look up a small amount of data (around 30 MB). At first I fetched entries from the map and looped through all of them to search for the data of interest, and before I knew it the client process had grown to 4 GB: each get on the map lazily loaded an entry into client memory until all the data was loaded. So I discovered aggregation, which I was under the impression is done entirely server-side, with only the part I am interested in returned to the client, but the client process still grows to 350 MB!
Is aggregation solely done on the server?
Thanks
First of all, you should upgrade to a Hazelcast 3.8.x version, since the new aggregation system is way faster. Apart from that it depends on what you are trying to aggregate, but if you do real aggregations like sum, min or similar, aggregation is the way to go. The documentation for 3.8.x fast-aggregations is available here: http://docs.hazelcast.org/docs/3.8.3/manual/html-single/index.html#fast-aggregations
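As a rough illustration, a 3.8-style fast aggregation from the Java client might look like the sketch below. The map name "orders" and the "amount" attribute are made up; the point is that the aggregation runs on the cluster members and only the final value is returned to the client:

import com.hazelcast.aggregation.Aggregators;
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class FastAggregationSketch {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        try {
            // Hypothetical map of orders keyed by id, values with a numeric "amount" field.
            IMap<Long, Order> orders = client.getMap("orders");

            // The sum is computed member-side; only the aggregated value crosses the wire.
            double total = orders.aggregate(Aggregators.doubleSum("amount"));
            System.out.println("total = " + total);
        } finally {
            client.shutdown();
        }
    }

    // Hypothetical value type; must be serializable for Hazelcast.
    public static class Order implements java.io.Serializable {
        public double amount;
    }
}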
After some testing, it appears that the collator portion of the aggregator is run on the client.
I have to perform 2 queries: query A is long (20 seconds) and query B is fast (1 second).
I want to guarantee that query B still runs fast even while query A is running.
How can I achieve this behaviour?
It may not be easy to do because of how SQLite does locking.
From the official Appropriate Uses For SQLite documentation:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
[...]
SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
As the SQLite documentation suggests, SQLite may not be the best fit when you have so much data that a single query takes this long.
There is no easy fix for that, other than moving to a client/server RDBMS such as PostgreSQL.
And since you didn't include the queries that take so long, it's impossible to say much more than that. Your queries might well be open to optimization, but we can't tell from here.
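One thing worth noting from the quoted documentation: if both statements are reads, SQLite happily serves simultaneous readers as long as each thread uses its own connection. A rough sketch using the xerial sqlite-jdbc driver, with a hypothetical data.db file and table names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConcurrentReadsSketch {
    public static void main(String[] args) throws Exception {
        // Each query gets its own connection; SQLite allows many concurrent readers.
        Thread slow = new Thread(() -> runQuery("SELECT count(*) FROM big_table"));     // "query A"
        Thread fast = new Thread(() -> runQuery("SELECT * FROM small_table LIMIT 10")); // "query B"
        slow.start();
        fast.start();
        slow.join();
        fast.join();
    }

    private static void runQuery(String sql) {
        // Hypothetical database file and table names.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // consume the rows; real code would do something useful here
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}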
We have run into a new problem with our ArangoDB installation. If we send a complex AQL query, such as iterating over multiple collections to find specific information and then following edges, the whole database blocks. We see that one of our three CPU cores is at 100% while the other two sit at around 0-1%. While the AQL query runs, the database does not respond to any other request and the web interface is unreachable too. This means that all processing is halted until that one query finishes.
There are two problems here:
First: the query takes much too long (graph queries).
Second: the database does not respond while that one query is running.
Any ideas/solutions for this problem? What are the biggest databases/graphs you have successfully worked with?
Thx, secana
ArangoDB 2.8 contains deadlock detection, so ArangoDB will now raise an exception if your query blocks on locking.
ArangoDB 2.8 also offers fast graph traversals which improve graph performance a lot.
Another good solution is to offload reads to a second instance using a replication slave.
With RocksDB as the storage engine (available since 3.2) there are no collection-level locks anymore, which means most queries can be executed in parallel without blocking: https://docs.arangodb.com/3.4/Manual/Architecture/StorageEngines.html
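For completeness, a rough sketch of running a native AQL traversal (the fast path since 2.8) from the ArangoDB Java driver. The database, graph and start-vertex names are made up, and the exact query(...) signature varies between driver versions (this is the pre-7.x form):

import com.arangodb.ArangoCursor;
import com.arangodb.ArangoDB;
import com.arangodb.ArangoDatabase;
import com.arangodb.entity.BaseDocument;

import java.util.HashMap;
import java.util.Map;

public class TraversalSketch {
    public static void main(String[] args) {
        ArangoDB arangoDB = new ArangoDB.Builder().host("127.0.0.1", 8529).build();
        try {
            ArangoDatabase db = arangoDB.db("mydb");

            // Native AQL traversal: follow outbound edges up to two hops
            // from a start vertex in a (hypothetical) named graph.
            String aql = "FOR v IN 1..2 OUTBOUND @start GRAPH 'myGraph' RETURN v";
            Map<String, Object> bindVars = new HashMap<>();
            bindVars.put("start", "vertices/123");

            ArangoCursor<BaseDocument> cursor = db.query(aql, bindVars, null, BaseDocument.class);
            cursor.forEachRemaining(doc -> System.out.println(doc.getKey()));
        } finally {
            arangoDB.shutdown();
        }
    }
}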
I have to execute an operation that launches a lot of Map/Reduce jobs (~400), but each Map/Reduce runs on a different collection, so there can't be any concurrent writes.
To improve the performance of this operation I parallelized it by creating a thread on the application side (I use the Java driver) for each Map/Reduce (note that I don't use sharding).
But when I compared the results, the parallel version came out worse than the sequential (single-threaded) one.
To be more precise: 341 seconds for the sequential execution, 904 seconds for the parallel one.
So instead of a better execution time, it takes nearly three times as long.
Does anyone know why MongoDB doesn't like parallelization of Map/Reduce processes?
I found an article about it (link), but now that MongoDB uses the V8 engine I thought this should be fine.
First, run the Map/Reduces against different databases; there is a per-database lock (as of version 2.6).
Second, you may need more RAM and faster disk I/O; that may be the bottleneck.
Here is an example of how to use multiple cores: http://edgystuff.tumblr.com/post/54709368492/how-to-speed-up-mongodb-map-reduce-by-20x
"The issue is that there is too much lock contention between the threads. MR is not very altruistic when locking (it yields every 1000 reads), and since MR jobs does a lot of writing too, threads end up waiting on each other. Since MongoDB has individual locks per database"