Difference between document-oriented DBs and Bigtable clones - couchdb

Can someone give a head-to-head comparison between them?
We are looking for a suitable storage engine for our weblog history data. We looked at the Bigtable paper and understand that it suits our needs well.
However, I also understand that document-oriented DBs such as MongoDB seem to offer somewhat more powerful schema capabilities -- i.e., they can model our data as well.
I wonder how people choose a scalable NoSQL DB nowadays --- I've read plenty of articles like "we looked at A, B and C, and we decided to use C", but I'd like to see some benchmark numbers. What I am asking is: if MongoDB and the like can provide the same level of performance as Bigtable clones, why don't web companies choose them (and be prepared to deal with potentially more complex data problems later)?
Thanks,
By the way, I read an article (which convinced me at the time) saying Cassandra is not a good fit for map/reduce (M/R) operations -- any comments?

"I read an article (which convinced me at the moment) saying Cassandra does not fit the M/R operation, any comments?"
Cassandra 0.6 supports map/reduce. Your source was obsolete, apparently.

There's a not-too-detailed comparison here (note that it's a .pdf), but it's probably good enough to narrow your search down to 2-3 options.

Related

CouchDB Mango Performance vs Map/Reduce Views

I've just noticed that the release notes of CouchDB 2.0 mention that Mango queries are recommended for new applications. They also mention that Mango indexes are apparently 2x to 10x faster than JavaScript queries, which really surprised me, so I have a number of questions:
Are Map/Reduce views being phased out? I'm expecting the answer to be no, since it seems to me that Mango does not cover all the use cases of Map/Reduce (the easiest example being Reduce itself), and the flexibility of this querying style seems more limited too. But I prefer to ask because of the recommendation:
We recommend all new apps start using Mango as a default.
We know that Map/Reduce views rely on B-trees, but I can't find any insight in the docs or on the mailing list regarding the magic behind Mango. Mango is essentially white magic to me at the minute. Yet I can tell that having in-depth knowledge of how the JavaScript views are indexed behind the scenes was massively helpful for avoiding pitfalls and naive implementations, as well as for optimizing performance. Does anyone have insight into how Mango works? Are the indexes B-trees too? When are the indexes updated, given that there are no longer design documents? Where do the performance gains come from? (These gains are counter-intuitive to me, since in my understanding the performance of JavaScript queries came from the precomputed nature of Map functions.)
What I'm essentially after is, on the one hand, some insight into Mango and, on the other, an overview of how Mango and Map/Reduce are supposed to live together in the 2.x era.
I recently tried to switch my app over to Mango queries, only to end up scrapping the effort completely and switching back to map/reduce. Here are a few of my reasons:
Mango is buggy when dealing with queries that do not exactly specify the index to use. This one drove me batty for a while last weekend: if you don't specify the index, sometimes an alternate index is selected and returns no (or incorrect) results.
Mango performance is not 'magic'. Many types of queries end up doing in-memory searches: Couch selects the best-fit index, then marches through all those records in memory to handle the corner cases. Cloudant hand-waves over some of these issues by saying to use 'text'-based searches, which aren't available in CouchDB.
As you pointed out, Mango simply cannot handle some types of query constructions well. I wouldn't consider my app overly complicated, yet I ran into several situations where I could not construct a suitable Mango query for the task at hand. A major one is searching arrays for tags (for example, checking which users are members of a group). Mango cannot index array elements, so it resorts to full scans in memory.
Views have some very powerful features for transforming search results, in the form of lists. Those don't exist in Mango.
Your mileage may vary, but I just wanted to leave a warning that these are still quite new features.
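For anyone hitting the index-selection bug above: CouchDB's _find endpoint accepts a use_index field that pins a query to a specific index. A minimal sketch (database and index names are hypothetical; assumes CouchDB 2.x, an index created beforehand via POST /app/_index, and Node 18+ for the global fetch):

// Pin the Mango query to a known index instead of trusting the planner.
// Assumes an index created with:
//   {"index": {"fields": ["type", "email"]}, "ddoc": "by-type-email"}
async function findUserByEmail(email) {
  const res = await fetch("http://localhost:5984/app/_find", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      selector: { type: "user", email: email },
      use_index: "by-type-email" // bypass automatic index selection
    })
  });
  const { docs, warning } = await res.json();
  if (warning) console.warn(warning); // CouchDB warns when it falls back to a full scan
  return docs;
}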
Answer from a core developer:
Some good questions. I don't think Mango will ever replace Map/Reduce completely. It is an alternative querying tool. What is great about the Mango query syntax is that it is a lot easier to understand and get started with. And we can use it in a lot of places outside of just querying for documents: it can be used for replication filtering and the changes feed. We hope to soon have support for validating document updates as well.
Underneath, Mango is using Erlang map/reduce, which means it is creating a B-tree index just like map/reduce. What makes it faster is that it uses Erlang/native functions to create the B-tree instead of JavaScript. I wrote a blog post a long time ago about the internals of PouchDB-find [1], which is the Mango syntax for PouchDB. It might help you understand a little more how the internals work. The key thing to understand is that there is a Map query part, which uses the B-tree, and an in-memory filter. Ideally, the less memory filtering you do, the faster your query will be.
I would say that Mango is very much a work in progress, but the basic groundwork is done. There are definitely things we can improve on. I've seen it used quite a bit when developers start a new project, because it's quick and simple for basic querying, like finding a user by email address or finding all users with the name "John Rambo".
Hope that helps.
[1] http://www.redcometlabs.com/blog/2015/12/1/a-look-under-the-covers-of-pouchdb-find
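To make the "Map query part plus in-memory filter" description above concrete, here is a hedged sketch (database and field names are hypothetical; Node 18+ global fetch assumed): the range condition is answered by the B-tree index, while the $regex clause cannot use an index and is applied as a post-filter in memory -- exactly the part to minimize.

// Create an index on "created_at", then query it.
const base = "http://localhost:5984/posts"; // hypothetical database

async function recentDrafts() {
  await fetch(`${base}/_index`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ index: { fields: ["created_at"] }, ddoc: "by-created-at" })
  });

  const res = await fetch(`${base}/_find`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      selector: {
        created_at: { $gt: "2017-01-01" }, // satisfied by the B-tree index
        title: { $regex: "draft" }         // filtered in memory afterwards
      },
      use_index: "by-created-at"
    })
  });
  return (await res.json()).docs;
}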
I am new to Mango and CouchDB, but I think I can provide some insight. Once your index/view is updated, Mango is not any faster. The large performance gain with Mango comes when you are creating the index for the first time, because Couch doesn't need to spawn a separate couchjs process for it.
I found that Mango works well even when some of your documents are large. Currently with CouchDB 2.0.0, at least on Windows, large documents crash the couchjs.exe view server used with Map/Reduce. This is not the case with CouchDB 1.6.1, and it is already fixed in the development version: https://github.com/apache/couchdb-couch/commit/1659fda5dd1808f55946a637fc26c73913b57e96

Which (in_memory) graph DB to choose when modeling data is the focus

I am out of ideas and hope to get some useful input. I am using this question to condense my experiences and share them, hoping to inspire some vendors to take the next step and treat data modeling for graph databases as a first-class concern.
I've been evaluating graph database solutions usable from Node.js for a few weeks. My use case is to save interactions between accounts on different social networks. The goal is to use CPU and memory as efficiently as possible.
My most important requirements are:
in_memory (at least for indexing)
open source (and free to use)
JavaScript/Node.js as a first-class citizen (native performance)
comfortable query and modeling language
Neo4J
I really like Cypher, so my best choice would be Neo4j.
But the major issue with Neo4j is that JavaScript access is not native. It goes through the REST API, which is about ten times (10x) slower than direct Java access. So I took a look at node-neo4j-embedded, but it has been inactive for more than two years. It looks like its author isn't active at all (a bad sign).
ArangoDB
The really nice core developers of ArangoDB answered my question about internals. In the end it means JavaScript is a first-class citizen, because native queries can be pushed out of JS. Looking at the open-source benchmarks, I think they are fair. But I am afraid they didn't use node-neo4j-embedded for their benchmark; the benchmarks compare the REST APIs (edited because of #weinberger's comment). I wish they had compared the native APIs (maybe someone is snoopy enough to give it a try -- let us know!). Update: As I have now noticed, OrientDB has answered the benchmark with a new node.js driver (enabling the command cache by starting the server with -Dcommand.cache.enabled=true -Dcommand.cache.minExecutionTime=3, which isn't fair, because it wasn't a query-cache benchmark!)
Since I'd like to use ArangoDB as a graph database, I would have 3 choices (source: FAQ):
traversing JS objects
using AQL's graph functions (a sketch follows this list)
using the REST API
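To give a feeling for option 2, here is a minimal sketch with the arangojs driver; the graph name "social", the collection "accounts" and the start vertex are all hypothetical:

// Two-hop outbound traversal over a named graph (arangojs driver).
const { Database, aql } = require("arangojs");
const db = new Database({ url: "http://localhost:8529" });

async function friendsOfFriends(startVertexId) {
  // 1..2 OUTBOUND: follow outgoing edges up to two hops from the start vertex
  const cursor = await db.query(aql`
    FOR vertex IN 1..2 OUTBOUND ${startVertexId} GRAPH "social"
      RETURN DISTINCT vertex
  `);
  return cursor.all();
}

friendsOfFriends("accounts/alice").then(console.log);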
In general it isn't as comfortable as Cypher. And I am not sure how to compare them, or what the right way of modeling data is (Neo4j explains this very well). I'd love to have something like that for ArangoDB graphs. It feels like ArangoDB is focused on graph operations, while Neo4j better fits the needs of using graphs when you have more relations than rows (the reason to use graphs instead of relations with joins).
MongoDB
The document-based MongoDB isn't optimized for graph operations, but it has lately gotten an experimental in_memory storage engine. There are also some projects that are either in_memory- or graph-related, but nothing really compelling. And judging from this discussion, it looks like MongoDB isn't what I want to use.
OrientDB
Because there is a comparison of OrientDB vs. MongoDB available (from OrientDB), I thought about using this one. "OrientDB has a hybrid Document-Graph engine" using SQL. I am a former PHP/MySQL expert. But where is the modeling part? Their chapter on working with graphs is not Cypher-like; it is like using SQL for graphs. There is nothing wrong with that, but having used Cypher before, I miss the modeling-like feeling.
If someone has gone through a modeling process with OrientDB and graphs, maybe they could write a tutorial like the one Neo4j has.
Update: About JavaScript access as a first-class citizen, there is news:
"In the next release the speed of this driver will be comparable to the native Java one." The forked node.js driver has been fixed in the last few days.
Update: Before choosing OrientDB, one might want to read the article about some issues, and the discussions linked from there. The article touches a sensitive issue and should be approached with a critical mind. Note from the author of this update: I'm new to editing SO and don't have enough reputation to put this in a comment. I believe this information is a valid point for the discussion; I'm not sure how to place it here according to SO rules.
LokiJS
Before I looked at Neo4j, ArangoDB and MongoDB, I played around with a JavaScript-based in_memory database called LokiJS, which seems to follow the strategy of ignoring everything that slows down performance and efficiency. LokiJS is trying to match the Mongo style (roadmap). The major issue is its poor ability to scale. Of course it isn't a graph database, but it was an interesting solution at the beginning of my project. It also wasn't a perfect feeling having to hunt down all the scattered documentation (maybe they should reboot with GitBook).
All in all, LokiJS is a very interesting project, and I hope they keep moving forward!
LevelDB
Earlier, while writing my degree paper, I looked at LevelDB. Remembering this while writing this post, I searched for "LevelDB in_memory" and got a promising result called MemDown (see also). I haven't tested this find, but maybe someone has experience working and modeling with this solution. It might be the most efficient way if none of the others fit, because I would simply write a lightweight Cypher clone, with the goal of staying as lightweight as I can.
Edit: Due to a comment, here is a link to LevelGraph. As an idea for implementing a Cypher parser for LevelGraph/LevelDB, your starting point would be to compare the following.
Cypher:
CREATE (subject {name: "a"})-[predicate:PREDICATE]->(object {name: "c"})
RETURN subject, predicate, object
LevelGraph:
var levelgraph = require("levelgraph");
var db = levelgraph("db");

var triple = { subject: "a", predicate: "b", object: "c" };
db.put(triple, function (err) {
  // the triple is stored; read it back with db.get({ subject: "a" }, ...)
});
Conclusion
As you have likely noticed, I am no superhero when it comes to graphs. But this is my initial dive into the topic, and I'm trying to get an overview. I assume there are a lot of people out there who want to ask the same questions as me but don't have the time. I hope this post helps a lot of people and evolves, through comments and answers, into a solid overview of how to model data for graphs.
#editors: You are welcome.
#commenters: This is the result of my personal research -- if you have been on a journey like mine, please answer with a short summary like I have done for each DB I've evaluated (and don't forget to address my 4 goals).
The idea of combining node-style performance through native features (e.g. streams) with a high-level query language like Cypher is actually quite neat.
What you likely won't get is any kind of low-level API, since that is rather rare among DB authors and, supposedly, not wanted in their design patterns. So long-running TCP connections should serve just fine.
cypher-stream seems to incorporate all of this, while (superficially judged) maintaining a good style.
Since you likely won't get any further with your search, I'd suggest sending the author a pull request if you need any other features :)
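For reference, a minimal sketch of what using cypher-stream looks like, based on its README (connection details are hypothetical):

// Stream Cypher results row by row instead of buffering them all.
var cypher = require("cypher-stream")("bolt://localhost", "neo4j", "password");

cypher("MATCH (user:User) RETURN user LIMIT 10")
  .on("data", function (row) {
    console.log(row.user); // one result object per matched row
  })
  .on("error", console.error)
  .on("end", function () {
    console.log("done");
  });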
You should take a look at GunDB: https://github.com/amark/gun
It's open source and has a very active and helpful lead developer.
Join us at https://gitter.im/amark/gun

Cassandra or MongoDB or something else for a big online sales site

We are currently using MongoDB as the primary store for a big online sales site, and we are now focusing on scalability across multiple machines.
The site backend is written in node.js, and we use Mongoose as the ODM.
I've seen many blog posts praising the awesome Cassandra DB, and I am starting to think about switching to it. But I am still not sure this is a really good decision, because I haven't found any good ODM/ORM library for Cassandra and node.js (writing raw queries can be a pain, and writing a well-tested ORM/ODM can be a time-consuming task). So I am not sure how much benefit the switch would bring. We use Elasticsearch as our search engine, and it works excellently in combination with MongoDB; I am asking myself whether it will do as well with Cassandra.
If you have any experience with this, it would be very helpful.
Thank you!
Cassandra is a very nicely designed database that can handle a lot of scenarios. MongoDB is also a really good DB engine. So let me compare a couple of the main bullet points for you.
Always-on system
Cassandra is really great when you need to provide 24x7 operations across multiple data centers. If you have more than one data center with multiple servers in each of them, then Cassandra is great for you. Cassandra can sync writes to more than one data center and maintain the desired data consistency across complex setups. Recovery and re-sync are also quite easy.
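As a sketch of what "desired data consistency" means in practice: the DataStax Node.js driver (cassandra-driver) lets you choose a consistency level per query. The keyspace, table, and data-center names below are hypothetical:

// LOCAL_QUORUM acknowledges a write once a quorum of replicas in the
// local data center has it; remote data centers catch up asynchronously.
const cassandra = require("cassandra-driver");

const client = new cassandra.Client({
  contactPoints: ["10.0.0.1", "10.0.0.2"],
  localDataCenter: "dc1", // route requests to the local DC first
  keyspace: "shop"
});

function recordOrder(orderId, total) {
  return client.execute(
    "INSERT INTO orders (order_id, total) VALUES (?, ?)",
    [orderId, total],
    { prepare: true, consistency: cassandra.types.consistencies.localQuorum }
  );
}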
MongoDB, on the other hand, is easy to operate. If you have one data center and only a couple of servers, it might be a perfect fit (although the global write lock might become a pain over time). In simple deployments it's easy to maintain and monitor.
Scalability
To continue the statements above: Cassandra is linearly scalable. There is literally no limit on how big the cluster can be. Your writes will always stay fast, while reads may become more complicated over time, depending on the structure of your data.
Denormalization of data
With Cassandra, your writes and reads can be extremely fast if you create a structure that reflects what you need to get out of your data. There is no query language (well, there is, but it's not exactly SQL) that you can use to reorganize your result set with aggregates, groupings, etc. Some things are doable and some are not; that is very specific to the Cassandra data model. You will have to implement a lot of things on your own and write the results back to the DB: counters for aggregation, different groupings, and so on.
In comparison, MongoDB is easy to use, easier to learn, and more flexible, both for development (as far as the learning curve and effort go) and for implementing business logic (as far as time and effort are concerned). That is, in a way, the reason why there are ORM engines for MongoDB and only a couple of (very limited) ones for Cassandra.
To summarize: both DBs are really good... if you embrace their limitations. If you have only 100GB of data and need a flexible, easy-to-implement DB engine, I would stick with MongoDB; alternatively, take a look at RethinkDB, which has a very similar model and, in my personal opinion, a much better clustering/data-center replication implementation.
Cassandra is a great option if you will soon need to store TBs of data and deploy your apps across multiple data centers, while accepting the cost of the additional effort to implement the same features and maintain similar capabilities.
Don't take it personally that I used the word "only" while describing your data set. Yes, it's not big -- my company stores more than 20 TB these days... so yeah, 100GB is really not that much.
To stop everyone from pointing out that I should compare some other features or differences between the two: this is just a rough, high-level overview of the things I consider relevant to the problem, not a full comparison or analysis. But feel free to point out what I have missed, and I will be happy to include new stuff in this answer.

Picking a database for an ads/analytics service

I now have a project building an ad exchange service (something like Google DoubleClick), and I have to pick a highly scalable database. I'm considering MongoDB and Cassandra.
Cassandra:
fits our write-intensive system (+)
aggregation (very important for analytics) looks hard (is there a good way? I just read a slide deck about Twitter's Rainbird, which seems good) (?)
I don't like Java much. (-)
MongoDB:
seems easier for analytics (has built-in aggregation functions; see the sketch after this list) (+)
more RAM-consuming? (because it's document-oriented vs. Cassandra's key-value model) (?)
how does write performance compare to Cassandra? (?)
JavaScript shell and a natural fit with node.js (an important part of our project) (+)
http://pastebin.com/raw.php?i=FD3xe6Jt - this article makes me cautious. (-)
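Regarding the "built-in aggregation" point above: in current MongoDB that means the aggregation pipeline (which postdates this question; it arrived in MongoDB 2.2). A hedged sketch with the official Node.js driver, using hypothetical collection and field names (one document per ad impression):

// Total spend and hit count per campaign since a given date.
const { MongoClient } = require("mongodb");

async function spendPerCampaign() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const impressions = client.db("ads").collection("impressions");

  const totals = await impressions.aggregate([
    { $match: { ts: { $gte: new Date("2024-01-01") } } },           // filter first
    { $group: { _id: "$campaign",                                   // one bucket per campaign
                spend: { $sum: "$cost" }, hits: { $sum: 1 } } },
    { $sort: { spend: -1 } }                                        // biggest spenders first
  ]).toArray();

  await client.close();
  return totals;
}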
Can you help me pick one, or answer some of my questions above?
Thanks.
I don't know about Cassandra, but MongoDB has some advantages for analytics: high concurrency, sharding, storing everything about an event in a single document, and features like upsert and $inc.
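A minimal sketch of the upsert + $inc pattern for real-time counters (official Node.js driver; database, collection, and field names are hypothetical):

// One document per (day, page) pair: the first hit creates it (upsert),
// every later hit is an atomic in-place increment (no read-modify-write race).
const { MongoClient } = require("mongodb");
const client = new MongoClient("mongodb://localhost:27017");

async function trackHit(page) {
  await client.connect(); // safe to call repeatedly; connects once per process
  await client.db("analytics").collection("daily_hits").updateOne(
    { day: new Date().toISOString().slice(0, 10), page: page }, // e.g. "2024-05-01"
    { $inc: { hits: 1 } },
    { upsert: true }
  );
}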
For more detailed explanations check the following resources:
MongoDB Analytics - videos
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
http://www.mongodb.org/display/DOCS/Use+Cases
http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://nosql.mypopescu.com/post/3508305955/fast-asynchronous-analytics-with-mongodb
http://blog.opengovernment.org/2011/02/24/fast-asynchronous-analytics-with-mongodb/
http://blog.10gen.com/post/4416876632/london-startup-ubervu-on-storing-5tb-of-data-in-mongodb
It depends a lot on your domain; in most cases one would probably choose Mongo.
For example, http://square.github.com/cube/ is built on Mongo.
Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards. For example, you might use Cube to monitor traffic to your website, counting the number of requests in 5-minute intervals:
Most use cases for Cassandra stem from the need for high availability, which is its main feature as far as I know. Your needs seem to center on a cheap way to shove queryable data into a scale-out DB, and Mongo almost matches an RDBMS with regard to querying. Mongo is also probably easier to deal with.
I think Cassandra is a good fit for this problem.
You don't need to know much Java to get it running (other than installing Java), as long as there is a client library in your chosen language.
Cassandra 0.8+ now has atomic counter support - perfect for impression/click tracking.
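A hedged sketch of what counter tracking looks like today (this answer predates CQL; the example uses the cassandra-driver package, and the keyspace/table names are hypothetical):

// Schema, created once:
//   CREATE TABLE ads.clicks_by_day (
//     ad_id text, day text, clicks counter,
//     PRIMARY KEY (ad_id, day));
// Counter columns can only be incremented or decremented, never set,
// and a counter table may hold only counters plus its primary key.
const cassandra = require("cassandra-driver");
const client = new cassandra.Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "dc1",
  keyspace: "ads"
});

function recordClick(adId, day) {
  return client.execute(
    "UPDATE clicks_by_day SET clicks = clicks + 1 WHERE ad_id = ? AND day = ?",
    [adId, day],
    { prepare: true }
  );
}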
You could also run Hadoop on top of Cassandra, giving you a proven platform for writing map/reduce jobs to do analytics/aggregations and store the results back in Cassandra.
Check out this slideshow about Cassandra and Hadoop: http://www.slideshare.net/jeromatron/cassandrahadoop-4399672
I hope that helps.

Is NoSQL ideal to store stats?

I'm not terribly familiar with NoSQL systems, but I remember reading a while back that they are ideal for handling statistical data.
Since I'm about to start writing code that will record data like "how many users were registered on each day", I was thinking I could use this as an opportunity to learn more about NoSQL, if it fits the bill.
If NoSQL is indeed ideal for this, could you explain why? And which specific systems are best suited to this particular need?
So, after the first answer, maybe it's helpful to clarify a bit more.
I currently have a PostgreSQL database from which I'll get the data. It will be very simple, with no calculations needed. For example, I'll just get a result set with the number of users registered on each day for the past month (so it'll basically be a set of date/user-count value pairs) and save that in another table/database.
Thanks!
It depends on what sorts of analysis you are going to do on these stats. If you are going to do a lot of different operations (averaging, summing, joining...), you may find NoSQL solutions to be more of a pain than they are worth.
However, if you are storing stats mostly for display purposes, or for very specific analysis routines, NoSQL solutions start to shine.
If your data is small enough, stick with a SQL solution, which gives you the benefit of a full query engine to work with; but if you have lots of values (one value a day is nothing, even if you run for a million years) and are worried about storage size and performance, NoSQL options may once again be worth it.
If your data is semi-structured, take a look at CouchDB, which offers some rudimentary indexing and querying support that could provide a basis for analysis routines (a sketch follows). If you are storing individual values with very little structure, my best advice would be to look at Tokyo Cabinet and Tokyo Tyrant, which are absolutely incredible options for key-value storage.
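For the registrations-per-day case from the question, a CouchDB map/reduce view would look roughly like this (a sketch; the document shape -- one document per signup carrying an ISO registered_at timestamp -- is hypothetical):

// Design document: key each signup by its calendar day, sum the 1s per key.
const designDoc = {
  _id: "_design/stats",
  views: {
    registrations_by_day: {
      map: `function (doc) {
        if (doc.registered_at) {
          emit(doc.registered_at.slice(0, 10), 1); // "2011-03-07" -> 1
        }
      }`,
      reduce: "_sum" // built-in reduce: totals per distinct key
    }
  }
};
// After PUTting this document into the database,
//   GET /mydb/_design/stats/_view/registrations_by_day?group=true
// returns one row per day with that day's signup count.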
NoSQL systems tend to optimize for the case where data is stored frequently but accessed infrequently. In the case of statistics, you might frequently gather lots of data from a (social) site in small bits, which is what they are optimized for. But retrieval and analysis might be slower... It depends, of course, on which NoSQL system you decide to use.
