Does CouchDB really split views across servers? - couchdb

I'm currently delving into CouchDB, and I am puzzled by the distribution of Map-Reduce computations in views. I see a lot of resources mentioning that Map-Reduce is inherently distributed, because you can process one half of your data on server A, the other half on server B, and then reduce both results. One example would be slide 16 of this presentation:
http://www.slideshare.net/gabriele.lana/couchdb-vs-mongodb-2982288
This seems fairly logical, but:
CouchDB does not seem to provide an API for dispatching computations to several servers. The only distribution it appears to provide is replication of the entire data set to other servers (which would then, I assume, compute their own view data).
CouchDB uses a B-Tree to store view data based on keys that are generated in the Map step of the view algorithm, which precludes appropriate partitioning of documents based on what server they should be on.
So, does CouchDB distribute Map-Reduce computations at all? Or is the Map-Reduce property used merely to cache values in the B-Tree nodes?

You are looking for BigCouch. It turns CouchDB into a cluster and uses distributed MapReduce.

CouchDB does NOT distribute views across nodes, since CouchDB is not a distributed application. You can only replicate continuously from one instance to another, and each instance still works alone.
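For reference, the property the question describes (process one half of the data on server A, the other half on server B, then reduce both results) can be sketched in plain Python. This is not CouchDB code: the documents, the function names, and the two "servers" are made up for illustration. It only shows why a reduce that can be applied again to its own partial results makes the split possible. CouchDB reduce functions do receive a rereduce flag for exactly this kind of second pass, which is also what lets partial results be cached in B-tree nodes.

```python
def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["type"], doc["amount"]


def reduce_values(values):
    """Reduce step: a sum, so it can also be applied to partial sums."""
    return sum(values)


def run_on_server(docs):
    """Run map + reduce over the subset of documents held by one server."""
    grouped = {}
    for doc in docs:
        for key, value in map_doc(doc):
            grouped.setdefault(key, []).append(value)
    return {key: reduce_values(vals) for key, vals in grouped.items()}


def combine(partials):
    """Re-reduce: merge the per-server partial results into one result."""
    grouped = {}
    for partial in partials:
        for key, value in partial.items():
            grouped.setdefault(key, []).append(value)
    return {key: reduce_values(vals) for key, vals in grouped.items()}


# Hypothetical documents, split in half across two hypothetical servers.
docs = [
    {"type": "book", "amount": 3},
    {"type": "pen", "amount": 1},
    {"type": "book", "amount": 4},
    {"type": "pen", "amount": 5},
]
server_a, server_b = docs[:2], docs[2:]

distributed = combine([run_on_server(server_a), run_on_server(server_b)])
single = run_on_server(docs)
```

Both `distributed` and `single` hold the same totals, which is the whole point: the split is invisible in the result as long as the reduce step can consume its own output.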

Related

Why doesn't Cassandra have secondary indexes?

Cassandra is positioned as scalable and fast database.
Why, in terms of technical details, can't these goals be accomplished with secondary indexes?
Cassandra does indeed have secondary indexes. But secondary index usage doesn't work well with distributed databases, and it's because each node only holds a subset of the overall dataset.
I previously wrote an answer which discussed the underlying details of secondary index queries:
How do secondary indexes work in Cassandra?
While it should help give you some understanding of what's going on, that answer is written from the context of first querying by a partition key. This is an important distinction, as secondary index usage within a partition should perform well.
The problem is that when you query only by a secondary index, Cassandra cannot guarantee that all of your data can be served by a single node. When this happens, Cassandra designates a node as a coordinator, which in turn queries all other nodes for the specified indexed values.
Essentially, instead of performing sequential reads from a single node, secondary index usage forces Cassandra to perform random reads from all nodes. Now you don't have just disk seek time, but also network time complicating things.
The recommendation for Cassandra modeling is to duplicate your data into new tables to support the desired query. This adds some other complications around keeping the data in sync. But (when done correctly) it ensures that your queries can indeed be served by a single node. That's a tradeoff you need to make when building your model. You can have convenience or performance, but not both.
So yes, Cassandra does have secondary indexes, and Aaron's answer does a great job of explaining why they perform poorly.
You see many people trying to solve this issue by writing their data to multiple tables. This is done so they can be sure that the data they need to answer the query that would traditionally rely on a secondary index is on the same node.
Some of the recent iterations of Cassandra have this 'built in' via materialized views. I've not really used them since 3.0.11, but they are promising. The problems I had at the time were primarily with adding them to tables with existing data, and they had a surprisingly large amount of overhead on write (increased latency).
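To make the difference concrete, here is a toy simulation in Python. It is not Cassandra code: the four "nodes", the `node_for` placement function (CRC32 modulo the node count), and the user data are all assumptions for illustration. It counts how many nodes a query has to touch when it goes through a node-local index versus a duplicated table partitioned by the queried column.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    """Toy placement rule: hash the partition key onto one of the nodes."""
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_NODES


users = [("alice", "NL"), ("bob", "US"), ("carol", "NL"),
         ("dave", "DE"), ("erin", "US")]

# Base table, partitioned by user name. Each node only knows its own rows,
# so any index on "country" is local to the node as well.
users_by_name = [dict() for _ in range(NUM_NODES)]
for name, country in users:
    users_by_name[node_for(name)][name] = country

# Duplicated query table, partitioned by country.
users_by_country = [dict() for _ in range(NUM_NODES)]
for name, country in users:
    users_by_country[node_for(country)].setdefault(country, []).append(name)


def query_by_local_index(country):
    """Index-only query: the coordinator has to ask every node."""
    found, contacted = [], 0
    for node in users_by_name:
        contacted += 1
        found.extend(n for n, c in node.items() if c == country)
    return sorted(found), contacted


def query_by_partition(country):
    """Query table: the partition key tells us the single node to ask."""
    node = users_by_country[node_for(country)]
    return sorted(node.get(country, [])), 1
```

Calling `query_by_local_index("NL")` and `query_by_partition("NL")` returns the same names, but the first one reports `NUM_NODES` contacted nodes and the second reports one.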

Inject Custom Sharding in Cassandra or Couchbase

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance if I want to pin data to shards by one of the data properties.
Couchbase hashes the key of the document to decide which shard (vBucket) the document should be associated with. The SDK uses the same algorithm to find out which shard the document is located in when you want to retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
Cassandra decides where the data goes by the partition key. So if you use the property you want to "pin" by as the partition key, I think it will accomplish what you're asking for. However, you don't pick the replicas explicitly, and they can change as hosts are removed from and added to the cluster.
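Both answers boil down to "the shard is a function of a key you choose". A minimal Python sketch of that idea follows. The shard count, the CRC32-modulo rule, and the order documents are assumptions for illustration; neither Couchbase nor Cassandra is claimed to use exactly this function.

```python
import zlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(key):
    """Illustrative placement rule: hash the key, take it modulo the shards."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# Hypothetical documents: twelve orders belonging to three customers.
docs = [{"id": f"order-{i}", "customer": f"customer-{i % 3}"} for i in range(12)]

# Hashing the document key spreads the orders over the shards.
by_doc_key = Counter(shard_for(d["id"]) for d in docs)

# Using the "pin" property as the partition key keeps every order of a
# customer on the same shard, because the hash input is identical.
shards_per_customer = {}
for d in docs:
    shards_per_customer.setdefault(d["customer"], set()).add(shard_for(d["customer"]))
```

Each entry of `shards_per_customer` contains exactly one shard number, which is the "pinning" effect. The flip side is the hot-shard risk mentioned above: one very large customer lands entirely on one shard.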

How to export large Neo4j datasets for analysis in an automated fashion

I've run into a technical challenge around Neo4j usage that has had me stumped for a while. My organization uses Neo4j to model customer interaction patterns. The graph has grown to a size of around 2 million nodes and 7 million edges. All nodes and edges have between 5 and 10 metadata properties. Every day, we export data on all of our customers from Neo4j to a series of python processes that perform business logic.
Our original method of data export was to use paginated cypher queries to pull the data we needed. For each customer node, the cypher queries had to collect many types of surrounding nodes and edges so that the business logic could be performed with the necessary context. Unfortunately, as the size and density of the data grew, these paginated queries began to take too long to be practical.
Our current approach uses a custom Neo4j procedure to iterate over nodes, collect the necessary surrounding nodes and edges, serialize the data, and place it on a Kafka queue for downstream consumption. This method worked for some time but is now taking long enough so that it is also becoming impractical, especially considering that we expect the graph to grow an order of magnitude in size.
I have tried the cypher-for-apache-spark and neo4j-spark-connector projects, neither of which have been able to provide the query and data transfer speeds that we need.
We currently run on a single Neo4j instance with 32GB memory and 8 cores. Would a cluster help mitigate this issue?
Does anyone have any ideas or tips for how to perform this kind of data export? Any insight into the problem would be greatly appreciated!
As far as I remember, Neo4j doesn't support horizontal scaling, and all data is stored on a single node. To use Spark, you could try storing your graph on 2+ nodes and loading parts of the dataset from these separate nodes to "simulate" parallelization. I don't know whether both of the connectors you mention support this.
But as told in the comments of your question, maybe you could try an alternative approach. An idea:
Find a data structure representing everything you need to train your model.
Store such "flattened" graph in some key-value store (Redis, Cassandra, DynamoDB...)
Now if something changes in the graph, push the message to your Kafka topic
Add consumers that update the graph and, directly afterwards, your key-value store (that is, update only the graph branch affected by the change; there is no need to export the whole graph or to change the key-value store at the same moment, although this will very probably mean duplicating some logic)
Make your model querying directly the key-value store.
It also depends on how often your data changes and on how deep and broad your graph is.
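A rough Python sketch of the steps above, with everything in memory. The `deque` stands in for the Kafka topic, the dicts stand in for the graph and the key-value store, and the message format and function names are invented for illustration.

```python
from collections import deque

# Stand-in for the graph: customer id -> ids of surrounding nodes.
graph = {"c1": {"o1", "o2"}, "c2": {"o3"}}


def flatten(customer_id):
    """Build the flattened record the downstream logic needs for one customer."""
    return {"customer": customer_id,
            "neighbors": sorted(graph.get(customer_id, ()))}


# One-time full export into the stand-in key-value store.
kv_store = {cid: flatten(cid) for cid in graph}

# Stand-in for the Kafka topic that carries graph changes.
topic = deque()


def publish_change(customer_id, neighbor_id):
    """Producer side: describe the change instead of re-exporting the graph."""
    topic.append({"customer": customer_id, "add_neighbor": neighbor_id})


def consume_all():
    """Consumer side: apply each change to the graph, then refresh only the
    flattened record of the affected customer."""
    updated = []
    while topic:
        msg = topic.popleft()
        cid = msg["customer"]
        graph.setdefault(cid, set()).add(msg["add_neighbor"])
        kv_store[cid] = flatten(cid)
        updated.append(cid)
    return updated


publish_change("c2", "o4")
updated = consume_all()
```

After the run, only the record for `c2` has been rebuilt; `c1` is untouched. The daily job then reads from `kv_store` instead of querying the graph.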
Neo4j Enterprise supports clustering. You could use the Causal Cluster feature, launch as many read replicas as needed, and run the queries in parallel on the read replicas. See this link: https://neo4j.com/docs/operations-manual/current/clustering/setup-new-cluster/#causal-clustering-add-read-replica

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one node has to get or accept the change. This means fast reads/writes, but low consistency (another machine may deliver older information until consistency has been achieved).
QUORUM - A majority of the replicas (more than half) must get or accept the change. Reads and writes are not as fast, but you get FULL consistency IF you use it for BOTH reads and writes. That's because once more than half of the replicas have your data after an insert, update or delete, any read from more than half of the replicas will include at least one replica with the most recent information, which is the one that gets delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept and one node was down or busy, you wouldn't be able to query.
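The overlap argument is plain arithmetic, so here is a small Python check of it. The helper names are mine, and the replication factors are just example values. As far as I know, Cassandra computes QUORUM as the replication factor divided by two, rounded down, plus one.

```python
from itertools import combinations


def quorum(replication_factor):
    """Smallest number of replicas that is more than half of them."""
    return replication_factor // 2 + 1


def always_intersect(replication_factor, write_acks, read_acks):
    """Brute force: does every possible set of replicas that acknowledged a
    write share at least one replica with every possible read set?"""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_acks)
    )


rf = 3  # example replication factor
quorum_is_consistent = always_intersect(rf, quorum(rf), quorum(rf))
one_is_consistent = always_intersect(rf, 1, 1)
```

With three replicas, `quorum(3)` is 2, and any two writers overlap with any two readers. With ONE on both sides, a read can land on a replica the write never reached, so `one_is_consistent` is `False`. The general rule is that reads plus writes must exceed the replication factor.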
Pros
Cassandra is the solution for performance, linear scalability and avoiding single points of failure (you can have machines down and the others will take over the work). It also does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized Views can help with this by avoiding the need to write changes to multiple tables, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
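To show what "model the tables to serve the queries" means in practice, here is a sketch in Python with two dicts standing in for the "ProductsByStore" and "ProductsByDepartment" tables. The `add_product` helper and the sample products are made up; in a real application each write would be two inserts, one per table.

```python
from collections import defaultdict

# Stand-ins for the two query tables: same rows, different partition key.
products_by_store = defaultdict(list)
products_by_department = defaultdict(list)


def add_product(name, store, department):
    """Write the same row to both tables so each query has its own table."""
    row = {"name": name, "store": store, "department": department}
    products_by_store[store].append(row)
    products_by_department[department].append(row)


add_product("Laptop", "New York City", "Computers")
add_product("Mouse", "Boston", "Computers")
add_product("Apples", "New York City", "Groceries")

# Each query is a lookup by partition key: no join and no filtering.
nyc_products = [r["name"] for r in products_by_store["New York City"]]
computer_products = [r["name"] for r in products_by_department["Computers"]]
```

The cost is visible in `add_product`: every write happens twice, and the application is responsible for keeping both tables in step.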

How to make Riak data localized?

I'm designing a Riak cluster at the moment and wondering if it is possible to hint Riak that a specific bunch of keys should be placed on a single node of the cluster?
For example, there is some private data for the user, that only she is able to access. This data contains ~10k documents (too large to be kept in one key/document), and to serve one page, we need to retrieve ~100 of them. It would be better to keep the whole bunch on a single node + have the application on the same instance to make this faster.
AFAIK it is easy on Cassandra: just use OrderedPartitioner and keys like this: <hash(username)>/<private data key>. That way, almost all user keys will be kept on a single node.
One of the points of using Riak is that your data is replicated and evenly distributed throughout the cluster, thus improving your tolerance for network partitions and outages. Placing data on specific nodes goes against that goal and increases your vulnerability.
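For what it's worth, the key scheme from the question, <hash(username)>/<private data key>, can be sketched in Python to show why it clusters data under an order-preserving layout. The sorted list stands in for a key space stored in key order; the SHA-1 prefix length, the user names, and the document names are assumptions. Nothing here is Riak or Cassandra API.

```python
import hashlib


def user_prefix(username):
    """Short, fixed-length hash of the user name (length chosen arbitrarily)."""
    return hashlib.sha1(username.encode("utf-8")).hexdigest()[:8]


def make_key(username, data_key):
    """Build a key of the form <hash(username)>/<private data key>."""
    return f"{user_prefix(username)}/{data_key}"


# A key space kept in sorted order, as an order-preserving partitioner would.
keys = sorted(
    make_key(user, f"doc-{i:03d}")
    for user in ("alice", "bob", "carol")
    for i in range(5)
)


def positions(username):
    """Indexes in the sorted key space that hold this user's keys."""
    prefix = user_prefix(username) + "/"
    return [i for i, key in enumerate(keys) if key.startswith(prefix)]


alice_positions = positions("alice")
```

All of a user's keys share a prefix, so they occupy one contiguous run in the sorted key space and would fall into the same range (or at most a couple of adjacent ranges). With a hashing layout the full key is hashed, and that locality disappears.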
