I have an application where I need to extract a large amount of data from a table in a MemSQL database. I would like to be able to run extractors that read the data from each leaf node in parallel. Is there a way to do this in MemSQL?
Yes. You can connect directly to the leaf nodes and read from the partition databases (the databases named like db_0; see SHOW PARTITIONS). This isn't usually recommended, but it should generally be fine as long as you're only reading from the leaves and there are no DDL statements, clustering operations, etc. happening at the same time.
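As a rough sketch of that approach (the host names, credentials, database mydb and table events are hypothetical, and the SHOW PARTITIONS column names are worth verifying against your MemSQL version): MemSQL speaks the MySQL protocol, so a plain MySQL/JDBC connection can ask the aggregator for the partition map and then read each partition database directly from its leaf.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class LeafExtractorSketch {
        public static void main(String[] args) throws SQLException {
            // 1) On the aggregator, discover the partition map: SHOW PARTITIONS lists each
            //    partition ordinal and the leaf host/port that owns it (column names assumed).
            try (Connection agg = DriverManager.getConnection(
                         "jdbc:mysql://aggregator-host:3306/mydb", "app_user", "secret");
                 Statement st = agg.createStatement();
                 ResultSet rs = st.executeQuery("SHOW PARTITIONS")) {
                while (rs.next()) {
                    System.out.printf("partition %s lives on %s:%s%n",
                            rs.getString("Ordinal"), rs.getString("Host"), rs.getString("Port"));
                }
            }

            // 2) Connect directly to one leaf and read its partition database (e.g. mydb_0).
            //    Each extractor gets its own (leaf host, partition) pair and runs in parallel.
            try (Connection leaf = DriverManager.getConnection(
                         "jdbc:mysql://leaf-host-1:3306/", "app_user", "secret");
                 Statement st = leaf.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM mydb_0.events")) {
                while (rs.next()) {
                    // stream the row out of the extractor here
                }
            }
        }
    }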
Related
Consider a growing amount of data, and two extreme choices for placing it:
Evenly distribute all data across all nodes in the cluster
Pack the data onto as few nodes as possible
I prefer option 1 because as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when it is queried.
However, some resources state that we shouldn't query all the nodes because that will slow down the query. Why would that slow the query? Isn't that just a normal scatter-gather? They even claim this hurts linear scalability, since adding more nodes would further drag down the query.
(Maybe I am missing something about how Cassandra performs the query; some background reference would be appreciated.)
By contrast, other resources state that we should go with option 2 because it queries the fewest nodes.
Of course there are no black-and-white choices here; everything has a tradeoff.
I want to know what the real difference is between option 1 and option 2, and, as far as the network is concerned, why option 1 would be slow.
I prefer option 1 because as the volume of data grows, we can spread it across all nodes, so that each node carries the lowest possible load when it is queried.
You definitely want to go with option #1. It is also preferable in that new or replacement nodes will stream data much faster than they would in a cluster made of fewer, denser nodes.
However, some resources state that we shouldn't query all the nodes because that will slow down the query.
And those resources are absolutely correct. First of all, if you read through the resources which Alex posted above you'll discover how to build your tables so that your queries can be served by a single node. Running queries which only hit a single node is the best way around that problem.
Why would that slow the query?
Because in a distributed database environment, query time becomes network time. There are many people out there who like to run multi-key or unbound queries against Cassandra. When that happens, and the query is unable to find a single node with the data, Cassandra picks one node to designate as a "coordinator."
That node builds the result set with data from the other nodes, which means that in a 30-node cluster, one node is now pulling data from the other 29. Even assuming those requests don't time out, the likelihood that the coordinator crashes from trying to manage too much data is very high.
The bottom line is that this is one of those tradeoffs between a CA relational database and an AP partitioned row store. Build your tables to support your queries, store data together that is queried together, and Cassandra will perform just fine.
I'm currently using DB2 and planning to move to Cassandra because, as far as I know, Cassandra has better read performance than an RDBMS.
Maybe this is a stupid question, but I ran an experiment comparing read performance between DB2 and Cassandra.
I tested with 5 million records and the same table schema.
With the query SELECT * FROM customer, DB2 takes 25-30 s and Cassandra takes 40-50 s.
But with a WHERE condition, SELECT * FROM customer WHERE cusId IN (100,200,300,400,500), DB2 takes 2-3 s and Cassandra takes 3-5 ms.
Why is Cassandra faster than DB2 with the WHERE condition? And does that mean I can't prove which database is better using SELECT * FROM customer?
FYI.
Cassandra: RF=3 and CL=1, with 3 nodes, each node running on its own computer (Ubuntu VMs)
DB2: running on Windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed database. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (i.e., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (a.k.a. multi-key queries) incur the extra network time because a coordinator node is required. One node acts as the coordinator, queries all other nodes, collates the data, and returns the result set. Specifying a WHERE clause (with at least a partition key) while using a token-aware load-balancing policy performs well for two reasons:
A coordinator node is not required.
The node primarily responsible for the token range is queried, returning the result set in a single network hop.
tl;dr;
Querying Cassandra with an unbound query causes a lot of extra processing and network time that it wouldn't otherwise incur had the query been restricted with a WHERE clause.
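For illustration only, a minimal sketch with the DataStax Java driver 3.x showing a token-aware load-balancing policy together with a partition-key-restricted query; the customer table is the one from the question above, while the contact point and keyspace name ("shop") are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class TokenAwareReadSketch {
        public static void main(String[] args) {
            // Wrap the DC-aware policy in TokenAwarePolicy so the driver routes each request
            // straight to a replica owning the partition key, avoiding the coordinator hop.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            try (Session session = cluster.connect("shop")) {
                // Restricting on the partition key lets a single replica answer the query.
                ResultSet rs = session.execute(
                        "SELECT * FROM customer WHERE cusId = ?", 100);
                rs.forEach(row -> System.out.println(row.getString("cusName")));
            } finally {
                cluster.close();
            }
        }
    }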
Even for a troublesome query like a no-condition range scan, 40-50 s is pretty extreme for C*. Is the coordinator hitting GC pauses while coordinating? Can you include the code used for your test?
When you do a SELECT * against millions of records, it won't fetch them all at once; it will grab fetchSize rows at a time. If you're just iterating through this, the iterator will actually block even if you used executeAsync initially. This means that every 10k (default) records it issues a new query that you block on. The serialized nature of this takes time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use that to kick off the next page fetch while processing the current one, which would help.
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges one at a time until it has read enough (some parallelism is possible here, but its heuristic is not perfect). If it has to walk too many nodes to respond, it will be slow; this is why even empty tables can be very slow to SELECT * from, as the coordinator may serially walk every replica set. With 256 vnodes this can be very bad.
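Here is a minimal sketch of that prefetching idea with the 3.x Java driver, along the lines of the linked manual page; customer is the table from the question, the fetch size is arbitrary, and process() is a hypothetical row handler.

    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class PagedScanSketch {
        // Iterate a large result set, kicking off the next page fetch before the current
        // page is exhausted so the network round trip overlaps with row processing.
        static void scan(Session session) {
            Statement stmt = new SimpleStatement("SELECT * FROM customer").setFetchSize(1000);
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched()) {
                    rs.fetchMoreResults(); // asynchronous; returns a future for the next page
                }
                process(row);
            }
        }

        static void process(Row row) {
            // hypothetical per-row work
        }
    }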
How can I use the ByteOrderedPartitioner (BOP) to force specific key values to be partitioned according to a custom requirement? I want to force Cassandra to partition and replicate data according to custom requirements. Without introducing a custom partitioner, how far can I control this behavior, and how?
Overall: I want data starting with a particular ID to live on a predefined node, because I know that data will be accessed heavily from that node. I would also like the data to be replicated to nearby nodes.
I want data starting with a particular ID to live on a predefined node, because I know that data will be accessed heavily from that node.
It looks like you are talking about the data locality problem, which is really important in big-data computations (Spark, Hadoop, etc.). But the general approach there isn't to pin data to a specific node; it is to move the whole computation to the data itself.
Pinning data to a specific node can cause problems like:
what should you do if your node goes down?
how evenly will the data be distributed across the cluster? Will there be any hotspots or bottlenecks because of node over- or under-utilization?
how can you scale your cluster in the future?
Moving the computation to the data has no issues with these questions, but the approach you are going to choose does.
Found the answer here...
http://www.mail-archive.com/user%40cassandra.apache.org/msg14997.html
By changing the "initial_token" setting in the cassandra.yaml file, we can divide the nodes into key ranges. The partitioner will then choose the node that stores the first replica of the data, and the SimpleStrategy replication strategy will place the remaining replicas on the following nodes in the ring, so by arranging the nodes the way you want, you can exploit the replication strategy.
I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra, so I can help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency level for each query you make, so you can have consistent data if you want it. Normally you use:
ONE - Only one replica has to receive or accept the change. This means fast reads and writes, but low consistency (another machine may deliver older information while consistency has not yet been achieved).
QUORUM - A majority (more than half) of the replicas must receive or accept the change. This means somewhat slower reads and writes, but you get full consistency IF you use it for BOTH reads and writes. That's because if more than half of your replicas have the data after you insert/update/delete, then when you read from more than half of your replicas, at least one of them will have the most recent information, which is what gets delivered.
Both of these options are recommended because they avoid single points of failure. If all replicas had to accept the change and one node was down or busy, you wouldn't be able to query.
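For illustration, a small sketch of setting the consistency level per statement with the Java driver; the users table and its columns are hypothetical.

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class ConsistencySketch {
        static void writeThenRead(Session session) {
            // Write at QUORUM: a majority of replicas must acknowledge the insert.
            Statement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (42, 'alice')")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            // Read at QUORUM: combined with the QUORUM write, at least one replica
            // in the read set is guaranteed to hold the latest value.
            Statement read = new SimpleStatement("SELECT name FROM users WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            ResultSet rs = session.execute(read);
            System.out.println(rs.one().getString("name"));
        }
    }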
Pros
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (you can have machines go down and the others will pick up the work). It also does most of its management work automatically; you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about which queries will be made, and you work to normalize the schema.
With Cassandra the strategy is different. You model the tables to serve the queries, and that is because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized views can help with this, avoiding the need to write to multiple tables yourself, but it illustrates how differently things work in Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
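A rough sketch of what those two query-driven tables could look like, with hypothetical column names, issued here as CQL strings through a Java driver session:

    import com.datastax.driver.core.Session;

    public class SchemaSketch {
        static void createTables(Session session) {
            // Same product data, partitioned two ways so each query is served from one partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS products_by_store ("
              + "  store text, product_id int, name text, department text,"
              + "  PRIMARY KEY (store, product_id))");

            session.execute(
                "CREATE TABLE IF NOT EXISTS products_by_department ("
              + "  department text, product_id int, name text, store text,"
              + "  PRIMARY KEY (department, product_id))");

            // 'All products of the New York City store' now hits a single partition:
            //   SELECT * FROM products_by_store WHERE store = 'New York City';
            // and 'all products of the Computers department' hits one partition of the other table:
            //   SELECT * FROM products_by_department WHERE department = 'Computers';
        }
    }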
I'm designing a Riak cluster at the moment and wondering whether it is possible to hint to Riak that a specific bunch of keys should be placed on a single node of the cluster.
For example, there is some private data for the user that only she is able to access. This data contains ~10k documents (too large to be kept in one key/document), and to serve one page, we need to retrieve ~100 of them. It would be better to keep the whole bunch on a single node, and to run the application on the same instance, to make this faster.
AFAIK it is easy on Cassandra: just use OrderedPartitioner and keys like this: <hash(username)>/<private data key>. That way, almost all user keys will be kept on a single node.
One of the points of using Riak is that your data is replicated and evenly distributed throughout the cluster, thus improving your tolerance for network partitions and outages. Placing data on specific nodes goes against that goal and increases your vulnerability.