Can someone explain and provide the document that explains the behavior of
select * from <keyspace.table>
Let's assume I have 5 node cluster, how does Cassandra DataStax Driver behave when such queries are being issued?
(Fetchsize was set to 500)
Is this a proper way to pull data ? Does it cause any performance issues?
No, that's really a very bad way to pull data. Cassandra shines when it fetches the data by at least partition key (that identifies a server that holds the actual data). When you are doing the select * from table, request is sent to coordinating node, that will need to pull all data from all servers and send via that coordinating node, overloading it, and most probably lead to the timeout if you have enough data in the cluster.
If you really need to perform full fetch of the data from cluster, it's better to use something like Spark Cassandra Connector that read data by token ranges, fetching the data directly from nodes that are holding the data, and doing this in parallel. You can of course implement the token range scan in Java driver, something like this, but it will require more work on your side, comparing to use of Spark.
Related
I am currently managing a percona xtradb cluster composed by 5 nodes, that hadle milions of insert every day. Write performance are very good but reading is not so fast, specially when i request a big dataset.
The record inserted are sensors time series.
I would like to try apache cassandra to replace percona cluster, but i don't understand how data reading works. I am looking for something able to split query around all the nodes and read in parallel from more than one node.
I know that cassandra sharding can have shard replicas.
If i have 5 nodes and i set a replica factor of 5, does reading will be 5x faster?
Cassandra read path
The read request initiated by a client is sent over to a coordinator node which checks the partitioner what are the replicas responsible for the data and if the consistency level is met.
The coordinator will check is it is responsible for the data. If yes, will satisfy the request. If no, it will send the request to fastest answering replica (this is determined using the dynamic snitch). Also, a request digest is sent over to the other replicas.
The node will compare the returning data digests and if all are the same and the consistency level has been met, the data is returned from the fastest answering replica. If the digests are not the same, the coordinator will issue some read repair operations.
On the node there are a few steps performed: check row cache, check memtables, check sstables. More information: How is data read? and ReadPathForUsers.
Load balancing queries
Since you have a replication factor that is equal to the number of nodes, this means that each node will hold all of your data. So, when a coordinator node will receive a read query it will satisfy it from itself. In particular(if you would use a LOCAL_ONE consistency level, the request will be pretty fast).
The client drivers implement the load balancing policies, which means that on your client you can configure how the queries will be spread around the cluster. Some more reading - ClientRequestsRead
If i have 5 nodes and i set a replica factor of 5, does reading will be 5x faster?
No. It means you will have up to 5 copies of the data to ensure that your query can be satisfied when nodes are down. Cassandra does not divide up the work for the read. Instead it tries to force you to design your data in a way that makes the reads efficient and fast.
Best way to read cassandra is by making sure that each query you generate hits cassandra partition. Which means the first part of your simple primary(x,y,z) key and first bracket of compound ((x,y),z) primary key are provided as query parameters.
This goes back to cassandra table design principle of having a table design by your query needs.
Replication is about copies of data and Partitioning is about distributing data.
https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archPartitionerAbout.html
some references about cassandra modelling,
https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
it is recommended to have 100 MB partitions but not compulsory.
You can use cassandra-stress utility to have look report of how your reads and writes look.
How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical to do exactly that (as you say). It does it efficiently and is used often so pretty reliable. Cassandra is not really designed for OLAP workloads but things like spark connector help bridge the gap. DataStax Enterprise might have some more options available to you but I am not sure their current offerings.
You can still just query and page through the whole data set with normal CQL queries, its just not as fast. You can even use ALLOW FILTERING just be wary as its very expensive and can impact your cluster (creating a separate dc for the workload and using LOCOL_CL queries against it helps). You will probably also in that scenario add a < token() and > token() to the where clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (ie if doing time slices, put things in a partition bucketed by time and clustering key timeuuids so its sequential read for each part of time).
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.
I'm current using DB2 and planning to use cassandra because as i know cassandra have a read performance greater than RDBMS.
May be this is a stupid question but I have experiment that compare read performance between DB2 and Cassandra.
Testing with 5 million records and same table schema.
With query SELECT * FROM customer. DB2 using 25-30s and Cassandra using 40-50s.
But query with where condition SELECT * FROM customer WHERE cusId IN (100,200,300,400,500) DB2 using 2-3s and Cassandra using 3-5ms.
Why Cassandra faster than DB2 with where condition? So i can't prove which database is greater with SELECT * FROM customer right?
FYI.
Cassandra: RF=3 and CL=1 with 3 nodes each node run on 3 computers (VM-Ubuntu)
DB2: Run on windows
Table schema:
cusId int PRIMARY KEY, cusName varchar
If you look at the types of problems that Cassandra is good at solving, then the reasons behind why unbound ("Select All") queries suck become quite apparent.
Cassandra was designed to be a distributed data base. In many Cassandra storage patterns, the number of nodes is greater than the replication factor (I.E., not all nodes contain all of the data). Therefore, limiting the number of network hops becomes essential to modeling high-performing queries. Cassandra performs very well with specific queries (which utilize the partition/clustering key structure), because it can quickly locate the node primarily responsible for the data.
Unbound queries (A.K.A. multi-key queries) incur the extra network time because a coordinator node is required. So one node acts as the coordinator, queries all other nodes, collates data, and returns the result set. Specifying a WHERE clause (with at least a partition key) and while using a "Token Aware" load balancing policy, performs well for two reasons:
A coordinator node is not required.
The node primarily responsible for the range is queried, returning the result set in a single netowrk hop.
tl;dr;
Querying Cassandra with an unbound query, causes it to incur a lot of extra processing and network time that it normally wouldn't have to do, had the query been specified with a WHERE clause.
Even as a troublesome query like a no-condition range query, 40-50s is pretty extreme for C*. Is the coordinator hitting GCs with the coordination? Can you include code used for your test?
When you make a select * vs millions of records, it wont fetch them all at once, it will grab the fetchSize at a time. If your just iterating through this, the iterator will actually block even if you used executeAsync initially. This means that every 10k (default) records it will issue a new query that you will block on. The serialized nature of this will take time just from a network perspective. http://docs.datastax.com/en/developer/java-driver/3.1/manual/async/#async-paging explains how to do it in a non-blocking way. You can use this to to kick off the next page fetch while processing the current which would help.
Decreasing the limit or fetch size could also help, since the coordinator may walk token ranges (parallelism is possible here but its heuristic is not perfect) one at a time until it has read enough. If it has to walk too many nodes to respond it will be slow, this is why empty tables can be very slow to do a select * on, it may serially walk every replica set. With 256 vnodes this can be very bad.
I am trying to sync my Spark database on S3 with an older Oracle database via daily ETL Spark job. I am trying to understand just what Spark does when it connects to a RDS like Oracle to fetch data.
Does it only grab the data that at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, it will only grab data UP to that point in time)? Essentially saying that any new data or updates at 2/2 17:00:01 will not be obtained from the data fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. It means that every execution might see different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
Exacted set of parameters you use with JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction but there methods which enable parallel reads based on column ranges or predicates. If one of these is used Spark will fetch data using multiple transactions, and each one can observe different state of your database.
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful it is much harder to implement if you're working with per-existing architecture.
How to use the ByteOrderedPartitioner (BOP) to force specific key values to be partitioned according to a custom requirement. I want to force Cassandra to partition and replicate data according to custom requirements, without introducing a custom partitioner how far I can control this behavior and how ?
Overall: I want my data starting with particular ID to be at a predefined node because I know data will be accessed from that node heavily. Also like the data to be replicated to nearby nodes.
I want my data starting with particular ID to be at a predefined node because I know data will be accessed from that node heavily.
Looks like that you talk about data locality problem, which is really important in bigdata-like computations (Spark, Hadoop, etc.). But the general approach for that isn't to pin data to specific node, but just to move your whole computation to the data itself.
Pinning data to specific node may cause problems like:
what should you do if your node goes down?
how evenly will the data be distributed among the cluster? Will be there any hotspots/bottlenecks because of node over(under)-utilization?
how can you scale your cluster in future?
Moving computation to data has no issues with these questions, but the approach you going to choose - has.
Found the answer here...
http://www.mail-archive.com/user%40cassandra.apache.org/msg14997.html
Changing the setting "initial_token" in cassandra.yaml file we can let the nodes to be divided into key ranges and partitioning will choose the node which is going to save the first replication of the data and strategy class SimpleStrategy will add the replica to proceeding nodes so by arranging the nodes the way you want you can exploit the replication strategy.