[Question posted by a user on YugabyteDB Community Slack]
Does YugabyteDB use hash-range partitioning or just plain old hash partitioning? It seems like hash-range partitioning would speed up the addition and removal of nodes.
The linked post explains the advantages of hash-range partitioning over hash-modulo partitioning for repartitioning:
With hash-modulo partitioning, repartitioning is a “global” operation:
each output partition depends on every input partition. With
hash-range partitioning, repartitioning is “local” and has a much
narrower set of dependencies. This can have really meaningful
consequences for reliability and performance. For example, suppose we
lose a machine that holds one of our output partitions. If we’re using
hash-modulus partitioning, we’ll have to refetch the data from all our
input partitions; with hash-range partitioning, we’ll only have to
contact one or two.
Yes, this is what YugabyteDB does. Rows are distributed to tablets by ranges of hash values (the hash value runs from 0 to 65535, that is, 0x0000 to 0xFFFF).
Example:
create table demo (a int primary key) split into 3 tablets;
This creates 3 tablets covering the hash ranges [0x0000, 0x5555), [0x5555, 0xAAAA), and [0xAAAA, 0xFFFF).
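To make the range arithmetic concrete, here is a minimal Python sketch of evenly splitting a 16-bit hash space into tablets and routing a hash value to its tablet. It is an illustration only, not YugabyteDB's implementation: the function names are hypothetical, the hash of the key is taken as an input rather than computed, and the last range is modeled as ending just past 0xFFFF so that 0xFFFF itself is covered.

```python
import bisect

HASH_SPACE = 0x10000  # 65536 possible hash values, 0x0000..0xFFFF


def split_points(num_tablets):
    """Interior boundaries that cut the hash space into num_tablets even ranges."""
    return [i * HASH_SPACE // num_tablets for i in range(1, num_tablets)]


def tablet_ranges(num_tablets):
    """Half-open [start, end) ranges; the last one ends at HASH_SPACE (assumption)."""
    bounds = [0] + split_points(num_tablets) + [HASH_SPACE]
    return list(zip(bounds[:-1], bounds[1:]))


def tablet_for_hash(hash_value, num_tablets):
    """Index of the tablet whose range contains hash_value."""
    return bisect.bisect_right(split_points(num_tablets), hash_value)


def source_tablets(new_range, old_num_tablets):
    """Old tablets overlapping a new range: the only ones it would need to read from."""
    start, end = new_range
    first = tablet_for_hash(start, old_num_tablets)
    last = tablet_for_hash(end - 1, old_num_tablets)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print([hex(p) for p in split_points(3)])
    for new_range in tablet_ranges(6):
        print(new_range, "reads from old tablets", source_tablets(new_range, 3))
```

With these assumptions, going from 3 to 6 tablets makes every new range depend on exactly one old tablet, and going from 3 to 4 makes each new range depend on at most two, which is the "local" repartitioning property described in the quoted post.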
Related
I am building a near-real-time/microbatch data application with Cassandra as the lookup store. Each incremental run has ~40K records, while the Cassandra table has about 150 million records. In each run, I need to look up the id field and get some attributes from Cassandra. These lookups can be random (with no time/region/country dependency), so there is no clear partitioning scheme.
How should I partition the Cassandra table to ensure decent performance (for microbatches running every 15-30 mins)?
Apart from partitioning, any other tips?
joinWithCassandraTable and leftJoinWithCassandraTable were designed specifically for efficient data lookup in Cassandra from Spark jobs. They fetch data by primary or partition key, and because the lookups are executed by multiple executors in parallel, they can be fast (although ~40K lookups could still take time, depending on the size of your Cassandra and Spark clusters). See the Spark Cassandra Connector (SCC) documentation for details on how to use them, but remember that these functions are available only in the RDD API. The DataStax version of the connector supports so-called "DirectJoin": efficient joins with Cassandra in the DataFrame API.
Regarding partitioning: it depends on how you perform the lookup. Does one record in Cassandra match one record in Spark? If yes, then just use this ID as the primary key (it is the same as the partition key in this case).
I understand the difference between the Cassandra partition key, composite key, and clustering key, but I can't find enough information to understand how partitions are handled in Cassandra.
In Cassandra, a range of partition keys is stored on a node, like a partition/shard. Is my understanding correct?
Does each partition key have a different file (at the system level) in the DB? If so, won't reads be slower?
If each partition key does not have a different file in the DB, how is it handled?
Data is stored in Cassandra in wide rows called partitions. Each row has a partition key that identifies its partition. To distribute the data across the cluster, Cassandra uses a partitioner, which computes a hash of the partition key; data is placed on nodes based on these hash values. The default partitioner in Cassandra is Murmur3Partitioner.
At the OS level, the data is stored in sstable files. A partition can be spread across many sstables. That's why you also need compaction, the process of consolidating those sstables so that your partitions aren't spread across a lot of them. Reducing the number of sstables a partition is spread across also improves read time. It's worth noting that sstables are immutable.
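As a rough illustration of why the number of sstables matters for reads, here is a toy Python model. It is a sketch under simplifying assumptions, not Cassandra's storage engine: each sstable is modeled as a plain dict mapping a partition key to its cells, each cell carries a (timestamp, value) pair, and tombstones, bloom filters, and indexes are ignored. The function names are hypothetical.

```python
def read_partition(sstables, partition_key):
    """Merge a partition's cells from every sstable that holds it; newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for column, (timestamp, value) in sstable.get(partition_key, {}).items():
            if column not in merged or timestamp > merged[column][0]:
                merged[column] = (timestamp, value)
    return {column: value for column, (_, value) in merged.items()}


def compact(sstables):
    """Write one new sstable with the newest cell per (partition, column).

    The input sstables are never modified, mirroring their immutability.
    """
    compacted = {}
    for sstable in sstables:
        for key, cells in sstable.items():
            target = compacted.setdefault(key, {})
            for column, (timestamp, value) in cells.items():
                if column not in target or timestamp > target[column][0]:
                    target[column] = (timestamp, value)
    return [compacted]


if __name__ == "__main__":
    # Partition "user1" is spread across two sstables: an insert, then a later update.
    sstables = [
        {"user1": {"name": (1, "Ann"), "city": (1, "Oslo")}},
        {"user1": {"city": (2, "Rome")}, "user2": {"name": (2, "Bob")}},
    ]
    print(read_partition(sstables, "user1"))           # has to consult 2 sstables
    print(read_partition(compact(sstables), "user1"))  # same result from 1 sstable
```

In this model a read before compaction has to consult every sstable containing the partition, while after compaction the same answer comes from a single file.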
I suggest reading this, especially "How Cassandra reads and writes data".
After going through multiple websites, I understand that the partition key in Cassandra is responsible for identifying the node in the cluster where the data is stored. But I don't understand which parameter determines the number of partitions (the way the keyspace determines the replication factor). Or does Cassandra create partitions based on Murmur3, with no way to specify partitions explicitly?
Thanks in Advance
By default, Cassandra uses a partitioner based on the Murmur3 hash, which generates values in the range from -2^63 to 2^63-1. Each node in the cluster is responsible for a particular range of hash values, and data whose partition key hashes to a value in that range goes to that node (or nodes). I recommend reading the documentation on Cassandra/DSE architecture; it will make things easier to understand.
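The mapping from hash value to node can be sketched in a few lines of Python. This is a simplified model, assuming one token per node (no virtual nodes), no replication, and a token that has already been computed; the Murmur3 hash itself is not implemented here, and the node names and token values are made up for the example.

```python
import bisect

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def owner_for_token(ring, token):
    """Return the node that owns a token.

    ring is a list of (node_token, node_name) sorted by node_token. In this model
    each node owns the range (previous_node_token, node_token], and tokens above
    the highest node token wrap around to the first node.
    """
    tokens = [node_token for node_token, _ in ring]
    index = bisect.bisect_left(tokens, token)
    if index == len(ring):
        index = 0  # wrap around the ring
    return ring[index][1]


if __name__ == "__main__":
    # Hypothetical 3-node ring.
    ring = [(-2**62, "node-a"), (0, "node-b"), (2**62, "node-c")]
    for token in (MIN_TOKEN, 0, 1, MAX_TOKEN):
        print(token, "->", owner_for_token(ring, token))
```

The point of the sketch is that the number of ranges follows from the tokens assigned to the nodes rather than from anything declared per table or per keyspace.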
Not sure if the question's title sounds crazy, but I thought about this and I'd like to check the "validity" of the pros and cons I imagine.
The ideal C* query in "production" targets only one partition, possibly with additional restrictions on the clustering keys. A data model should be designed with that in mind.
However, for analytics jobs, e.g. using Spark, the queries would not work like that: "searching" for specific partitions is often needed (and I could not find a way to do that properly with SparkSQL and the dataframe API) and it should not work like this: a Spark job should target many partitions to spread over all the co-located Spark/Cassandra nodes.
My data model works in such a way that acquiring my data in real time inserts partitions as a whole. My partitions are "atomic": a large analytics job with Spark will mainly correlate data within one partition (which is good as it allows data locality for the Spark executor) but my main problem is to find on which partitions I want to operate.
So, what about duplicating my partition key and having it as a clustering key as well? This would allow me to build a SASI index on it and have the "best of both worlds" at just the cost of the additional storage.
Would this be a sound strategy?
We are exploring Spark for Cassandra in order to overcome limitations of CQL.
We were initially restricted to CQL but ran into a few roadblocks compared with an RDBMS. To name a few:
To compare with > (greater than) and < (less than) on a column, the column must be part of the clustering key. Even if the column is in the clustering key, I still have to provide the partition key to use < or > on it.
Can't check for NULL on any column value
To query on any column other than the partition key, we have to create an index on that column
Can't ORDER BY a column which isn't a clustering key
GROUP BY limitations
No table joins
I am a newbie with Cassandra and end up revisiting my schema often because of these limitations.
Hence, similar to Hive/Pig for HDFS, what additional benefits does Spark give over CQL?
CQL is not a replacement for SQL. It is really designed for pulling out values from a few partition keys, usually one, and as you pointed out, it does not do any sort of aggregation or grouping and offers only very limited sorting (though Cassandra 3.0 will have UDFs and UDAs).
Here is what Spark offers over CQL:
General aggregation and querying via DataFrames and SQL, including JOINs, GROUP BY, ORDER BY, and UDFs
Significantly faster queries -- orders of magnitude faster -- if you cache the Cassandra data in memory using sqlContext.cacheTable
Integrated machine learning, statistics, graph processing, and virtually any kind of distributed computation you can imagine, using Scala, Java, Python, and R APIs
Ability to ETL in and out of Cassandra tables from and to many other data sources - including various HDFS formats, Amazon S3, DBMSes, Mongo, and most other databases today
Spark is really a completely different beast from CQL. It offers complex analytics over vast quantities of data; CQL doesn't. However, there are some limitations as well:
Spark is not good at highly concurrent queries. For that, you want to keep queries simple and use CQL to pull out a very small amount of data.
Data cached in Spark is not highly available (HA), and the cache is not updated as you write new data into C*
If you want very fast analytical queries over Cassandra with support for updates and no need to cache, then check out my project http://github.com/tuplejump/FiloDB.