I'm trying to figure out if Cassandra's super columns are useful.
If I understand how Cassandra works (which could be wrong), if I want to read or update a super column, I have to read or write everything in the super column. This means I need to write some mapping code between my object(s) and my super column(s).
Wouldn't it be more efficient for me to simply serialize my object into a regular Cassandra column? It sounds to me like that's exactly what Cassandra does with super columns, only with extra steps.
If I want to read or update a super column, I have to read or write everything in the super column
Not quite: supercolumns are read as a unit, but subcolumns may be updated independently.
So yes, they do have a use, but no, it is not a very common one and in many cases you can indeed just serialize an object into a standard column.
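To make the "just serialize it into one standard column" option concrete, here is a minimal sketch from Scala. It uses the CQL-based DataStax Java driver purely as an illustration; the keyspace, table, UserProfile class, and choice of plain Java serialization are all assumptions, not anything Cassandra prescribes.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer
import com.datastax.driver.core.Cluster

// Hypothetical object to persist as a single blob column.
case class UserProfile(name: String, age: Int)

// Serialize the object to bytes; any serializer (JSON, protobuf, ...) would work just as well.
def toBlob(obj: AnyRef): ByteBuffer = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(obj)
  out.close()
  ByteBuffer.wrap(buffer.toByteArray)
}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("demo") // hypothetical keyspace

// One regular column ("payload") holds the whole serialized object.
session.execute(
  "CREATE TABLE IF NOT EXISTS profiles (user_id text PRIMARY KEY, payload blob)")
val insert = session.prepare("INSERT INTO profiles (user_id, payload) VALUES (?, ?)")
session.execute(insert.bind("user-42", toBlob(UserProfile("Ada", 36))))

cluster.close()
```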
In Spark, is there a way of adding a column to a DataFrame by means of a join, while guaranteeing that the left-hand side remains completely unchanged?
This is what I have looked at so far:
leftOuterJoin... but that risks duplicating rows, so one would have to be super careful to make sure that there are no duplicate keys on the right. Not exactly robust or performant, if the only way to guarantee safety is to dedupe before the join (a rough sketch of this dedupe-then-join variant is shown below).
There is a data structure that seems to guarantee no duplicate keys: PairRDD. It has a nice method for looking up a key in the key-value table: YYY.lookup("key"). One might therefore expect to be able to do .withColumn("newcolumn", udf((key:String) => YYY.lookup(key)).apply(keyColumn)), but it seems that UDFs cannot do this because they cannot access the sqlContext, which is apparently needed for the lookup. If there were a way of using withColumn I would be extremely happy, because it has the right semantics.
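For reference, a rough sketch of the dedupe-then-left-outer-join variant mentioned above; the join column "key" and the helper name are hypothetical:

```scala
import org.apache.spark.sql.DataFrame

// 'left' must come back row-for-row unchanged; 'right' carries the extra column, keyed by "key".
def addColumnSafely(left: DataFrame, right: DataFrame): DataFrame = {
  // Deduplicating the right-hand side first guarantees at most one match per key,
  // so the left outer join cannot multiply rows on the left.
  val rightUnique = right.dropDuplicates(Seq("key"))
  left.join(rightUnique, left("key") === rightUnique("key"), "left_outer")
    .drop(rightUnique("key")) // keep a single key column in the result
}
```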
Many thanks in advance!
I have a 40-column RDBMS table which I am porting to Cassandra.
Using the estimator at http://docs.datastax.com/en/cassandra/2.1/cassandra/planning/architecturePlanningUserData_t.html, I created an Excel sheet with column names, data types, the size of each column, etc.
The Cassandra-specific overhead for each RDBMS row is a whopping 1 KB, while the actual data is only 192 bytes.
Since the overheads are proportional to the number of columns, I thought it would be much better to just create a UDT for the fields that are not part of the primary key. That way, I would incur the column overhead only once.
Also, I don't intend to run queries on the inner fields of the UDT. Even if I did want that, Cassandra has very limited querying features for non-PK fields.
Is this a good strategy to adopt? Are there any pitfalls? Are all these overheads easily eliminated by compression or some other internal operation?
On the surface, this isn't a bad idea at all. You are essentially abstracting your data by another level, but in a way that is still manageable and meets your needs. It's actually good thinking.
I have a 40-column RDBMS table
This part slightly worries me. Essentially, you'd be creating a UDT with 40 properties. Not a huge deal in and of itself. Cassandra should handle that just fine.
But while you may not be querying on the inner fields of the UDT, you need to ask yourself how often you plan to update them. Cassandra stores UDTs as "frozen" types in a single column. This is important to understand for two reasons:
You cannot read a single property of a UDT without reading all properties of the UDT.
Likewise, you cannot update a single property in a UDT without rewriting all of them, either.
So you should keep that in mind while designing your application. As long as you won't be writing frequent updates to individual properties of the UDT, this should be a good solution for you.
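As a hedged sketch of what that might look like (the keyspace, type, table, and field names are all invented), using the DataStax Java driver from Scala:

```scala
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("demo") // hypothetical keyspace

// Group the non-key fields into one user-defined type...
session.execute("CREATE TYPE IF NOT EXISTS payload (field1 text, field2 int)")

// ...stored as a single frozen column, so the per-column overhead is paid once per row.
session.execute(
  """CREATE TABLE IF NOT EXISTS wide_rows (
    |  id uuid PRIMARY KEY,
    |  data frozen<payload>
    |)""".stripMargin)

// Because the UDT is frozen, an update has to rewrite the whole value,
// even if only one field actually changed:
session.execute(
  "UPDATE wide_rows SET data = {field1: 'a', field2: 1} " +
  "WHERE id = 00000000-0000-0000-0000-000000000001")

cluster.close()
```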
Context:
Aggregation by key, with potentially millions of rows per key.
Adding features to each row. To do that we have to know the previous row (by key and by timestamp). For the moment we use groupByKey and do the work on the Iterable.
We tried:
Add more memory to executor/driver
Change the number of partitions
Changing the memory allotted to the executor/driver worked, but only for 10k or 100k rows per key. What about the millions of rows per key that could happen in the future?
It seems that there is some work on this kind of issue: https://github.com/apache/spark/pull/1977
But it's specific to PySpark and not to the Scala API that we currently use.
My questions are:
Is it better to wait for new features that handle this kind of issue, knowing that I would have to work specifically in PySpark?
Another solution would be to implement the workflow differently, using some specific keys and values to handle my needs. Is there any design pattern for that, for example for the need to have the previous row by key and by timestamp in order to add features?
I think the change in question just makes PySpark work more like the main API. You probably don't want to design a workflow that requires a huge number of values per key, no matter what. There isn't a fix other than designing it differently.
I haven't tried this, and am only fairly sure this behavior is guaranteed, but maybe you can sortBy timestamp on the whole data set, and then foldByKey. You provide a function that merges a previous value into the next value. The fold should then encounter the data in timestamp order, so you see rows t and t+1 each time, and each time can just return row t+1 after augmenting it however you like.
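Roughly, that idea might look like the sketch below. The Event type, its fields, and the delta "feature" are invented, and the sketch inherits the same unverified assumption that foldByKey sees values in the order produced by sortBy; note also that the fold leaves you with one (final) augmented row per key, so per-row output would need the accumulator to carry results along.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical row type and "feature": the delta against the previous row for the same key.
case class Event(key: String, ts: Long, value: Double, delta: Double = 0.0)

def lastAugmented(events: RDD[Event]): RDD[(String, Event)] =
  events
    .sortBy(e => (e.key, e.ts))                   // order the whole data set by key, then timestamp
    .map(e => (e.key, e))
    .foldByKey(Event("", Long.MinValue, 0.0)) { (prev, next) =>
      // merge the previous row into the next one, as described above
      if (prev.ts == Long.MinValue) next
      else next.copy(delta = next.value - prev.value)
    }
```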
I understand that Cassandra resolves write conflicts based on the timestamp of every column's key-value pair (last write wins). But is there a way we can override this behavior by manual intervention?
Thanks,
Chethan
No.
Cassandra only does LWW. This may seem simplistic, but Cassandra's Bigtable-esque data model makes it less of an issue than in a pure key/value store like Riak, for example. When all you have is an opaque value for a key, you want to be able to keep conflicting writes so that you can resolve them later. Since Cassandra's rows aren't opaque, but more like a sorted map, LWW is almost always enough. With Cassandra you can add new cells to a row from multiple clients without having to worry about conflicts. It's only when multiple clients write to the same cell that there is an issue, and in that case you usually can (and probably should) model your way around it.
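To illustrate the modeling point, a small hedged sketch (the table and columns are invented, and 'session' is assumed to be an already-open DataStax driver session):

```scala
import com.datastax.driver.core.Session

def illustrateLww(session: Session): Unit = {
  session.execute(
    "CREATE TABLE IF NOT EXISTS user_profile (user_id text PRIMARY KEY, email text, phone text)")

  // Two clients writing *different* cells of the same row never conflict:
  session.execute("UPDATE user_profile SET email = 'a@example.com' WHERE user_id = 'u1'") // client A
  session.execute("UPDATE user_profile SET phone = '555-0100' WHERE user_id = 'u1'")      // client B

  // Only writes to the *same* cell compete, and the write with the later timestamp wins:
  session.execute("UPDATE user_profile SET email = 'b@example.com' WHERE user_id = 'u1'")
}
```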
To gain read performance, data models in Cassandra are denormalized to a certain extent.
Since denormalization produces duplicated records, how do you avoid data anomalies?
As far as I know, there are two ways to keep your data in different column families (CFs) consistent:
Whenever you update a row/column value in one CF, also update the corresponding row/column in the other CF.
Run some background batch process to reconcile the column values and make them consistent.
Which way you want to use may depend on how you modeled your data, and on how strict your application's requirements are for data consistency among the different CFs.
jsfeng's answer is good on general principles, but a more specific answer is to use atomic batches: www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2
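For example, both denormalized tables can be written in one logged batch, which guarantees that either all of the mutations are (eventually) applied or none are; the table and column names below are invented, and 'session' is assumed to be an already-open DataStax driver session:

```scala
import com.datastax.driver.core.Session

def updateBothViews(session: Session): Unit =
  session.execute(
    """BEGIN BATCH
      |  INSERT INTO users_by_id    (user_id, email)   VALUES ('u1', 'a@example.com');
      |  INSERT INTO users_by_email (email,   user_id) VALUES ('a@example.com', 'u1');
      |APPLY BATCH""".stripMargin)
```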