Sparse matrix using column store on MemSQL - singlestore

I am new to the column-store database family, and some of the concepts are not yet completely clear to me. I want to use MemSQL to store a sparse matrix.
The table would look something like this:
r_id INT,
c_id INT,
cell_data VARCHAR(10),
The Queries:
1. SELECT c_id, cell_data FROM matrix WHERE r_id=<val>; (i.e. a whole row)
2. SELECT r_id, cell_data FROM matrix WHERE c_id=<val>; (i.e. a whole column)
3. SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>; (i.e. one cell)
4. UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
5. INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
Queries 1 and 2 are about equally frequent, and so are queries 3, 4 and 5. Taken together, Q1 and Q2 are about as frequent as Q3, Q4 and Q5 combined (i.e. Q1,2 : Q3,4,5 ~= 1:1).
I realize that inserting into a column store one row at a time creates a row segment group for each insert, which degrades performance. I cannot batch the inserts. I also cannot use an in-memory row store (the matrix is too big).
I have three questions:
1. Does the issue with single-row inserts also apply to updates if only cell_data is changed (i.e. Q4)?
2. Would it be possible to have an in-memory row table in which I would do the INSERT (and UPDATE?) operations, and periodically batch its contents into the column table?
2a. How would I perform Q1 and Q2 if I need the most recent data (UNION ALL?)?
2b. Is it possible to avoid executing Q3 against both tables (which would mean two round trips?)?
3. I am concerned about the execution speed of Q1 and Q2. Is the clustered key optimal for those? I am not sure how the records would be stored with the table above.

1. Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2. Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too few rows to make a good columnar segment) is kept in an in-memory rowstore form, and read queries essentially look at a UNION ALL of that data and the column-oriented data. We then batch up this data and write it into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. yes, to see the most recent data you would use UNION ALL
2b. the data could be in either table, so you would have to query both (like for Q1,2, using UNION ALL works). This does not do two round trips, just one.
3. You can order by either r or c first in the columnstore key - r in your current schema. This makes queries for a row efficient, but queries for a column are going to be very inefficient: they may have to scan basically the full table (depending on the patterns in your data). Unfortunately, columnstore tables do not support multiple keys, so there is no good way to solve this. One potential hacky solution is to maintain two copies of your table, one with key (r, c) and one with key (c, r) - this is essentially maintaining two indexes by hand.
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3, Q4 and Q5, which are 50% of the workload), which rowstore is much better suited for than columnstore. Unfortunately, if the data doesn't fit in memory, there isn't really a good way around this other than perhaps adding more memory.
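To make points 2a and 2b concrete, here is a minimal sketch of the "query both tables in one statement" pattern. It uses Python's built-in sqlite3 module purely as a stand-in, since the UNION ALL shape is plain SQL; it says nothing about MemSQL performance. The table names matrix_cold and matrix_recent are hypothetical, and the sketch assumes each cell lives in exactly one of the two tables.

```python
import sqlite3

def build_demo():
    """Create two hypothetical tables: a big 'cold' one and a small 'recent' one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE matrix_cold (r_id INT, c_id INT, cell_data VARCHAR(10));
        CREATE TABLE matrix_recent (r_id INT, c_id INT, cell_data VARCHAR(10),
                                    PRIMARY KEY (r_id, c_id));
    """)
    conn.executemany("INSERT INTO matrix_cold VALUES (?, ?, ?)",
                     [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")])
    conn.executemany("INSERT INTO matrix_recent VALUES (?, ?, ?)",
                     [(1, 3, "d"), (2, 2, "e")])
    return conn

def fetch_row(conn, r_id):
    """Q1 over both tables: one statement, hence one round trip."""
    sql = """
        SELECT c_id, cell_data FROM matrix_cold WHERE r_id = ?
        UNION ALL
        SELECT c_id, cell_data FROM matrix_recent WHERE r_id = ?
    """
    return sorted(conn.execute(sql, (r_id, r_id)).fetchall())

def fetch_cell(conn, r_id, c_id):
    """Q3 over both tables; assumes the cell exists in at most one of them."""
    sql = """
        SELECT cell_data FROM matrix_cold WHERE r_id = ? AND c_id = ?
        UNION ALL
        SELECT cell_data FROM matrix_recent WHERE r_id = ? AND c_id = ?
    """
    row = conn.execute(sql, (r_id, c_id, r_id, c_id)).fetchone()
    return row[0] if row else None
```

If an UPDATE can leave an older copy of a cell in the cold table, the UNION ALL would return both versions, so the periodic batch job would need to move (not copy) rows for this to stay correct.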


Any downside to 'redundant' clustering column?

I've noticed that changing a regular Cassandra column to a clustering column can significantly reduce the size of the table in some circumstances.
For this example table:
state TINYINT (C)
value DOUBLE
The size of 100000 rows is estimated at 3.9 MB if state is an ordinary column, or 2.4 MB if state is a clustering column (estimated using the method in DataStax course DS220).
If you look at how the data is physically stored it isn't hard to see why this difference exists. In the former case there are two internal cells per timestamp - one for state and one for value. In the latter case state is incorporated into the cell key, so there is just one cell per timestamp, and the timestamp (part of the cell key) is stored only once.
The second clustering column does not create any new restrictions on what can be queried. SELECT * FROM table WHERE id=? AND time>=? AND time<? is still fine.
It seems like a win-win situation. Are there any downsides, in particular, performance-wise?
(All I can think of is that if state is a regular column then it can be omitted from an INSERT and the state internal cell will never be created. I imagine if state is a regular column and usually omitted then the table will be very slightly smaller than if state is a clustering column.)
Additional comments
It's worth noting that in the definition above you can't filter by state without an equality filter on time, making it not very useful for filtering state. And if you put the state column above time to resolve this then yes you can filter by state and time inequality, but if you want all states (IN clause) then the rows are returned ordered by state first, then time, which again is not very useful.
I would think the main difference here is that if it's a clustering column it must be provided with INSERTs as it's part of the primary key. Also, as it's part of the primary key, you can't update it either, which could be problematic for some tables. If you don't have any concerns about either of those two, I don't see any reason why you couldn't add it.
1) You create a row per state. Your data model would have to realize and understand that. You could potentially create two rows with the different states for the same id, time, which the original model disallows.
2) If you delete, you'll either need to specify state or you'll be creating Range Tombstones (range deletes, because you're deleting all rows for a given id and time, but it may be a range of states). Range tombstones are especially expensive (on the read path) in 2.1, and aren't properly accounted for in TombstoneOverwhelming exception handlers until a fairly recent version of Cassandra, so avoiding range tombstones is usually a good idea, unless you actually need them.
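Point 1 can be illustrated with a toy model of upsert semantics. This is only a sketch in plain Python, not Cassandra code: a dict keyed by the primary-key columns stands in for a table, and the column names id, time, state and value are taken from the question. The helper upsert and the sample data are hypothetical.

```python
def upsert(table, key_columns, row):
    """Write a row: same primary key means the existing row is updated in place."""
    key = tuple(row[c] for c in key_columns)
    table.setdefault(key, {}).update(row)
    return table

ORIGINAL_KEY = ("id", "time")           # state is a regular column
WIDENED_KEY = ("id", "time", "state")   # state is a clustering column

# Two writes for the same id and time, differing only in state.
writes = [
    {"id": "s1", "time": 100, "state": 0, "value": 1.5},
    {"id": "s1", "time": 100, "state": 1, "value": 2.5},
]

original, widened = {}, {}
for w in writes:
    upsert(original, ORIGINAL_KEY, w)
    upsert(widened, WIDENED_KEY, w)

# The original model keeps one row (the second write wins);
# the widened key keeps both rows side by side.
```

With the original key, a changed state overwrites the earlier row; with state in the key, the application has to decide for itself which of the two rows is current.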

Cassandra - Same partition key in different tables - when it is right?

I modeled my Cassandra schema so that I have a couple of tables with the same partition key, a Uuid.
Each table has its partition key plus other columns representing the data for a specific query I would like to run.
For example, table 1 has the Uuid and a column for its status (no clustering keys in this table), and table 2 contains the same Uuid (also without clustering keys) but with different columns representing the data for this Uuid.
Is this the right modeling? Is it wrong to duplicate the same partition key across tables so that each table holds the relevant columns for a specific use case? Or is it preferable to use only one table, query it, and pick out the relevant data for the specific use case in code?
There's nothing wrong with this modeling. Whether it is better, or worse, than the obvious alternative of having just one table with both pieces of data, depends on your workload:
For example, if you commonly need to read both status and data columns of the same uuid, then these reads will be more efficient if both things are in the same table, which only needs to be looked up once. If you always read just one but not both, then reads will be more efficient from separate tables. Also, if this workload is not read-mostly but rather write-mostly, then writing to just one table instead of two will be more efficient.

Trying to visualize how wide and skinny rows are laid out

Can someone show me how the data is laid out when you design your tables for wide vs. skinny rows?
I'm not sure I fully grasp how the data is spread out with a "wide" row.
Is there a difference in how you can fetch the data, or is it the same, i.e. if the data is ordered, does it matter whether it is organized vertically (skinny) or horizontally (wide)?
Is a table considered wide if the primary key consists of more than one column?
Or will a table have wide rows only if the partition key is a composite partition key?
Wide... Skinny... Terms that make your head explode... I prefer to oversimplify the thing as such:
All the tables have wide rows
You simply need to take care of how wide the rows get
This allows me to think of it as follows (mangling the C* terminology a bit):
        Number of RECORDS in a partition
1 <--------------------------------------- ... 2 billion
^                                              ^
Skinny rows                                    wide rows
The fewer records in a partition, the skinnier the "partition", and vice versa.
When designing for C* I always keep in mind a couple of things:
I want to use "skinny partitions" when my data can be fetched with one query and is fully contained in one record of one partition. A typical example is something along the lines of SELECT * FROM table WHERE username = 'xmas79'; where the table has a primary key of the form PRIMARY KEY (username), which lets me get all the data belonging to a particular username.
I want to use "wide rows" when my data can be fetched with one query and is fully contained in multiple records of one partition. Typical examples are range queries like SELECT * FROM table WHERE sensor = 'pressure' AND time >= '2016-09-22';, where the table has a primary key of the form PRIMARY KEY (sensor, time).
So, the first approach is for one-shot queries and the second is for range queries. Beware that the second approach has the (major) drawback that you can keep adding data to the partition, and it will get wider and wider, hurting performance.
In order to control how wide your partitions are, you need to add something to the partition key. In the sensor example above, provided you don't violate your requirements, you can "group" measurements by date, e.g. split the measurements into day-by-day groups, making the primary key PRIMARY KEY ((sensor, day), time), where the partition key has become (sensor, day). With this approach you have full (well, let's say good at least) control over the width of your partitions.
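The day-by-day grouping can be sketched on the client side as follows. This is an illustration in plain Python, not driver code: the helper names day_bucket, partition_key and partitions_for_range are hypothetical, and the sketch assumes timezone-aware timestamps bucketed by UTC calendar day.

```python
from datetime import datetime, timedelta, timezone

def day_bucket(ts):
    """Return the UTC calendar day of a timezone-aware timestamp, e.g. '2016-09-22'."""
    return ts.astimezone(timezone.utc).date().isoformat()

def partition_key(sensor, ts):
    """Value of the composite partition key (sensor, day) for one measurement."""
    return (sensor, day_bucket(ts))

def partitions_for_range(sensor, start, end):
    """List every (sensor, day) partition a time-range query has to touch."""
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    days = (last - first).days
    return [(sensor, (first + timedelta(days=i)).isoformat())
            for i in range(days + 1)]
```

This shows the trade-off: each partition is bounded to one day of data, but a range query that spans several days now has to issue one query per partition returned by partitions_for_range.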
You only need to find a good compromise between your query capabilities and the desired performance.
I suggest these three readings for further investigation on the details:
1. Wide Rows in Cassandra CQL
2. Does CQL support dynamic columns / wide rows?
3. CQL3 for Cassandra experts
Beware that in the first one there's a mistake in the second-to-last picture: the primary key should be
PRIMARY KEY ((user_id, tweet_id))
with double parentheses around the columns instead of single ones.

Cassandra schema design: should more columns go into partition vs. cluster?

In my case I have a table structure like this:
table_1 {
entity_uuid text
,fk1_uuid text
,fk2_uuid text
,int_timestamp bigint
,cnt counter
,primary key (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
}
The text columns are made up of random strings. However, only entity_uuid is truly random and evenly distributed. fk1_uuid and fk2_uuid have much lower cardinality and may be sparse (sometimes fk1_uuid=null or fk2_uuid=null).
In this case, I can either define only entity_uuid as the partition key or entity_uuid, fk1_uuid, fk2_uuid combination as the partition key.
And this is a LOOKUP-type of table, meaning we don't plan to do any aggregations/slice-dice based on this table. And the rows will be rotated out since we will be inserting with TTL defined for each row.
Can someone enlighten me:
What is the downside of having too many partition keys with very few rows in each? Is there a hit/cost at the storage engine level?
My understanding is that clustering keys are ALWAYS sorted. Does that mean having text columns in the clustering key will always incur a tree-balancing cost?
Well, you can tell where my heart lies by now. However, when all rows in a partition have TTL-ed out, does that partition still live on, or will it be removed by the DB engine as well?
The major and possibly most significant difference between having big partitions and small partitions is the ability to do range scans. If you want to be able to do scan queries like
SELECT * FROM table_1 WHERE entity_uuid = x AND fk1_uuid > something
then you'll need the clustering column for performance; otherwise this query would be difficult (a multi-get at best, a full table scan at worst). I've never heard of any case where having too many partitions is a drag on performance, but having too wide a partition (i.e. lots of clustering column values) can cause issues when you get into the 1B+ cell range.
In terms of the cost of clustering, it is basically free at write time (an in-memory sort is very, very fast), but you can incur costs at read time as partitions become spread among various SSTables. Small partitions that are written once will not incur the merge penalty, since they will most likely exist in only one SSTable.
TTL'd partitions will be removed but be sure to read up on GC_GRACE_SECONDS to see how Cassandra actually deals with removing data.
Everything is dependent on your read/write pattern
No Range Scans? No need for clustering keys
Yes Range Scans? Clustering keys a must

"Capped collections" in Cassandra

Cassandra doesn't have capped collections (or row size limits), but one way of simulating them is to use an offline mapreduce job to clean up extra entries. Would it be better to have a second table that stores row counts for the primary keys of another table? The downside is that you have to scan through the entire row_count table, since counters aren't indexable. Or would it be faster to just scan over the backing table with the real data?
Or is there another technique I should look into?
Edit: I found this question: "Columns count vs counter column performance". Row counts go over all the data, so I'm leaning away from that.
